
A FPGA Implementation of Large Restricted Boltzmann Machines

by

Charles Lo

Supervisor: Paul Chow
April 2010

Abstract

A FPGA Implementation of Large Restricted Boltzmann Machines
Charles Lo
Engineering Science 2010

Restricted Boltzmann Machines (RBMs) are a type of Artificial Neural Network and the fundamental building blocks of Deep Belief Networks (DBNs) [1]. DBNs have been successfully applied to a number of machine learning problems [2, 3, 1]. However, the O(n^2) complexity of training a RBM presents a serious impediment to their use in large applications. Attempts have been made to accelerate the process using custom FPGA hardware [4, 5], but no implementation has been demonstrated to run RBMs with the number of nodes necessary for real-world applications. This thesis builds upon a virtualized FPGA architecture presented by Ly et al. [4] with the goal of investigating its scalability towards large RBMs. The virtualized architecture time-multiplexes the hardware resources of a single FPGA to implement large virtual RBMs. To maintain the performance gain of the custom hardware in the presence of context switches, a number of approaches were used: the architecture was ported to a faster, more modern FPGA; the data representation was reduced from 32 bits to 16 bits to increase communication throughput; and coarse-grain parallelism was provided by extending the architecture to four FPGAs. A sequential benchmark written in C was used to test the performance of the architecture. The analysis shows a strong dependence of performance on the communication overhead between the supervising microprocessor and the hardware cores. Although very little speedup is possible with the implementation presented, this thesis provides a direction for further improvements to the architecture.

Acknowledgements

I would like to express my gratitude to Professor Paul Chow for giving me the opportunity to work on this project as well as for his guidance over the course of this thesis. I would also like to thank Daniel Ly for helping to define the direction of this thesis and always being available to answer my questions. Finally, I am grateful to Chris Madill, Arun Patel, Manuel Saldaña, Geng Liu and Chu Pang for their assistance during the past year.

Contents

1 Introduction
2 Background
  2.1 Restricted Boltzmann Machine Operation
    2.1.1 Structure
    2.1.2 Alternating Gibbs Sampling
    2.1.3 Energy
    2.1.4 Learning Rules
    2.1.5 Batch Learning
    2.1.6 Summary
    2.1.7 Deep Belief Networks
  2.2 Methods for accelerating Restricted Boltzmann Machines
3 Virtualized FPGA Architecture
  3.1 Partitioning
  3.2 Computational Cores
    3.2.1 Restricted Boltzmann Machine Core
    3.2.2 Energy Accumulator Core and Node Select Core
  3.3 Message Passing Interface
4 Large Restricted Boltzmann Machine Architecture
  4.1 Investigation of Data Bit Widths
  4.2 Memory and Communication Considerations
    4.2.1 Data Storage
    4.2.2 Communication Overhead
  4.3 Extension to Four FPGAs
5 Results and Analysis
  5.1 Test Methods
    5.1.1 Test Setup
    5.1.2 Test Metric
  5.2 Results
    5.2.1 Batch Size vs. Speedup
    5.2.2 Intrinsic RBM Size
    5.2.3 Summary
6 Conclusion
  6.1 Conclusions
  6.2 Future Work
    6.2.1 Weight Matrix Caching
    6.2.2 Distributed Energy Accumulator Core Structure
Bibliography
A Outline of MicroBlaze Operation

List of Tables

5.1 Summary of Performance Measurements

List of Figures

2.1 Structure of a 3x3 Restricted Boltzmann Machine
2.2 Sigmoid Function
2.3 Layout of a Deep Belief Network
3.1 Weight Distribution in BRAM
3.2 Structure of the Virtualized Restricted Boltzmann Machine Architecture
4.1 MicroBlaze PLB Connectivity
4.2 Weight distribution of eight partitions among four FPGAs
4.3 Overall Layout in Four FPGA System. All MPI Ranks are interconnected
5.1 Mini-batch size vs. speedup
5.2 Mini-batch size vs. speedup without node calculation
5.3 Mini-batch size vs. speedup for a virtual 512x512 network with intrinsic RBM sizes of 64 and 128

Chapter 1
Introduction

The paradigm of machine learning deals with methods that allow a computer to extract complex patterns underlying data. The applications of such methods are extensive, including visual pattern recognition, speech recognition and video game artificial intelligence.

One popular method of machine learning is the use of artificial neural networks (ANNs). Such networks roughly model the structure of the biological neural networks in the brain, in that they consist of many simple neurons operating in parallel, connected together through weighted relationships. The activation of each neuron, dependent on the weights and the states of connected neurons, determines the reaction of the network to some input. By controlling the value of the weights, the network can be trained to recognize certain patterns or features of a dataset.

Many different types of ANNs exist with different network topologies, activation functions and learning algorithms. A particularly popular architecture is the Restricted Boltzmann Machine (RBM), a stochastic, generative model that has proven to perform well in problems such as face recognition [6]. Recently, it has been shown that when several RBMs are stacked together to form a Deep Belief Network (DBN), an efficient learning algorithm exists to train the entire network [1]. DBNs have the benefit of being able to learn more complex features and have been applied to problems of generating facial expressions [2], semantic hashing of documents [3] and recognition of handwritten digits [1].

Although the learning algorithm is relatively efficient, training the large networks required for the real-world applications above can still take several days or weeks on a general purpose desktop computer [6]. The parallel nature of the RBM architecture makes it very tractable by hardware implementations, and several groups have created FPGA and GPU based RBM solutions providing much needed speed-up [7, 8, 5]. In particular, Ly et al.'s FPGA architecture has produced a 145x speed-up relative to a desktop PC [7]. However, it has only implemented relatively small RBM networks of 256x256 neurons, whereas real world applications require much larger networks. For example, the DBN used to recognize handwritten digits [1] contained a RBM of size 2000x510.

The goal of my thesis is to scale up Ly et al.'s FPGA architecture [4] to be capable of handling the thousands of neurons necessary in real world DBN applications while maintaining maximum performance. The main bottleneck limiting the current implementation size of the FPGA architecture is the size of the weight matrix. This data structure is necessarily large since each node must be connected through a weight to all of the nodes in the next layer due to the bipartite graph organization of the RBM. To allow for larger networks, my project will first involve adapting the FPGA architecture to a larger, faster FPGA platform. This allows for the possibility of higher clock speeds as well as better interconnects between FPGAs and thus greater performance. In addition, I will investigate the effect of decreasing the bit width of the weights. This could allow more weights to be stored on-chip and an increase in communication bandwidth between computational cores, provided that the network is trainable at lower precision. Finally, by time-multiplexing the resources of four FPGAs, I hope to accelerate the training performance of arbitrary size networks.

Chapter 2
Background

2.1 Restricted Boltzmann Machine Operation

Artificial Neural Networks (ANNs) are models of biological neural networks that rely on the interactions between simple units called neurons or nodes to perform computations. By modifying connections between nodes, ANNs may be taught to model patterns in a set of training data [9]. A Restricted Boltzmann Machine (RBM) [10, 11] is a type of ANN that has become recently popular due to its role as a building block in Deep Belief Networks (DBNs). An interesting property of RBMs is that they are taught to reproduce the training data they are given. The internal model they create allows them to generate new data statistically similar to the training set, and thus a RBM is said to be a generative neural network.

2.1.1 Structure

The RBM consists of two layers of nodes: a visible layer representing the input to the network, and a hidden layer. Each node is connected to all of the nodes of the opposite layer through a weighted connection. The real valued weights on the connections are the learning parameters of the RBM and it is through their adjustment that the network may be trained. We will denote the weight connecting visible node i to hidden node j as w_{i,j}.

[Figure 2.1: Structure of a 3x3 Restricted Boltzmann Machine]

The restriction that nodes of the same type are not interconnected allows for the development of a fast learning algorithm and is one of the properties that separates the RBM from general Boltzmann Machines. Generally, the node states are binary valued. However, some applications benefit from having real valued visible nodes to represent data such as greyscale images. With this topology in mind, we can write the elements of a RBM in matrix notation:

W = \begin{bmatrix} w_{0,0} & \cdots & w_{0,J-1} \\ \vdots & \ddots & \vdots \\ w_{I-1,0} & \cdots & w_{I-1,J-1} \end{bmatrix}    (2.1)

V = [v_0 \cdots v_{I-1}]    (2.2)

H = [h_0 \cdots h_{J-1}]    (2.3)

The RBM is a stochastic neural network in that its node states are determined through a probabilistic function rather than a deterministic one. The function used in a RBM is the sigmoid or logistic function [10] (Fig. 2.2). Thus, given that the opposite layer is determined, the probability of activating a node may be calculated as the logistic function of its weighted inputs:

P(v_i = 1) = \frac{1}{1 + \exp\left(-\sum_{j=0}^{J-1} h_j w_{i,j}\right)}    (2.4)

P(h_j = 1) = \frac{1}{1 + \exp\left(-\sum_{i=0}^{I-1} v_i w_{i,j}\right)}    (2.5)

[Figure 2.2: Sigmoid Function (node activation probability vs. node energy)]

2.1.2 Alternating Gibbs Sampling

We can now describe the main operating mode of a RBM, called Alternating Gibbs Sampling (AGS). Looking at Eqns. 2.4 and 2.5, the node states for one layer can be determined as long as the other layer is fixed. In the first AGS phase, the hidden layer is generated based on test data on the visible nodes; the second AGS phase reconstructs the test data by clamping the hidden nodes and stochastically finding the states of the visible nodes. This process can continue to higher order AGS phases as each layer is clamped or determined in turn. To keep track of node states we will denote the AGS phase as a superscript; for example, V^3 would represent the visible layer node states at the third AGS phase.
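As a concrete illustration of Eqns. 2.4 and 2.5 and of one AGS half-step, the following C sketch computes the weighted input of each hidden node from the visible states and samples the hidden state from the resulting logistic probability. It is only a software model of the computation under the binary-node assumption; the function and variable names are my own, not taken from any implementation in this thesis.

#include <math.h>
#include <stdlib.h>

/* One AGS half-step: given visible states v[0..I-1] and weights w[i][j],
 * sample hidden states h[0..J-1] according to Eqn. 2.5.
 * Names and data layout are illustrative, not taken from the thesis code. */
void ags_half_step(int I, int J, const int *v, const double *w, int *h)
{
    for (int j = 0; j < J; j++) {
        double energy = 0.0;                      /* E_j = sum_i v_i * w_{i,j}      */
        for (int i = 0; i < I; i++)
            energy += v[i] * w[i * J + j];

        double p = 1.0 / (1.0 + exp(-energy));    /* logistic function, Eqn. 2.5    */
        double u = (double)rand() / RAND_MAX;     /* uniform random number          */
        h[j] = (u < p) ? 1 : 0;                   /* stochastic node state selection */
    }
}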

2.1.3 Energy

The Restricted Boltzmann Machine draws inspiration from the Boltzmann Distribution of statistical mechanics, which describes the probability distribution over the states of a system [12]. The state of a RBM is defined by its visible and hidden layers. Thus, given a certain configuration of visible and hidden nodes and fixing the weights, we can define an energy:

E(V, H) = -\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} v_i h_j w_{i,j}    (2.6)

From the Boltzmann Distribution:

P(V, H) \propto \exp(-E(V, H))    (2.7)

The goal of learning in an RBM is to model the training set; this can be accomplished by modifying the weights such that we obtain a Boltzmann distribution in which the probability of configurations containing the training vectors is maximized. Looking at the previous equation, we see that to maximize the probability of a training vector P(V) we need to minimize the energy associated with its configurations.

In addition, from the energy equation, we can see that for each visible or hidden node there is an associated local energy:

E(V, H) = -\sum_{i=0}^{I-1} v_i E_i = -\sum_{j=0}^{J-1} h_j E_j    (2.8)

E_i = \sum_{j=0}^{J-1} h_j w_{i,j}    (2.9)

E_j = \sum_{i=0}^{I-1} v_i w_{i,j}    (2.10)

The local energy for a given node is in fact the weighted sum of the states of the nodes from the opposite layer. Therefore, we can rewrite Eqns. 2.4 and 2.5 as:

P(v_i = 1) = \frac{1}{1 + e^{-E_i}}    (2.11)

P(h_j = 1) = \frac{1}{1 + e^{-E_j}}    (2.12)

We can write these local energies more succinctly as members of vectors:

E_V = [E_0 \cdots E_{I-1}] = H W^T    (2.13)

E_H = [E_0 \cdots E_{J-1}] = V W    (2.14)

Thus the node state vectors V and H are functions of the energy vectors E_V and E_H respectively.

2.1.4 Learning Rules

Given this concept of state energy, we can find the learning rule for a RBM by differentiating the log probability of obtaining a particular visible layer configuration:

\frac{\partial \log P(V)}{\partial w_{i,j}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty    (2.15)

where \langle a \rangle^n denotes the expected value of a at the nth AGS phase. From this equation we can see that to increase the probability of training vectors, we can apply the following weight update rule:

\Delta w_{i,j} = \epsilon (\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty)    (2.16)

where \epsilon is the learning rate. This learning rate must be carefully controlled: large weight updates may overshoot the energy minima, while small updates take too long to reach them. One solution is to dynamically decrease the learning rate during training using the process of simulated annealing [12].

Clearly, getting a sample from the infinite AGS phase is not feasible in computation time. However, it has been shown that we can estimate the infinite phase with a finite one; this is called contrastive divergence (CD) learning [13]. Using CD we no longer perform true gradient descent in weight space, but it has been shown to work well even with as few as three AGS phases.

2.1.5 Batch Learning

To have weight updates which represent the entire set of training data, it would be best to calculate the average weight update for the entire training set before committing the change. This type of weight update is called batch learning. For large sets, batch learning would result in long computation times between weight updates. To address this problem, we can reduce the batch size and create mini-batches to increase the update rate at the expense of update precision. At the limit of one training vector per weight update, we are performing on-line learning. For a batch size of L, the learning rule becomes:

\Delta w_{i,j} = \frac{\epsilon}{L} \sum_{l=0}^{L-1} (\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty)    (2.17)

2.1.6 Summary

The following procedure describes the RBM training operation with three AGS phases and a mini-batch size of L. A software sketch of this loop is given below.

1. Apply a training vector to the visible layer. This becomes V^1 in the next step.
2. AGS Phase 1: Compute the local energy E_H^1 = V^1 W and apply the logistic function to find the node states H^1 = f(E_H^1).
3. Increment the weight update: \Delta w_{i,j} = \Delta w_{i,j} + (V^1)^T H^1
4. AGS Phase 2: Compute the local energy E_V^2 = H^1 W^T and apply the logistic function to find the node states V^2 = f(E_V^2).
5. AGS Phase 3: Compute the local energy E_H^3 = V^2 W and apply the logistic function to find the node states H^3 = f(E_H^3).
6. Decrement the weight update: \Delta w_{i,j} = \Delta w_{i,j} - (V^2)^T H^3
7. Repeat steps 1-6 for each training vector in the mini-batch.
8. Commit the weight update: w_{i,j} = w_{i,j} + (\epsilon / L) \Delta w_{i,j}
9. Repeat steps 1-8 for each mini-batch in the training set.

It should be noted that the computation of the energies in each AGS phase is of complexity O(n^2) and the weight update computation is also O(n^2). This makes training RBMs with thousands of nodes a very time consuming process.
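The nine steps above translate directly into software. The listing below is a plain C rendering of the procedure for a square n x n RBM with binary nodes, written purely for illustration in the spirit of the sequential benchmark used later; the names and data layout are my own assumptions.

#include <math.h>
#include <stdlib.h>

/* Illustrative CD training for an n x n RBM with binary nodes.
 * W and dW are row-major n*n arrays; eps is the learning rate. */

static double logistic(double e) { return 1.0 / (1.0 + exp(-e)); }

static int sample(double p) { return ((double)rand() / RAND_MAX) < p; }

/* Compute one layer from the other: if to_hidden, H = f(V W); otherwise V = f(H W^T). */
static void ags_phase(int n, const double *W, const int *from, int *to, int to_hidden)
{
    for (int k = 0; k < n; k++) {
        double e = 0.0;
        for (int m = 0; m < n; m++)
            e += from[m] * (to_hidden ? W[m * n + k] : W[k * n + m]);
        to[k] = sample(logistic(e));
    }
}

/* Steps 1-8: one mini-batch of L training vectors (CD with three AGS phases). */
void train_minibatch(int n, double *W, double *dW, const int *batch, int L, double eps)
{
    for (int i = 0; i < n * n; i++) dW[i] = 0.0;

    int *h1 = malloc(n * sizeof *h1);
    int *v2 = malloc(n * sizeof *v2);
    int *h3 = malloc(n * sizeof *h3);

    for (int l = 0; l < L; l++) {
        const int *v1 = batch + l * n;              /* step 1: training vector V^1  */
        ags_phase(n, W, v1, h1, 1);                 /* step 2: AGS phase 1 -> H^1   */
        ags_phase(n, W, h1, v2, 0);                 /* step 4: AGS phase 2 -> V^2   */
        ags_phase(n, W, v2, h3, 1);                 /* step 5: AGS phase 3 -> H^3   */
        for (int i = 0; i < n; i++)                 /* steps 3 and 6: accumulate    */
            for (int j = 0; j < n; j++)             /* (V^1)^T H^1 - (V^2)^T H^3    */
                dW[i * n + j] += v1[i] * h1[j] - v2[i] * h3[j];
    }
    for (int i = 0; i < n * n; i++)                 /* step 8: commit the update    */
        W[i] += (eps / L) * dW[i];

    free(h1); free(v2); free(h3);
}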

2.1.7 Deep Belief Networks

The Restricted Boltzmann Machine is powerful in itself at extracting features from test data. However, it becomes even more useful as part of a Deep Belief Network (DBN). In effect, DBNs consist of multiple RBMs stacked upon each other; the hidden nodes of one layer become the visible nodes of the next, as in Fig. 2.3. The additional layers of hidden nodes are used to model patterns within the patterns generated by earlier layers. Thus, the DBN is able to model more complex features in data.

[Figure 2.3: Layout of a Deep Belief Network]

What is interesting about these deep networks is that they can be greedily trained layer by layer using the same efficient algorithm presented above [1]. To get better classification or generative properties, additional training can be performed using wake-sleep or backpropagation algorithms.

2.2 Methods for accelerating Restricted Boltzmann Machines

A number of computationally intensive operations need to be performed during RBM training. In calculating the local energies, a vector-matrix multiplication must be performed, as well as a matrix transposition during even AGS phases. In addition, to evaluate the node states, the non-linear logistic function must be evaluated. These operations can be slow on sequential general purpose processors. However, there have been a number of attempts to accelerate the process.

Of particular interest are three published implementations: one design using the inherent parallelism in Graphics Processing Units (GPUs) [8] and two custom hardware designs implemented on Field Programmable Gate Arrays (FPGAs) [5, 4].

Modern GPUs offer several layers of parallelism much greater than standard multi-core CPUs, allowing them to operate on large batches of data at once. In addition, optimized linear algebra packages are available for them. Raina et al. [8] used an NVIDIA GTX 280 GPU with 1GB of RAM and the CUDA application layer to accelerate RBM operations and build deep belief networks. On a single RBM of 4096x11008 size, they achieved a speed-up of 72.6x over a software implementation using the optimized matrix operation library Goto BLAS running on a 3.16GHz dual-core processor. To minimize data transfer of weights in DBNs, they developed the idea of overlapping patches. By representing the visible layer as a 2D surface and tiling patches across it, they were able to create local connections between hidden layers where the patches overlapped. Using this method, they were able to build 4-layer DBNs with 96 million parameters. However, the amount of overlapping area decreases as the order of overlap increases, so this method is inherently limited to DBNs of decreasing size for higher layers. In addition, the layers are not fully connected with the overlapping patches method, so this implementation is limited to a subset of DBN problems.

Kim et al. [5] developed a hardware implementation of an RBM on an Altera Stratix III EP3SL340 FPGA. In this design the authors decided to use 16-bit fixed point words to represent the weights, energies and visible node states. The main computational cores of this design were partitioned into groups of adders and multipliers to perform the vector-matrix operation of the local energy calculation. To perform the energy calculation for all of the nodes of a given layer in parallel, all of the row or column elements of the weight matrix must be available at the same time and thus must be stored on separately addressable memory elements. To address this, the authors stored each column of the weight matrix in a separate memory block such that a single row was available at a time. This allowed the visible energies to be calculated simply using a multiplier and tree adder.

Then, by using an accumulator structure to calculate the hidden energies, they did not have to modify their memory structure. To compute the logistic function, a Piecewise Linear Approximation of a Nonlinear function (PLAN) was implemented. When benchmarked against a software implementation running on a 2.4GHz Intel Core 2 system, they achieved a speed-up of 25x over single precision MATLAB code and 30x over double precision. The maximum network size achieved was 512x512.

The final FPGA implementation, by Ly et al. [4], was developed on a Berkeley Emulation Engine 2 hardware platform consisting of five interconnected Virtex-II Pro XC2VP70 FPGAs. In this design, a set of tree adders was used to calculate the visible and hidden energies. The problem of weight addressing was alleviated by storing diagonal sections of the matrix in different memory blocks. In this way, the same set of memory blocks could be used to access a row or column of the weight matrix. The logistic function was performed using a piecewise linear interpolator. Some significant differences from the FPGA design by Kim et al. are that the weights and energies are represented as 32-bit fixed point numbers rather than 16-bit ones, and that the visible nodes can only be binary valued, whereas they are real valued in Kim et al.'s design. Three different designs were presented: one on a single FPGA running a 128x128 RBM, one using coarse grain parallelism across four FPGAs to run a 256x256 RBM, and one time-multiplexing the resources of a single FPGA to realize a 256x256 network. The speed-ups obtained were 61x, 145x and 32x respectively over an optimized C implementation running on a 2.8GHz Pentium 4 processor.

The works described here did not use a common benchmark, so it is difficult to compare performance directly. The GPU implementation has a clear advantage in network size, but the limitations of its overlapping patches technique make it unusable for large, general DBNs. Of the two FPGA implementations, the one by Ly et al. has a clear performance advantage, especially considering that it is implemented on older FPGA hardware. Notably, no designs have been published implementing real world DBN applications.

Chapter 3
Virtualized FPGA Architecture

The work in this thesis is built on top of the virtualized FPGA RBM architecture designed by Ly et al. [4]. In this chapter, some important aspects of the architecture will be discussed.

Custom FPGA hardware cores are able to perform the computations involved in RBM training very quickly, but a FPGA has a finite amount of resources. Therefore, the size of the network a single FPGA can work on is limited. One way to increase the workable network size is by simply adding more FPGAs. However, as the size of the application grows, this method quickly becomes cost and power prohibitive. A better approach is to time-multiplex the hardware to handle problems of almost arbitrary size. The tradeoff in this approach is that a context switch is required to work on different portions of the network. The virtualized RBM architecture that this thesis is based on uses the time-multiplexing approach to work on networks whose size would not normally fit on a single FPGA.

3.1 Partitioning

To use a virtualized system for performing Restricted Boltzmann Machine operations, the computations must first be partitioned into independent work units. By partitioning the visible and hidden vectors into A and B parts respectively, the weight matrix can be broken into a group of block matrices:

W = \begin{bmatrix} W_{0,0} & \cdots & W_{0,B-1} \\ \vdots & \ddots & \vdots \\ W_{A-1,0} & \cdots & W_{A-1,B-1} \end{bmatrix}    (3.1)

V = [V_0 \cdots V_{A-1}]    (3.2)

H = [H_0 \cdots H_{B-1}]    (3.3)

The energy calculation then becomes:

E_H = V W = [E_{H_0} \cdots E_{H_{B-1}}], \quad E_{H_j} = V_0 W_{0,j} + \cdots + V_{A-1} W_{A-1,j}    (3.4)

E_V = H W^T = [E_{V_0} \cdots E_{V_{A-1}}], \quad E_{V_i} = H_0 W_{i,0}^T + \cdots + H_{B-1} W_{i,B-1}^T    (3.5)

In this configuration, the energies E_{H_j} and E_{V_i} required to calculate a block of node states are now divided into a number of partial energy blocks involving the calculation V_i W_{i,j} or H_j W_{i,j}^T. These partial energy calculations can be done independently and later recombined to resolve the node states. Also, by partitioning the weights in this way, the weight update calculations may be performed on each weight matrix W_{i,j} independently as well.
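To illustrate Eqn. 3.4 in software (this is not the hardware flow itself), the sketch below computes each partial energy V_a W_{a,b} independently and accumulates it into the full hidden energy vector; the block computations could equally be performed in any order, or on different devices, before being recombined.

/* Illustrative partial-energy computation for Eqn. 3.4.
 * n is the full layer size, p the partition (block) size, so A = B = n/p.
 * W is row-major n*n; V holds binary visible states; EH receives hidden energies. */
void hidden_energies_partitioned(int n, int p, const double *W, const int *V, double *EH)
{
    for (int j = 0; j < n; j++) EH[j] = 0.0;

    for (int a = 0; a < n / p; a++) {          /* block row: partition V_a        */
        for (int b = 0; b < n / p; b++) {      /* block column: partition W_{a,b} */
            /* Partial energy V_a * W_{a,b}, accumulated into E_{H_b}.            */
            for (int j = 0; j < p; j++) {
                double partial = 0.0;
                for (int i = 0; i < p; i++)
                    partial += V[a * p + i] * W[(a * p + i) * n + (b * p + j)];
                EH[b * p + j] += partial;      /* recombination of partial energies */
            }
        }
    }
}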

3.2 Computational Cores

The hardware architecture consists of three major cores. The Restricted Boltzmann Machine Core (RBMC) performs the primary vector-matrix energy calculation as well as the weight update. The Energy Accumulator Core (EAC) is used to sum the partial energies described in the previous section. Finally, the Node Select Core (NSC) evaluates the node states using the sigmoid function.

3.2.1 Restricted Boltzmann Machine Core

The RBMC performs the O(n^2) energy calculation (Eqns. 2.13, 2.14) and weight update (Eqn. 2.16) steps in O(n) time. To achieve this speed, a number of restrictions are applied to the RBM network:

- The sizes of the visible and hidden layers must be the same, to allow reuse of the same computational logic for both layers.
- Node states must be binary values. This condition allows the use of AND gates in place of multipliers for operations involving the node states.
- Weights and energies use a 32-bit fixed point representation. The choice of fixed point over floating point simplifies the arithmetic logic for energy and weight update calculations.
- The layer size must be a power of two. This limitation allows the use of a binary tree adder when calculating the energy.

The size restrictions on the node layers do not limit the space of problems the architecture is capable of handling, since unused nodes may be added to reach the next power of two. However, maximum effective performance will not be attained unless the sizes of the layers in the application match the previous descriptions. Also, a 32-bit fixed point representation provides a very large range of supported values and would likely not run into overflow or underflow issues unless the radix was chosen poorly. The choice of binary valued node states, on the other hand, does limit the number of real world applications, but the simplicity of the logic required allows for very fast computation in the set of problems the architecture can handle.

With these restrictions in place, the node energies may be calculated by accessing an entire row or column of the weight matrix, performing a logical AND with the node states and feeding the result into a binary tree adder. Due to the pipelined nature of the tree adder, one energy can be produced every clock cycle, thus reducing the computational complexity to O(n). Likewise, an entire row or column of weight updates may be generated in parallel by performing a logical AND between the visible and hidden node states and using the outcome to decide whether or not the learning rate should be applied to the weight updates.

Clearly, an important characteristic of the RBMC is its ability to access a full column or row of the weight matrix in parallel. To facilitate this, for a RBM of size nxn, n physical dual ported Block RAMs (BRAMs) are instantiated, each containing a diagonal of the weight matrix. An example for a 3x3 RBM is shown in Fig. 3.1. Notice that for every row and column, each weight is stored on a different BRAM. A sketch of this addressing scheme is given below.

[Figure 3.1: Weight Distribution in BRAM]

The parallel storage of weights turns out to be the limiting factor in terms of the size of RBM synthesizable on a single FPGA. For weight storage, n BRAMs are required, and an additional n BRAMs are required to store the weight updates.
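The diagonal storage scheme can be modelled in C as follows. Each of the n BRAMs is represented by one array, and both a full row and a full column of W can be read with exactly one element coming from each memory, which is what lets the RBMC fetch them in parallel. The particular indexing convention is an assumption for illustration; the actual core may map diagonals to BRAMs differently.

/* Model of diagonal weight storage (indexing convention assumed for illustration):
 * BRAM b holds the b-th diagonal of W, i.e. element w[i][j] with b = (j - i) mod n,
 * stored at address i. bram is an n x n array standing in for n independent BRAMs. */

/* Store the full weight matrix into the n "BRAMs". */
void store_weights(int n, const double *W, double *bram)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            bram[((j - i + n) % n) * n + i] = W[i * n + j];   /* BRAM (j-i) mod n, address i */
}

/* Read row i: one element from each BRAM, all at the same address i. */
void read_row(int n, const double *bram, int i, double *row)
{
    for (int b = 0; b < n; b++)
        row[(i + b) % n] = bram[b * n + i];                   /* element w[i][(i+b) mod n]   */
}

/* Read column j: one element from each BRAM, at address (j - b) mod n. */
void read_col(int n, const double *bram, int j, double *col)
{
    for (int b = 0; b < n; b++)
        col[(j - b + n) % n] = bram[b * n + ((j - b + n) % n)]; /* element w[(j-b) mod n][j] */
}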

3.2.2 Energy Accumulator Core and Node Select Core

The EAC and NSC work in tandem to find the node states given a set of partial energies. First, as a stream of partial energies arrives at the EAC, they are summed and stored in a BRAM First In First Out (FIFO) memory structure. Once the total energies have been computed, the EAC sends them to the NSC, which performs node state selection using an approximated sigmoid function and a uniform random number generator. The sigmoid function is calculated using a look-up table (LUT) whose output is sent through a pipelined piecewise linear interpolator (PLI) in order to get a better estimate. Once the node states have been determined, they are sent back to the EAC and from there back to the source of the partial energies.
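The LUT-plus-interpolation approach of the NSC can be sketched behaviourally in C as below. The table size, input range and fixed-point format used by the real core are not specified here, so the constants in this sketch are illustrative assumptions.

#include <math.h>

#define SIG_TABLE_SIZE 64          /* illustrative table size, not the core's actual size */
#define SIG_RANGE      8.0         /* assume the table covers energies in [-8, 8)         */

static double sig_table[SIG_TABLE_SIZE + 1];

/* Fill the table with exact sigmoid samples (done once). */
void sigmoid_table_init(void)
{
    for (int k = 0; k <= SIG_TABLE_SIZE; k++) {
        double e = -SIG_RANGE + 2.0 * SIG_RANGE * k / SIG_TABLE_SIZE;
        sig_table[k] = 1.0 / (1.0 + exp(-e));
    }
}

/* Approximate sigmoid: look up the two nearest table entries and
 * linearly interpolate between them, as the NSC's LUT + PLI does. */
double sigmoid_approx(double e)
{
    if (e <= -SIG_RANGE) return 0.0;
    if (e >=  SIG_RANGE) return 1.0;
    double pos  = (e + SIG_RANGE) * SIG_TABLE_SIZE / (2.0 * SIG_RANGE);
    int    k    = (int)pos;                 /* table index                  */
    double frac = pos - k;                  /* position between two entries */
    return sig_table[k] + frac * (sig_table[k + 1] - sig_table[k]);
}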

3.3 Message Passing Interface

All three cores performing different parts of the RBM calculations must be connected to each other as well as to a supervising microprocessor. In order to provide a simple, high bandwidth communication channel, TMD-MPI [14] was used to connect the cores. TMD-MPI implements a subset of the Message Passing Interface (MPI) standard for embedded systems. This communication layer offers a level of abstraction away from the implementation of the computational cores. Data is sent point to point through a network as packets called messages. Each message has a defined source, destination, tag and word count, where words are 32-bit pieces of data. At initialization, each device on the MPI network is given a specific address called a rank, which is used to route packets through the network. When data is received, it is stored in a message queue. Once a hardware core begins to read the message, a new word of the data is available each clock cycle. This operation allows the cores to operate asynchronously and yet still have high bandwidth communication with each other.

The microprocessor has its own Message Passing Engine (MPE) which supports direct memory access (DMA) and burst access to memory. These features allow a minimal overhead from the processor, since only four 32-bit words must be sent to the MPE before it may begin streaming data.

[Figure 3.2: Structure of the Virtualized Restricted Boltzmann Machine Architecture]

The MPI connectivity of the full system is shown in Fig. 3.2. The circles represent the MPI hardware and show the ranks of each computing element. The platform used in [4] was a Berkeley Emulation Engine 2 (BEE2) [15] with five Virtex-II Pro XC2VP70 FPGAs in a communication mesh and a hard PowerPC (PPC) processor.
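From the supervising processor's point of view, a context switch is essentially a short sequence of message-passing calls. The fragment below sketches this using standard MPI-style send and receive calls; TMD-MPI implements a subset of the MPI standard, but the exact calls, ranks and tags used here are assumptions for illustration rather than the thesis code.

#include <mpi.h>

/* Illustrative context switch for one weight partition, seen from the
 * supervising processor. RANK_RBMC and the tags are assumed values. */
#define RANK_RBMC   1
#define TAG_WEIGHTS 0
#define TAG_NODES   1

void load_partition_and_get_states(int *weights, int n_words,
                                   int *nodes, int node_words)
{
    /* Stream one weight block matrix to the RBM core. */
    MPI_Send(weights, n_words, MPI_INT, RANK_RBMC, TAG_WEIGHTS, MPI_COMM_WORLD);

    /* ... the RBMC computes partial energies, the EAC/NSC resolve node states ... */

    /* Receive the resulting node states back from the hardware. */
    MPI_Recv(nodes, node_words, MPI_INT, RANK_RBMC, TAG_NODES,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}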

Chapter 4
Large Restricted Boltzmann Machine Architecture

The goal of this thesis was to investigate the scalability of the FPGA architecture discussed in the previous chapter and adapt it to handle the size of Restricted Boltzmann Machines required in real world Deep Belief Network applications. In this chapter, the design modifications to the existing architecture will be discussed.

One of the first steps taken to increase the performance of the virtualized FPGA architecture was to move the design to a more modern FPGA. The BEECube Berkeley Emulation Engine 3 (BEE3) [16] hardware platform was chosen as a logical upgrade from the Virtex-II based BEE2. The BEE3 contains four Xilinx Virtex-5 5VLX155T FPGAs connected in a ring, each with access to up to 16GB of DDR2 external RAM. The only major design change during this transition was a switch from a PowerPC processor managing the hardware cores to soft MicroBlaze processors.

4.1 Investigation of Data Bit Widths

The existing virtualized Restricted Boltzmann Machine architecture represented weight and energy values as 32-bit fixed-point numbers. This bit width was a convenient design choice since the MPI hardware operated with 32-bit data widths and the configurable dual ported BRAMs supported up to 36-bit data widths. However, significant performance improvements may be realized with a reduction in bit width. For example, given the 32-bit width of the MPI channel, a bit width of 16 bits would result in double the number of weights or energies transferred per clock cycle, since two of them could be packed into each MPI word. A width of 8 bits would allow four times the number of weights or energies transferred. Since the weights must be transferred during every context switch, this gain in throughput becomes significant in the virtualized RBM architecture. With data packing, the operation of the EAC and NSC may also be parallelized to add multiple energies and calculate multiple node states per clock cycle. Finally, a RBM of size n requires 2n physical dual ported BRAMs to store the weight and weight update matrices on the RBMC. If the BRAMs on the FPGA may be split into narrower, but more plentiful, dual ported BRAMs, the size of RBM that can be synthesized on a single FPGA may also be increased. This would significantly increase the performance of the RBMC.

The drawback of using fewer bits to represent data is the reduction in the range of possible values. Depending on the RBM application, there exists the possibility of overflow or underflow. This could lead to problems finding a set of values in weight space to accurately represent the given training set. To roughly estimate the effect of using different data widths, a simple experiment was carried out. A RBM was trained in software with three different signed fixed-point representations: 32-bit with 8 magnitude bits, 16-bit with 8 magnitude bits and 8-bit with 4 magnitude bits. The network of size 1024x512 was trained for 100 epochs to recognize an image of the number 0. As a comparison metric, the trained networks were fed back the training image and AGS was run for 1025 phases. If the weights had been set properly, the network would be able to reproduce the image faithfully. The figure below shows the average number of errors found in a bitwise comparison of the original image versus the reconstructed one over ten attempts.

[Figure: Bit Width vs. Average Reconstruction Error (average reconstruction error vs. weight and energy bit width)]

The reconstruction with 16-bit weights produced a result similar to the case with 32-bit weights, while the network with 8-bit weights failed to reproduce the image at all. Previous studies [17, 5] on data width reinforce these results and suggest that 16 bits is adequate for many neural network applications. Thus, this project uses 16-bit representations for weight and energy values. Due to the choice of 16-bit widths, two energies are packed into each word transmitted over MPI, and the EAC and NSC were modified to perform the energy summation and node state calculation in parallel for both incoming energies; a sketch of this packing is given at the end of this section.

The Virtex-5 family Block RAMs are 36 Kbit dual ported modules configurable in a number of width and depth settings. Each BRAM may also be configured as two independent 18 Kbit dual ported modules. In addition, both the 36 Kbit and 18 Kbit BRAMs may be configured in simple dual-port mode, in which there is a single dedicated read port and a single dedicated write port. In this configuration, the 36 Kbit BRAM width is doubled to 72 bits and the 18 Kbit BRAM width is doubled to 36 bits [18]. The 5VLX155T FPGA provides a limited number of 36 Kbit BRAMs, each of which may be split into two 18 Kbit BRAMs; the maximum RBM which can be implemented on the FPGA is 128x128, using 18 Kbit BRAMs [19]. The RBMC only requires a single read port and a single write port for each weight storage BRAM, so the maximum RBMC is 128x128 with both 32-bit and 16-bit data widths. However, the improvement in communication performance from the reduced bit width is still beneficial to overall performance.
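The data packing described above can be sketched as follows; this is my own illustration of the idea rather than the HDL. Two signed 16-bit fixed-point values are packed into one 32-bit MPI word and unpacked again, doubling the number of weights or energies carried per transferred word.

#include <stdint.h>

/* Illustration of the 16-bit packing described in Section 4.1.
 * Pack two signed 16-bit fixed-point values into one 32-bit word:
 * 'a' in the low half, 'b' in the high half. */
uint32_t pack16(int16_t a, int16_t b)
{
    return ((uint32_t)(uint16_t)a) | ((uint32_t)(uint16_t)b << 16);
}

/* Unpack the two 16-bit values; casting back to int16_t restores the sign. */
void unpack16(uint32_t word, int16_t *a, int16_t *b)
{
    *a = (int16_t)(word & 0xFFFFu);
    *b = (int16_t)(word >> 16);
}

/* Example: a stream of 2n energies occupies only n MPI words. */
void pack_energies(const int16_t *energies, int n_energies, uint32_t *words)
{
    for (int i = 0; i < n_energies / 2; i++)
        words[i] = pack16(energies[2 * i], energies[2 * i + 1]);
}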

4.2 Memory and Communication Considerations

4.2.1 Data Storage

A weight matrix for a RBM of size 128x128 with a 16-bit weight representation requires 32 KBytes of memory to store. As we increase the RBM network size, the weight matrix grows quadratically and quickly exceeds the storage resources on a FPGA. In addition, when training real world applications the number of training vectors can become very large; for example, the MNIST database of handwritten digit images was used while training a network to recognize handwritten numbers [1]. In order to store all of this data, off-chip DDR2 RAM was used. Fig. 4.1 shows the local connections to the MicroBlaze processor. Data is streamed through the processor local bus (PLB) from the DDR2 RAM to the multi-ported memory controller (MPMC) and finally to the PLB MPE core, after which it is distributed to the appropriate computational core through the MPI network.

[Figure 4.1: MicroBlaze PLB Connectivity]

4.2.2 Communication Overhead

Since much of the data must be located external to the FPGA, additional latency is introduced during transfers between the MicroBlaze and the computational cores. The MPE core is designed with direct memory access (DMA) and is able to perform burst writes and reads to and from the external memory through the PLB. In principle this is fast, especially for large blocks of data such as the weight matrices. The only overhead from the MicroBlaze processor is the transmission of four words to set up the MPE core. However, when transferring smaller batches of data this overhead becomes very significant, since the MicroBlaze is slow compared to the hardware cores. In particular, the node states for a 128x128 system are only four 32-bit words long and a set of energies is only 64 words long. Therefore, operations heavily involving these elements, such as the node state calculation, are subject to a significant performance hit.

In addition to the overhead from the MicroBlaze, another significant performance reduction comes from the context switch operation itself. Although the transmission of weight matrices is relatively fast given DMA, bursting and the two packed 16-bit elements in each MPI word, the communication time represents a significant period during which the RBMC is idle. One simple way to address this problem is by increasing the mini-batch size. This allows the weight matrix to remain on the RBMC for a longer period of time and, as the batch size increases, the computation time of the RBMC would theoretically become the limiting factor; a rough model of this trade-off is sketched below. One key drawback of using larger batch sizes is the need to store more partial energies before the node state calculations may occur. In this particular implementation, since the energies are stored in large external DDR RAM, this does not have a significant effect.
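To make the batch-size argument concrete, the toy model below estimates the fraction of a context switch that the RBMC spends computing rather than waiting for weight transfers. The formula and parameter names are a simplification of the discussion above, not measurements from this work.

/* Toy model of context-switch amortization.
 * t_weight_xfer: time to stream one weight partition in and out (per context switch)
 * t_compute:     RBMC compute time for one training vector on that partition
 * batch_size:    training vectors processed per context switch (mini-batch size L)
 * Returns the fraction of the context switch spent on useful computation. */
double rbmc_utilization(double t_weight_xfer, double t_compute, int batch_size)
{
    double busy  = t_compute * batch_size;
    double total = busy + t_weight_xfer;
    return busy / total;   /* approaches 1.0 as batch_size grows */
}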

4.3 Extension to Four FPGAs

Additional performance was obtained by using the four FPGAs available on the BEE3 platform to provide coarse grain parallelism. Since communication links between FPGAs are not as fast as on-chip links, minimizing the inter-FPGA communication was essential to maintaining performance. In addition, it was important to ensure that the workload was shared evenly among the FPGAs. From these two conditions, the weight matrices, representing the largest data transfer, were assigned to FPGAs where they are simply streamed locally in and out of DDR2 RAM. The partitioning of the matrices shown in Fig. 4.2 is similar to the weight distribution within the RBMC from Fig. 3.1, the one difference being that there may be fewer FPGAs than weight matrices, and thus multiple sets of calculations may be required to get all the partial energies for a set of nodes. This structure allows all of the FPGAs to work together computing either a set of visible or hidden nodes at once.

[Figure 4.2: Weight distribution of eight partitions among four FPGAs]

The overall system layout is shown in Fig. 4.3. In this configuration, all of the partial energies from each FPGA must be sent to a single location to be added together, and the node states must then be distributed from that single source back to the rest of the FPGAs. Given the operation of the EAC, this was the simplest method of connectivity. However, it results in a communication bottleneck that becomes more significant as FPGAs are added.

[Figure 4.3: Overall Layout in Four FPGA System. All MPI Ranks are interconnected.]

In addition, a key limitation in this implementation is network size. A bug exists in the EAC in which it ceases operation when more than ten partial energies are delivered to it. Due to time constraints, this bug was not addressed for this thesis. Therefore, the maximum network size implementable by this system is 8n, where n is the size of RBM synthesized on a single FPGA. An outline of the code run on each FPGA is provided in Appendix A; a simplified sketch of that per-FPGA control flow follows.
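Appendix A is not reproduced here, so the following fragment is only my reading of the per-FPGA control flow described in this chapter: each MicroBlaze streams its locally stored weight partitions to its RBMC, forwards the resulting partial energies to the FPGA hosting the EAC and NSC, and waits for the node states to come back before the next AGS phase. All function names are placeholders.

/* Simplified per-FPGA control loop (my interpretation of the flow described
 * above; the real code is outlined in Appendix A of the thesis). The helper
 * functions below are stubs standing in for TMD-MPI transfers. */

static void stream_weights_to_rbmc(int partition)      { (void)partition; }
static void stream_clamped_layer_to_rbmc(int vector)   { (void)vector; }
static void send_partial_energies_to_eac(void)         { }
static void recv_node_states_from_eac(void)            { }

/* One AGS phase of one mini-batch, as seen by a single FPGA. */
void run_ags_phase(int num_local_partitions, int batch_size)
{
    for (int p = 0; p < num_local_partitions; p++) {
        stream_weights_to_rbmc(p);                 /* context switch: load W_{i,j} from DDR2 */
        for (int v = 0; v < batch_size; v++) {
            stream_clamped_layer_to_rbmc(v);       /* clamped layer states for this vector   */
            send_partial_energies_to_eac();        /* partial energies go to the EAC FPGA    */
        }
    }
    recv_node_states_from_eac();                   /* node states for the mini-batch return  */
}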

Chapter 5
Results and Analysis

5.1 Test Methods

5.1.1 Test Setup

The design was tested on the BEE3 platform with all computational cores and MicroBlaze soft processors running at 100MHz. In addition, an external 2GB DDR2-667 RDIMM module running at 200MHz was connected through a multi-ported memory controller (MPMC) to the processor local bus (PLB) of the MicroBlaze processor. In hardware, two different network sizes were synthesized: 64x64 and 128x128. Virtual network sizes of 1024x1024 and 512x512 were tested on the 128x128 system, while only 512x512 was tested on the 64x64 system. Although the f_max of the 128x128 system reported by the Xilinx Synthesis Tool (XST) would have permitted a higher clock rate, system clock frequencies greater than 100MHz were not explored due to time constraints. In testing the effect of various batch sizes on performance, batches of 1, 8, 16, 32, 64 and 84 were run.

A sequential implementation written in C was used as a basis for comparison for relative speedup. The equivalent software versions of the hardware components were written such that the output of the software benchmark matched the output of the hardware implementation. The C implementation was compiled using gcc with optimization level 2. The benchmark was run on an Intel Core 2 Duo E8400 at 3GHz on a 32-bit version of Ubuntu Linux. To record the computation time, the function gettimeofday() was used in the software implementation and the results of 25 runs were averaged to get the final computation time. For the hardware implementation, the function MPI_TIME() was used and the results were averaged over 10 runs.

5.1.2 Test Metric

Although relative speedup is an interesting measure, it is difficult to compare different architectures without an absolute measure of performance. One popular method of measuring neural network training performance is Connection Updates per Second (CUPS) [20]. This is defined as the number of weight updates per second, or

CUPS = \frac{n^2}{T}    (5.1)

where n is the size of the node layers and T is the amount of time required for all of the weights to be updated for one test vector. The speedup over the sequential C implementation was taken to be the ratio of the Connection Updates per Second of the hardware implementation and that of the software implementation:

S = \frac{CUPS_h}{CUPS_s}    (5.2)
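Eqns. 5.1 and 5.2 translate directly into code. The helpers below, with hypothetical argument names, compute CUPS from a measured per-vector update time and the resulting speedup of the hardware over the software baseline.

/* Connection Updates per Second for an n x n RBM, where t_per_vector is the
 * measured time (in seconds) to update all weights for one training vector.
 * Illustrative helpers only; these are not part of the benchmark code. */
double cups(int n, double t_per_vector)
{
    return ((double)n * (double)n) / t_per_vector;   /* Eqn. 5.1 */
}

/* Speedup of the hardware implementation over the software baseline. */
double speedup(double cups_hw, double cups_sw)
{
    return cups_hw / cups_sw;                        /* Eqn. 5.2 */
}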

5.2 Results

5.2.1 Batch Size vs. Speedup

As previously mentioned, one of the major concerns with scaling a RBM architecture is the O(n^2) growth in the number of context switches and thus data transfers. Fig. 5.1 shows the effect of increasing batch size on overall performance. Both network sizes plotted were implemented on a system with a 128x128 intrinsic core size.

[Figure 5.1: Mini-batch size vs. speedup]

From Fig. 5.1, we can see that as batch size increases, the larger 1024x1024 system becomes increasingly faster than the 512x512 network. We can infer from this result that at a batch size of one, the weight transfers consume a vast majority of the computation time. As the network size is increased, the RBMC and EAC/NSC operations begin to take up a greater percentage of the time and the O(n) benefits of those cores become apparent relative to the O(n^2) software baseline.

If the RBM training time were limited purely by the RBMC or EAC/NSC computation times, we would expect to see a continued increase in speedup with batch size. However, the plots level off fairly quickly. One other operation that keeps the RBMC idle, apart from the weight transfer, is the node state calculation. Particularly in this architecture, the partial energies must be transferred from all of the FPGAs to one point and the node states must then be streamed back. This operation must be done synchronously between FPGAs; thus, some overhead in setting up the timings between FPGAs is required and the computation must wait for the most delayed FPGA to be ready.

[Figure 5.2: Mini-batch size vs. speedup without node calculation]

As an artificial experiment, the same test was run without the node calculation stage (Fig. 5.2). Here, the speedup increases noticeably beyond the last test, but still begins to taper off before reaching very high performance. Since, from [4], the hardware cores themselves are very fast, this is likely due once again to communication bottlenecks between the RAM and the compute cores. From these results, we can see that the communication overhead due to transfers between the MicroBlaze and the hardware cores limits the overall system performance. Since the number of communications increases as O(n^2), this has significant implications for scaling.

5.2.2 Intrinsic RBM Size

A second performance factor measured was the effect of changing the intrinsic network size n. A reduction in n would degrade the performance benefit of the O(n) compute cores and require more context switches for the same virtualized RBM size. However, it would also reduce the overall transfer time of data between computations. Fig. 5.3 shows a comparison of running a virtual 512x512 RBM on both n = 64 and n = 128 hardware over varying batch sizes. From this plot we can see that increasing the intrinsic size of the RBM is very beneficial above small batch sizes.

[Figure 5.3: Mini-batch size vs. speedup for a virtual 512x512 network with intrinsic RBM sizes of 64 and 128]

5.2.3 Summary

The absolute CUPS results are summarized in Table 5.1. It is interesting to note that the CUPS is approximately the same at a batch size of 1 with n = 128, regardless of the virtualized size. This represents an O(n^2) relationship.

[Table 5.1: Summary of Performance Measurements (Platform, RBM Size, Batch Size, MCUPS) for the virtualized FPGA at n = 64 and n = 128]

Chapter 6
Conclusion

6.1 Conclusions

The purpose of this thesis was to investigate the scalability of Ly et al.'s [4] virtualized FPGA architecture. The primary impediment to using the architecture for large networks is the communication overhead required in context switching: as the size of the RBM grows linearly, the number of transfers increases as O(n^2). To maintain performance, the architecture was ported to a faster, more modern FPGA platform, the BEE3. The data representation was also changed from 32 bits to 16 bits in order to reduce the time required to transfer a set of energies or weights. Finally, the system was implemented on four FPGAs in order to provide some extra coarse grain parallelism.

When compared to a sequential O(n^2) C benchmark, the results showed several different communication overhead problems. The speed at low batch sizes was limited by the weight transfers during AGS phases. At higher batch sizes, the transfer of data to the EAC became the bottleneck, and when that was removed, additional overhead reduced performance before a good speedup could be observed. The design presented in this thesis only achieves a small speedup over software at high batch sizes. However, the analysis provided may be used as a basis for further improvements to the architecture.

6.2 Future Work

The architecture presented in this thesis still has a great deal of room for improvement. Primarily, a reduction in the significant communication overheads present in the virtualized system would allow the system to more fully utilize the computational cores.

6.2.1 Weight Matrix Caching

The transfer of weights during the context switches is a significant performance bottleneck of this architecture. As shown in this work, the effect of the context switch can be partially alleviated by using large batch sizes. However, during the transfer from external DDR2 RAM, the RBMC is still inactive. To reduce the transfer latency, the next weight matrix to be processed may be cached in a compact structure within the RBMC. Another possibility is to use the leftover depth of the weight storage BRAMs to cache multiple weight matrices. In the 128x128, 16-bit architecture, only 128 of the roughly 1K elements of each 18 Kbit BRAM are being used; by making use of the independent write port, additional weight matrices may be loaded while the RBMC performs other calculations.

6.2.2 Distributed Energy Accumulator Core Structure

Another method of reducing communication overhead is to improve the operation of the EAC. If more FPGAs are added to the system, the single EAC point in the current architecture will become increasingly bottlenecked. In order for the performance of the architecture to scale well when implemented on many FPGAs, the calculation of the node states should be distributed in a tree or ring fashion. This would reduce the communication bottleneck as well as improve the node selection time in the case of the tree structure.


More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

ARestricted Boltzmann machine (RBM) [1] is a probabilistic

ARestricted Boltzmann machine (RBM) [1] is a probabilistic 1 Matrix Product Operator Restricted Boltzmann Machines Cong Chen, Kim Batselier, Ching-Yun Ko, and Ngai Wong chencong@eee.hku.hk, k.batselier@tudelft.nl, cyko@eee.hku.hk, nwong@eee.hku.hk arxiv:1811.04608v1

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation

LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation Jingyang Zhu 1, Zhiliang Qian 2, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture Rajeevan Amirtharajah University of California, Davis Outline Announcements Review: PDP, EDP, Intersignal Correlations, Glitching, Top

More information

7.1 Basis for Boltzmann machine. 7. Boltzmann machines

7.1 Basis for Boltzmann machine. 7. Boltzmann machines 7. Boltzmann machines this section we will become acquainted with classical Boltzmann machines which can be seen obsolete being rarely applied in neurocomputing. It is interesting, after all, because is

More information

Efficient random number generation on FPGA-s

Efficient random number generation on FPGA-s Proceedings of the 9 th International Conference on Applied Informatics Eger, Hungary, January 29 February 1, 2014. Vol. 1. pp. 313 320 doi: 10.14794/ICAI.9.2014.1.313 Efficient random number generation

More information

Scalable and Power-Efficient Data Mining Kernels

Scalable and Power-Efficient Data Mining Kernels Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs April 16, 2009 John Wawrzynek Spring 2009 EECS150 - Lec24-blocks Page 1 Cross-coupled NOR gates remember, If both R=0 & S=0, then

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Deep unsupervised learning

Deep unsupervised learning Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea Unsupervised learning In machine learning, there are 3 kinds of learning paradigm.

More information

Implementation of Nonlinear Template Runner Emulated Digital CNN-UM on FPGA

Implementation of Nonlinear Template Runner Emulated Digital CNN-UM on FPGA Implementation of Nonlinear Template Runner Emulated Digital CNN-UM on FPGA Z. Kincses * Z. Nagy P. Szolgay * Department of Image Processing and Neurocomputing, University of Pannonia, Hungary * e-mail:

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6 Administration Registration Hw3 is out Due on Thursday 10/6 Questions Lecture Captioning (Extra-Credit) Look at Piazza for details Scribing lectures With pay; come talk to me/send email. 1 Projects Projects

More information

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Espen Stenersen Master of Science in Electronics Submission date: June 2008 Supervisor: Per Gunnar Kjeldsberg, IET Co-supervisor: Torstein

More information

High Performance Computing

High Performance Computing Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 2 nd appello February 11, 2015 Write your name, surname, student identification number (numero di matricola),

More information

Quantum Artificial Intelligence and Machine Learning: The Path to Enterprise Deployments. Randall Correll. +1 (703) Palo Alto, CA

Quantum Artificial Intelligence and Machine Learning: The Path to Enterprise Deployments. Randall Correll. +1 (703) Palo Alto, CA Quantum Artificial Intelligence and Machine : The Path to Enterprise Deployments Randall Correll randall.correll@qcware.com +1 (703) 867-2395 Palo Alto, CA 1 Bundled software and services Professional

More information

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann

More information

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural 1 2 The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural networks. First we will look at the algorithm itself

More information

New Delhi & Affiliated to VTU, Belgavi ) Oorgaum, Kolar Dist ,Karnataka

New Delhi & Affiliated to VTU, Belgavi ) Oorgaum, Kolar Dist ,Karnataka Design and Implementation of Logic Gates and Adder Circuits on FPGA Using ANN Neelu Farha 1., Ann Louisa Paul J 2., Naadiya Kousar L S 3., Devika S 4., Prof. Ruckmani Divakaran 5 1,2,3,4,5 Department of

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory Announcements Be making progress on your projects! Three Types of Learning Unsupervised Supervised Reinforcement

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

ACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS

ACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS ACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS Bojan Musizza, Dejan Petelin, Juš Kocijan, Jožef Stefan Institute Jamova 39, Ljubljana, Slovenia University of Nova Gorica Vipavska 3, Nova Gorica, Slovenia

More information

Artificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen

Artificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen Artificial Neural Networks Introduction to Computational Neuroscience Tambet Matiisen 2.04.2018 Artificial neural network NB! Inspired by biology, not based on biology! Applications Automatic speech recognition

More information

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Logic. Combinational. inputs. outputs. the result. system can

Logic. Combinational. inputs. outputs. the result. system can Digital Electronics Combinational Logic Functions Digital logic circuits can be classified as either combinational or sequential circuits. A combinational circuit is one where the output at any time depends

More information

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0 NEC PerforCache Influence on M-Series Disk Array Behavior and Performance. Version 1.0 Preface This document describes L2 (Level 2) Cache Technology which is a feature of NEC M-Series Disk Array implemented

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/

More information

COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines COMP9444 17s2 Boltzmann Machines 1 Outline Content Addressable Memory Hopfield Network Generative Models Boltzmann Machine Restricted Boltzmann

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Sum-Product Networks: A New Deep Architecture

Sum-Product Networks: A New Deep Architecture Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng. University of Washington Joint work with Hoifung Poon 1 Graphical Models: Challenges Bayesian Network Markov Network

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Deep Learning Architecture for Univariate Time Series Forecasting

Deep Learning Architecture for Univariate Time Series Forecasting CS229,Technical Report, 2014 Deep Learning Architecture for Univariate Time Series Forecasting Dmitry Vengertsev 1 Abstract This paper studies the problem of applying machine learning with deep architecture

More information

Delayed and Higher-Order Transfer Entropy

Delayed and Higher-Order Transfer Entropy Delayed and Higher-Order Transfer Entropy Michael Hansen (April 23, 2011) Background Transfer entropy (TE) is an information-theoretic measure of directed information flow introduced by Thomas Schreiber

More information

Tunable Floating-Point for Energy Efficient Accelerators

Tunable Floating-Point for Energy Efficient Accelerators Tunable Floating-Point for Energy Efficient Accelerators Alberto Nannarelli DTU Compute, Technical University of Denmark 25 th IEEE Symposium on Computer Arithmetic A. Nannarelli (DTU Compute) Tunable

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Representational Power of Restricted Boltzmann Machines and Deep Belief Networks Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Introduction Representational abilities of functions with some

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks Topics in Machine Learning-EE 5359 Neural Networks 1 The Perceptron Output: A perceptron is a function that maps D-dimensional vectors to real numbers. For notational convenience, we add a zero-th dimension

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic

More information

Deep Learning Autoencoder Models

Deep Learning Autoencoder Models Deep Learning Autoencoder Models Davide Bacciu Dipartimento di Informatica Università di Pisa Intelligent Systems for Pattern Recognition (ISPR) Generative Models Wrap-up Deep Learning Module Lecture Generative

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

On the Use of a Many core Processor for Computational Fluid Dynamics Simulations

On the Use of a Many core Processor for Computational Fluid Dynamics Simulations On the Use of a Many core Processor for Computational Fluid Dynamics Simulations Sebastian Raase, Tomas Nordström Halmstad University, Sweden {sebastian.raase,tomas.nordstrom} @ hh.se Preface based on

More information

Hardware Acceleration of the Tate Pairing in Characteristic Three

Hardware Acceleration of the Tate Pairing in Characteristic Three Hardware Acceleration of the Tate Pairing in Characteristic Three CHES 2005 Hardware Acceleration of the Tate Pairing in Characteristic Three Slide 1 Introduction Pairing based cryptography is a (fairly)

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE 4: Linear Systems Summary # 3: Introduction to artificial neural networks DISTRIBUTED REPRESENTATION An ANN consists of simple processing units communicating with each other. The basic elements of

More information

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation) Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation

More information

Area-Time Optimal Adder with Relative Placement Generator

Area-Time Optimal Adder with Relative Placement Generator Area-Time Optimal Adder with Relative Placement Generator Abstract: This paper presents the design of a generator, for the production of area-time-optimal adders. A unique feature of this generator is

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems

More information

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 9. Datapath Design Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 2, 2017 ECE Department, University of Texas at Austin

More information