GPU-to-GPU and Host-to-Host Multipattern String Matching On A GPU


GPU-to-GPU and Host-to-Host Multipattern String Matching On A GPU

Xinyan Zha and Sartaj Sahni
Computer and Information Science and Engineering
University of Florida
Gainesville, FL
{xzha, sahni}@cise.ufl.edu

Abstract — We develop GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. For the GPU-to-GPU case, we consider several refinements to a base GPU implementation and measure the performance gain from each refinement. For the host-to-host case, we analyze two strategies to communicate between the host and the GPU and show that one is optimal with respect to run time while the other requires less device memory. Experiments conducted on an NVIDIA Tesla GT200 GPU that has 240 cores running off of a Xeon 2.8GHz quad-core host CPU show that, for the GPU-to-GPU case, our Aho-Corasick GPU adaptation achieves a speedup between 8.5 and 9.5 relative to a single-thread CPU implementation and between 2.4 and 3.2 relative to the best multithreaded implementation. For the host-to-host case, the GPU AC code achieves a speedup of 3.1 relative to a single-threaded CPU implementation. However, the GPU is unable to deliver any speedup relative to the best multithreaded code running on the quad-core host; the measured speedups for the latter case ranged between 0.74 and 1. Early versions of our multipattern Boyer-Moore adaptations ran 7% to 10% slower than corresponding versions of the AC adaptations and we did not refine the multipattern Boyer-Moore codes further.

Keywords: Multipattern string matching, Aho-Corasick, multipattern Boyer-Moore, GPU, CUDA.

1 INTRODUCTION

In multipattern string matching, we are to report all occurrences of a given set or dictionary of patterns in a target string. Multipattern string matching arises in a number of applications including network intrusion detection, digital forensics, business analytics, and natural language processing. For example, the popular open-source network intrusion detection system Snort [22] has a dictionary of several thousand patterns that are matched against the contents of Internet packets, and the open-source file carver Scalpel [18] searches for all occurrences of headers and footers from a dictionary of about 40 header/footer pairs in disks that are many gigabytes in size. In both applications, the performance of the multipattern matching engine is paramount. In the case of Snort, it is necessary to search for thousands of patterns in relatively small packets at Internet speed, while in the case of Scalpel we need to search for tens of patterns in hundreds of gigabytes of disk data. (This research was supported, in part, by the National Science Foundation.)

Snort [22] employs the Aho-Corasick [1] multipattern search method while Scalpel [18] uses the Boyer-Moore single-pattern search algorithm [5]. Since Scalpel uses a single-pattern search algorithm, its run time is linear in the product of the number of patterns in the pattern dictionary and the length of the target string in which the search is being done. Snort, on the other hand, because of its use of an efficient multipattern search algorithm, has a run time that is independent of the number of patterns in the dictionary and linear in the length of the target string.

Several researchers have attempted to improve the performance of multistring matching applications through the use of parallelism. For example, Scarpazza et al. [19], [20] port the deterministic finite automata version of the Aho-Corasick method to the IBM Cell Broadband Engine (CBE) while Zha et al. [28] port a compressed form of the non-deterministic finite automata version of the Aho-Corasick method to the CBE. Jacob et al. [12] port Snort to a GPU. However, in their port, they replace the Aho-Corasick search method employed by Snort with the Knuth-Morris-Pratt [13] single-pattern matching algorithm. Specifically, they search for 16 different patterns in a packet in parallel employing 16 GPU cores. Huang et al. [11] do network intrusion detection on a GPU based on the multipattern search algorithm of Wu and Manber [27]. Smith et al. [21] use deterministic finite automata and extended deterministic finite automata to do regular expression matching on a GPU for intrusion detection applications. Marziale et al. [14] propose the use of GPUs and massive parallelism for in-place file carving. However, Zha and Sahni [29] show that the performance of an in-place file carver is limited by the time required to read data from the disk rather than the time required to search for headers and footers (when a fast multipattern matching algorithm is used). Hence, by doing asynchronous disk reads, the pattern matching time is effectively overlapped by the disk read time and the total time for the in-place carving operation equals that of the disk read time. Therefore, this application cannot benefit from the use of a GPU to accelerate pattern matching.

Our focus in this paper is accelerating the Aho-Corasick and Boyer-Moore multipattern string matching algorithms through the use of a GPU. A GPU operates in traditional master-slave fashion (see [17], for example) in which the GPU is a slave that is attached to a master or host processor under whose direction it operates. Algorithm development for master-slave systems is affected by the location of the input data and where the results are to be left. Generally, four cases arise [24], [25], [26], as below.

1) Slave-to-slave. In this case, the inputs and outputs for the algorithm are in the slave memory. This case arises, for example, when an earlier computation produced results that were left in slave memory and these results are the inputs to the algorithm being developed; further, the results from the algorithm being developed are to be used for subsequent computation by the slave.
2) Host-to-host. Here, the inputs to the algorithm are on the host and the results are to be left on the host. So, the algorithm needs to account for the time it takes to move the inputs to the slave and that to bring the results back to the host.
3) Host-to-slave. The inputs are in the host but the results are to be left in the slave.
4) Slave-to-host. The inputs are in the slave and the results are to be left in the host.

In this paper, we address the first two cases only. In our context, we refer to the first case (slave-to-slave) as GPU-to-GPU.

The remainder of this paper is organized as follows. Section 2 introduces the NVIDIA Tesla architecture. In Sections 3 and 4, respectively, we describe the Aho-Corasick and Boyer-Moore multipattern matching algorithms. Sections 5 and 6 describe our GPU adaptations of these matching algorithms for the GPU-to-GPU and host-to-host cases. Section 7 discusses our experimental results and we conclude in Section 8.

2 THE NVIDIA TESLA ARCHITECTURE

Figure 1 gives the architecture of the NVIDIA GT200 Tesla GPU, which is an example of NVIDIA's general-purpose parallel computing architecture CUDA (Compute Unified Device Architecture) [7]. This GPU comprises 240 scalar processors (SPs) or cores that are organized into 30 streaming multiprocessors (SMs), each comprised of 8 SPs. Each SM has 16KB of on-chip shared memory, 16,384 32-bit registers, and constant and texture caches. Each SM supports up to 1024 active threads. There is also 4GB of global or device memory that is accessible to all 240 SPs.

Fig. 1. NVIDIA GT200 Architecture [23]

The Tesla, like other GPUs, operates as a slave processor to an attached host. In our experimental setup, the host is a 2.8GHz Xeon quad-core processor with 16GB of memory. A CUDA program typically is a C program written for the host. C extensions supported by the CUDA programming environment allow the host to send and receive data to/from the GPU's device memory as well as to invoke C functions (called kernels) that run on the GPU cores. The GPU programming model is Single Instruction Multiple Thread (SIMT). When a kernel is invoked, the user must specify the number of threads to be invoked. This is done by specifying explicitly the number of thread blocks and the number of threads per block.
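For concreteness, a kernel definition and launch look as below (a minimal sketch; matchKernel, devInput, devOutput, and the sizes are placeholders, not code from the paper):

// hypothetical kernel: each thread will compute its share of the output
__global__ void matchKernel(unsigned char *input, unsigned short *output, int n)
{
    int b = blockIdx.x;   // index of this thread's block
    int t = threadIdx.x;  // index of this thread within its block
    // ... per-thread matching work goes here ...
}

// host side: invoke the kernel with B thread blocks of T threads each,
// e.g., B = 30 (one block per SM) and T = 64
matchKernel<<<B, T>>>(devInput, devOutput, n);
cudaDeviceSynchronize();  // wait for the kernel to complete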
CUDA further organizes the threads of a block into warps of 32 threads each; each block of threads is assigned to a single SM, and thread warps are executed synchronously on SMs. While thread divergence within a warp is permitted, when the threads of a warp diverge, the divergent paths are executed serially until they converge.

A CUDA kernel may access different types of memory, each with different capacity, latency, and caching properties. We summarize the memory hierarchy below.

Device memory: 4GB of device memory is available. This memory can be read and written directly by all threads. However, device-memory access entails a high latency (400 to 600 clock cycles). The thread scheduler attempts to hide this latency by scheduling arithmetic operations that are ready to be performed while waiting for the access to device memory to complete [7]. Device memory is not cached.

Constant memory: Constant memory is read-only memory space that is shared by all threads. Constant memory is cached and is limited to 64KB.

Shared memory: Each SM has 16KB of shared memory. Shared memory is divided into 16 banks of 32-bit words. When there are no bank conflicts, the threads of a warp can access shared memory as fast as they can access registers [7].

Texture memory: Texture memory, like constant memory, is read-only memory space that is accessible to all threads using device functions called texture fetches. Texture memory is initialized at the host side and is read at the device side. Texture memory is cached.

Pinned memory (also known as page-locked host memory): This is part of the host memory. Data transfer between pinned and device memory is faster than between pageable host memory and device memory. Also, this data transfer can be done concurrently with kernel execution. However, since allocating part of the host memory as pinned reduces the amount of physical memory available to the operating system for paging, allocating too much page-locked memory reduces overall system performance [7].

3 THE AHO-CORASICK ALGORITHM

There are two versions, nondeterministic and deterministic, of the Aho-Corasick (AC) [1] multipattern matching algorithm. Both versions use a finite state machine/automaton to represent the dictionary of patterns. In the deterministic version (DFA), which is the version we use in this paper, each state of the finite automaton has a well-defined transition for every character in the alphabet as well as a list of matched patterns. The search starts with the automaton start state designated as the current state and the first character in the text string, S, that is being searched designated as the current character. At each step, a state transition is made by examining the current character of S: a transition is made to the state corresponding to the current character, and the next character of S becomes the current character. Whenever a state transition is made, the patterns in the list of matched patterns for the reached state are output along with the position in S of the current character. This output is sufficient to identify all occurrences, in S, of all dictionary patterns. Aho and Corasick [1] show how to compute the DFA for a set of patterns. The number of state transitions made by the DFA when searching for matches in a string of length n is n.

4 MULTIPATTERN BOYER-MOORE ALGORITHM

The Boyer-Moore pattern matching algorithm [5] was developed to find all occurrences of a pattern P in a string S. This algorithm begins by positioning the first character of P at the first character of S. This results in a pairing of the first |P| characters of S with the characters of P. The characters in each pair are compared beginning with those in the rightmost pair. If all pairs of characters match, we have found an occurrence of P in S, and P is shifted right by 1 character (or by |P| if only nonoverlapping matches are to be found). Otherwise, we stop at the rightmost pair (or first pair, since we compare right to left) where there is a mismatch and use the bad character function for P to determine how many characters to shift P right before re-examining pairs of characters from P and S for a match. More specifically, the bad character function for P gives the distance from the end of P of the last occurrence of each possible character that may appear in S. In practice, many of the shifts in the bad character function of a pattern are close to the length, |P|, of the pattern P, making the Boyer-Moore algorithm a very fast search algorithm. In fact, when the alphabet size is large, the average run time of the Boyer-Moore algorithm is O(|S|/|P|). Galil [9] has proposed a variation for which the worst-case run time is O(|S|). Horspool [10] proposes a simplification to the Boyer-Moore algorithm whose performance is about the same as that of the Boyer-Moore algorithm. Several multipattern extensions to the Boyer-Moore search algorithm have been proposed [2], [4], [8], [27].
All of these multipattern search algorithms extend the basic bad character function employed by the Boyer-Moore algorithm to a bad character function for a set of patterns. This is done by combining the bad character functions for the individual patterns to be searched into a single bad character function for the entire set of patterns. The combined bad character function B for a set of p patterns has B(c) = min{B_i(c), 1 ≤ i ≤ p} for each character c in the alphabet. Here, B_i is the bad character function for the ith pattern. The Set-wise Boyer-Moore algorithm of [8] performs multipattern matching using this combined bad character function. The multipattern search algorithms of [2], [4], [27] employ additional techniques to speed the search further. The average run time of the algorithms of [2], [4], [27] is O(|S|/minL), where minL is the length of the shortest pattern. The multipattern Boyer-Moore algorithm used by us is due to [4]. This algorithm employs two additional functions, shift1 and shift2, and a trie, called the reverse trie, which contains the reverse of the patterns.

5 GPU-TO-GPU

5.1 Strategy

The input to the multipattern matcher is a character array input and the output is an array output of states or reverse-trie node indexes (or pointers). Both arrays reside in device memory. When the Aho-Corasick (AC) algorithm is used, output[i] gives the state of the AC DFA following the processing of input[i]. Since every state of the AC DFA contains a list of patterns that are matched when this state is reached, output[i] enables us to determine all matching patterns that end at input character i. When the multipattern Boyer-Moore (mbm) algorithm is used, output[i] is the last reverse-trie node visited over all examinations of input[i]. Using this information and the pattern list stored in the trie node, we may determine all pattern matches that begin at input[i]. If we assume that the number of states in the AC DFA as well as the number of nodes in the mbm reverse trie is no more than 65536, a state/node index can be encoded using two bytes and the size of the output array is twice that of the input array.

n          number of characters in the string to be searched
maxl       length of the longest pattern
S_block    number of input characters for which a thread block computes output
B          number of blocks = n/S_block
T          number of threads in a thread block
S_thread   number of input characters for which a thread computes output = S_block/T
tword      S_thread/4
TW         total work = effective string length processed

Fig. 2. GPU-to-GPU notation

Our computational strategy is to partition the output array into blocks of size S_block (Figure 2 summarizes the notation used in this section). The blocks are numbered (indexed) 0 through n/S_block − 1, where n is the number of output values to be computed. Note that n equals the number of input characters as well. output[b·S_block : (b+1)·S_block − 1] comprises the bth output block. To compute the bth output block, it is sufficient for us to use AC on input[b·S_block − maxl + 1 : (b+1)·S_block − 1], where maxl is the length of the longest pattern (for simplicity, we assume that there is a character that is not the first character of any pattern and set input[−maxl+1 : −1] equal to this character), or mbm on input[b·S_block : (b+1)·S_block + maxl − 2] (we may assume that input[n : n+maxl−2] equals a character that is not the last character of any pattern). So, a block actually processes a string whose length is S_block + maxl − 1 and produces S_block elements of the output. The number of blocks is B = n/S_block.

Suppose that an output block is computed using T threads. Then, each thread could compute S_thread = S_block/T of the output values to be computed by the block. So, thread t (thread indexes begin at 0) of block b could compute output[b·S_block + t·S_thread : b·S_block + t·S_thread + S_thread − 1]. For this, thread t of block b would need to process the substring input[b·S_block + t·S_thread − maxl + 1 : b·S_block + t·S_thread + S_thread − 1] when AC is used and input[b·S_block + t·S_thread : b·S_block + t·S_thread + S_thread + maxl − 2] when mbm is used.

Figure 3 gives the pseudocode for a T-thread computation of block b of the output using the AC DFA. The variables used are self-explanatory and the correctness of the pseudocode follows from the preceding discussion.

Algorithm basic
// compute block b of the output array using T threads and AC
// following is the code for a single thread, thread t, 0 <= t < T
t = thread index;
b = block index;
state = 0; // initial DFA state
outputStartIndex = b * S_block + t * S_thread;
inputStartIndex = outputStartIndex - maxl + 1;
// process input[inputStartIndex : outputStartIndex - 1]
for (int i = inputStartIndex; i < outputStartIndex; i++)
    state = nextState(state, input[i]);
// compute output
for (int i = outputStartIndex; i < outputStartIndex + S_thread; i++)
    output[i] = state = nextState(state, input[i]);
end;

Fig. 3. Overall GPU-to-GPU strategy using AC

As discussed earlier, the arrays input and output reside in device memory. The AC DFA (or the mbm reverse trie, in case the mbm algorithm is used) resides in texture memory because texture memory is cached and is sufficiently large to accommodate the DFA (reverse trie). While shared and constant memories would result in better performance, neither is large enough to accommodate the DFA (reverse trie). Note that each state of a DFA has A transitions, where A is the alphabet size. For ASCII, A = 256. Assuming that the total number of states is fewer than 65536, each state transition of a DFA takes 2 bytes. So, a DFA with d states requires 512d bytes. In the 16KB shared memory that our Tesla has, we can store at best a 32-state DFA.
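As an illustration of this layout, the sketch below stores the d·256 transitions as a flat array of 2-byte state indexes bound to a texture (an assumed implementation of the nextState step of Figure 3, not the paper's code; it uses the legacy texture-reference API of Tesla-era CUDA):

// transition table in texture memory: 256 two-byte transitions per state,
// so a d-state DFA occupies 512d bytes, as computed above
texture<unsigned short, 1, cudaReadModeElementType> dfaTex;

__device__ unsigned short nextState(unsigned short state, unsigned char c)
{
    return tex1Dfetch(dfaTex, state * 256 + c); // cached lookup
}

// host-side setup (error checking omitted)
void bindDFA(const unsigned short *hostDFA, int d)
{
    unsigned short *devDFA;
    cudaMalloc((void **)&devDFA, 512 * d);
    cudaMemcpy(devDFA, hostDFA, 512 * d, cudaMemcpyHostToDevice);
    cudaBindTexture(0, dfaTex, devDFA, 512 * d);
}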
The constant memory on the Tesla is 64KB. So, this can handle, at best, a 128-state DFA. Since the nodes of the mbm reverse trie are as large as a DFA state, it isn't possible to store the reverse trie for any reasonable pattern dictionary in shared or constant memory either. Each of the mbm shift functions, shift1 and shift2, needs 2 bytes per reverse-trie node. So, our shared memory can store these functions when the number of nodes doesn't exceed 4K; constant memory may be used for tries with fewer than 16K nodes. The bad character function B() has 256 entries when the alphabet size is 256. This function may be stored in shared memory.

A nice feature of Algorithm basic is that all T threads that work on a single block can execute in lock-step fashion as there is no divergence in the execution paths of these T threads. This makes it possible for an SM of a GPU to efficiently compute an output block using T threads. With 30 SMs, we can compute 30 output blocks at a time. The pseudocode of Figure 3 does, however, have deficiencies that are expected to result in nonoptimal performance on a GPU. These deficiencies are described below.

Deficiency D1: Since the input array resides in device memory, every reference to the array input requires a device-memory transaction (in this case a read). There are two sources of inefficiency when the read accesses to input are actually made on the Tesla GPU: (1) Our Tesla GPU performs device-memory transactions for a half-warp (16 threads) at a time. The available bandwidth for a single transaction is 128 bytes. Each thread of our code reads 1 byte. So, a half-warp reads 16 bytes. Hence, barring any other limitation of our GPU, our code will utilize 1/8th of the available bandwidth between device memory and an SM. (2) The Tesla is able to coalesce the device-memory transactions from several threads of a half-warp into a single transaction. However, coalescing occurs only when the device-memory accesses of two or more threads in a half-warp lie in the same 128-byte segment of device memory. When S_thread > 128, the values of inputStartIndex for consecutive threads in a half-warp (note that two threads t1 and t2 are in the same half-warp iff ⌊t1/16⌋ = ⌊t2/16⌋) are more than 128 bytes apart. Consequently, for any given value of the loop index i, the read accesses made to the array input by the threads of a half-warp lie in different 128-byte segments and so no coalescing occurs. Although the pseudocode is written to enable all threads to simultaneously access the needed input character from device memory, an actual implementation on the Tesla GPU will serialize these accesses and, in fact, every read from device memory will transmit exactly 1 byte to an SM, resulting in a 1/128 utilization of the available bandwidth.

Deficiency D2: The writes to the array output suffer from deficiencies similar to those identified for the reads from the array input. Assuming that our DFA has no more than 2^16 = 65536 states, each state can be encoded using 2 bytes. So, a half-warp writes 32 bytes when the available bandwidth for a half-warp is 128 bytes. Further, no coalescing takes place as no two threads of a half-warp write to the same 128-byte segment. Hence, the writes get serialized and the utilized bandwidth is 2 bytes per transaction, which is 1/64th of the available bandwidth.

Analysis of Total Work: Using the GPU-to-GPU strategy of Figure 3, we essentially do multipattern searches on B·T strings of length S_thread + maxl − 1 each. With a linear complexity for multipattern search, the total work, TW, is roughly equivalent to that done by a sequential algorithm working on an input string of length

TW = B·T·(S_thread + maxl − 1) = n·(1 + (maxl − 1)/S_thread)

So, our GPU-to-GPU strategy incurs an overhead of ((maxl − 1)/S_thread) × 100% in terms of the effective length of the string that is to be searched. Clearly, this overhead varies substantially with the parameters maxl and S_thread. Suppose that maxl = 17, S_block = 14592, and T = 64 (as in our experiments of Section 7). Then, S_thread = 228 and TW = 1.07n. The overhead is 7%.

5.2 Addressing the Deficiencies

5.2.1 Deficiency D1 — Reading from device memory

A simple way to improve the utilization of the available bandwidth between the device memory and an SM is to have each thread input 16 characters at a time, process these 16 characters, and write the output values for these 16 characters to device memory. For this, we will need to cast the input array from its native data type unsigned char to the data type uint4 as below:

uint4 *inputUint4 = (uint4 *) input;

A variable var of type uint4 is comprised of 4 unsigned 4-byte integers var.x, var.y, var.z, and var.w. The statement

uint4 in4 = inputUint4[i];

reads the 16 bytes input[16i : 16i+15] and stores these in the variable in4, which is assigned space in shared memory. Since the Tesla is able to read up to 128 bits (16 bytes) at a time for each thread, this simple change increases bandwidth utilization for the reading of the input data from 1/128 of capacity to 1/8 of capacity! However, this increase in bandwidth utilization comes with some cost.
To extract the characters from in4 so they may be processed one at a time by our algorithm, we need to do a shift and mask operation on the 4 components of in4. We shall see later that this cost may be avoided by doing a recast to unsigned char.

Since a Tesla thread cannot read more than 128 bits at a time, the only way to improve bandwidth utilization further is to coalesce the accesses of multiple threads in a half-warp. To get full bandwidth utilization, at least 8 threads in a half-warp will need to read uint4s that lie in the same 128-byte segment. However, the data to be processed by different threads do not lie in the same segment. To get around this problem, threads cooperatively read all the data needed to process a block, store this data in shared memory, and finally read and process the data from shared memory. In the pseudocode of Figure 4, T threads cooperatively read the input data for block b. This pseudocode, which is for thread t operating on block b, assumes that S_block and maxl − 1 are divisible by 16 so that a whole number of uint4s is to be read and each read begins at a uint4 boundary (assuming that input[−maxl+1] begins at a uint4 boundary). In each iteration (except possibly the last one), T threads read a consecutive set of T uint4s from device memory to shared memory and each uint4 is 16 input characters. In each iteration (except possibly the last one) of the for loop, a half-warp reads 16 adjacent uint4s for a total of 256 adjacent bytes. If input[−maxl+1] is at a 128-byte boundary of device memory, S_block is a multiple of 128, and T is a multiple of 8, then these 256 bytes fall in two 128-byte segments and can be read with two memory transactions. So, bandwidth utilization is 100%. Although 100% utilization is also obtained using uint2s (now each thread reads 8 bytes at a time rather than 16 and a half-warp reads 128 bytes in a single memory transaction), the observed performance is slightly better when a half-warp reads 256 bytes in 2 memory transactions.
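The shift-and-mask extraction mentioned above may be written as follows (a minimal sketch, assuming the little-endian byte order the Tesla uses and the nextState step of Figure 3):

// process the 16 characters packed in a uint4, least-significant byte first
__device__ void process16(uint4 in4, unsigned short *state)
{
    unsigned int word[4] = {in4.x, in4.y, in4.z, in4.w};
    for (int w = 0; w < 4; w++)
        for (int b = 0; b < 4; b++) {
            unsigned char c = (word[w] >> (8 * b)) & 0xFF; // shift and mask
            *state = nextState(*state, c);                 // DFA step as in Fig. 3
        }
}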

// define space in shared memory to store the input data
__shared__ unsigned char sInput[S_block + maxl - 1];
// typecast to uint4
uint4 *sInputUint4 = (uint4 *) sInput;
// read as uint4s; assume S_block and maxl - 1 are divisible by 16
int numToRead = (S_block + maxl - 1) / 16;
int next = b * S_block / 16 - (maxl - 1) / 16 + t;
// T threads collectively input a block
for (int i = t; i < numToRead; i += T, next += T)
    sInputUint4[i] = inputUint4[next];

Fig. 4. T threads collectively read a block and save it in shared memory

Once we have read the data needed to process a block into shared memory, each thread may generate its share of the output array as in Algorithm basic but with the reads being done from shared memory. Thread t will need sInput[t·S_thread : (t+1)·S_thread + maxl − 2] or sInputUint4[t·S_thread/16 : (t+1)·S_thread/16 + (maxl−1)/16 − 1], depending on whether a thread reads the input data from shared memory as characters or as uint4s. When the latter is done, we need to do shifts and masks to extract the characters from the 4 unsigned integer components of a uint4.

Although the input scheme of Figure 4 succeeds in reading in the data utilizing 100% of the bandwidth between device memory and an SM, there is potential for shared-memory bank conflicts when the threads read the data from shared memory. Shared memory is partitioned into 16 banks. The ith 32-bit word of shared memory is in bank i mod 16. For maximum performance, the threads of a half-warp should access data from different banks. Suppose that S_thread = 224 and sInput begins at a 32-bit word boundary. Let tword = S_thread/4 (tword = 224/4 = 56 for our example) denote the number of 32-bit words processed by a thread exclusive of the additional maxl − 1 characters needed to properly handle the block boundary. In the first iteration of the data processing loop, thread t needs sInput[t·S_thread], 0 ≤ t < T. So, the words accessed by the threads in the half-warp 0 ≤ t < 16 are t·tword, 0 ≤ t < 16, and these fall into banks (t·tword) mod 16, 0 ≤ t < 16. For our example, tword = 56 and (t·56) mod 16 = 0 when t is even and (t·56) mod 16 = 8 when t is odd. Since each bank is accessed 8 times by the half-warp, the reads by a half-warp are serialized to 8 shared-memory accesses. Further, since on each iteration each thread steps right by one character, the bank conflicts remain on every iteration of the process loop. We observe that whenever tword is even, at least threads 0 and 8 access the same bank (bank 0) on each iteration of the process loop. Theorem 1 shows that when tword is odd, there are no shared-memory bank conflicts.

Theorem 1: When tword is odd, (i·tword) mod 16 ≠ (j·tword) mod 16, 0 ≤ i < j < 16.

Proof: The proof is by contradiction. Assume there exist i and j such that 0 ≤ i < j < 16 and (i·tword) mod 16 = (j·tword) mod 16. For this to be true, there must exist nonnegative integers a, b, and c, a < c, 0 ≤ b < 16, such that i·tword = 16a + b and j·tword = 16c + b. So, (j − i)·tword = 16(c − a). Since tword is odd and c − a > 0, j − i must be divisible by 16. However, j − i < 16 and so cannot be divisible by 16. This contradiction implies that our assumption is invalid and the theorem is proved.

It should be noted that even when tword is odd, the input for every block begins at a 128-byte segment of device memory (assuming that the input for the first block begins at a 128-byte segment) provided T is a multiple of 32. To see this, observe that S_block = 4·T·tword, which is a multiple of 128 whenever T is a multiple of 32.
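Theorem 1 is easy to confirm numerically; the small host-side check below (illustrative only, not from the paper) prints the bank hit by each thread of a half-warp in the first iteration:

#include <stdio.h>

// banks touched by threads 0..15 when thread t starts at word t*tword
void printBanks(int tword)
{
    printf("tword = %d:", tword);
    for (int t = 0; t < 16; t++)
        printf(" %2d", (t * tword) % 16);
    printf("\n");
}

int main(void)
{
    printBanks(56); // even: only banks 0 and 8 (8-way conflicts), as above
    printBanks(57); // odd: all 16 banks distinct (Theorem 1)
    return 0;
}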
As noted earlier, since the Tesla schedules threads in warps of size 32, we normally would choose T to be a multiple of 32.

5.2.2 Deficiency D2 — Writing to device memory

We could use the same strategy used to overcome deficiency D1 to improve bandwidth utilization when writing the results to device memory. This would require us to first have each thread write the results it computes to shared memory and then have all threads collectively write the computed results from shared memory to device memory using uint4s. Since the results take twice the space taken by the input, such a strategy would necessitate a reduction in S_block by two-thirds. For example, when maxl = 17 and S_block = 14592, we need 14608 bytes of shared memory for the array sInput. This leaves us with only a small amount of the 16KB shared memory to store any other data that we may need. If we wish to store the results in shared memory as well, we must use a smaller value for S_block: we must reduce S_block to about 14592/3, or 4864, to keep the amount of shared memory used the same. When T = 64, this reduction in block size increases the total work overhead from approximately 7% to approximately 22%.

We can avoid this increase in total work overhead by doing the following: (1) First, each thread processes the first maxl − 1 characters it is to process. This generates no output and so we need no memory to store output. (2) Next, each thread reads the remaining S_thread characters of input data it needs from shared memory to registers. For this, we declare a register array of unsigned integers and typecast sInput to unsigned integer. Since the T threads have a total of 16,384 registers, we have sufficient registers provided S_block ≤ 64K (in reality, S_block would need to be slightly smaller than 64K as registers are needed to store other values such as loop variables). Since total register memory exceeds the size of shared memory, we always have enough register space to save the input data that is in shared memory.

Unless S_block ≤ 4864, we cannot store all the results in shared memory. However, to do 128-byte write transactions to device memory, we need only sets of 64 adjacent results (recall that each result is 2 bytes). So, the shared memory needed to store the results is 128T bytes. Since we are contemplating T = 64, we need only 8K of shared memory to store the results from the processing of 64 characters per thread. Once each thread has processed 64 characters and stored these results in shared memory, we may write the results to device memory.

The total number of outputs generated by a thread is S_thread = 4·tword. These outputs take a total of 8·tword bytes. So, when tword is odd (as required by Theorem 1), the output generated by a thread is a non-integral number of uint4s (recall that each uint4 is 16 bytes). Hence, the output for some of the threads does not begin at a uint4 boundary of the device array output and we cannot write the results to device memory as uint4s. Rather, we need to write as uint2s (a thread generates an integral number, tword, of uint2s). With each thread writing a uint2, it takes 16 threads to write the 128 bytes of output from one thread. So, T threads can write the output generated from the processing of 64 characters/thread in 16 rounds of uint2 writes. One difficulty is that, as noted earlier, when tword is odd, even though the segment of device memory to which the output from a thread is to be written begins at a uint2 boundary, it does not begin at a uint4 boundary. This means also that this segment does not begin at a 128-byte boundary (note that every 128-byte boundary is also a uint4 boundary). So, even though a half-warp of 16 threads is writing to 128 bytes of contiguous device memory, these 128 bytes may not fall within a single 128-byte segment. When this happens, the write is done as two memory transactions.
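The 16-round write scheme just described can be sketched as the following kernel fragment (illustrative, assuming T = 64 and that t is the thread index as in Figure 3; outUint2 is this round's region of the output array viewed as uint2s, and the alignment caveat above still applies):

// each thread has deposited its 64 two-byte results in shared memory
__shared__ unsigned short sResults[64 * 64];      // T*64 results, 8KB total
uint2 *sResUint2 = (uint2 *) sResults;            // 16 uint2s per thread
__syncthreads();                                  // results must be visible to all
int h = t / 16, k = t % 16;                       // half-warp index, lane
for (int r = 0; r < 16; r++) {
    int owner = 16 * h + r;  // thread whose 128-byte output is written this round
    // lane k writes the kth uint2 of owner's results, so the half-warp covers
    // 128 contiguous bytes (1 or 2 transactions, per the alignment discussion)
    outUint2[owner * 16 + k] = sResUint2[owner * 16 + k];
}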
The described procedure to handle 64 characters of input per thread is repeated S_thread/64 times to complete the processing of the entire input block. In case S_thread is not divisible by 64, each thread produces fewer than 64 results in the last round. For example, when S_thread = 228, we have a total of 4 rounds. In each of the first three rounds, each thread processes 64 input characters and produces 64 results. In the last round, each thread processes 36 characters and produces 36 results. In the last round, groups of threads write to contiguous device-memory segments of size 64 or 8 bytes, and some of these segments may span two 128-byte segments of device memory.

As we can see, using an odd tword is required to avoid shared-memory bank conflicts, but using an odd tword (actually, using any tword value that is not a multiple of 16) results in suboptimal writes of the results to device memory. To optimize writes to device memory, we need to use a tword value that is a multiple of 16. Since the Tesla executes threads on an SM in warps of size 32, T would normally be a multiple of 32. Further, to hide memory latency, it is recommended that T be at least 64. With T = 64 and a 16KB shared memory, S_thread can be at most 16384/64 = 256 and so tword can be at most 64. However, since a small amount of shared memory is needed for other purposes, tword < 64. The largest value possible for tword that is a multiple of 16 is therefore 48. The total work, TW, when tword = 48 and maxl = 17 is n·(1 + 16/192) = 1.083n. Compared to the case tword = 57, the total work overhead increases from 7% to 8.3%. Whether we are better off using tword = 48, which results in optimized writes to device memory but shared-memory bank conflicts and larger work overhead, or with tword = 57, which has no shared-memory bank conflicts and lower work overhead but suboptimal writes to device memory, can be determined experimentally.

6 HOST-TO-HOST

6.1 Strategies

Since the Tesla GPU supports asynchronous transfer of data between device memory and pinned host memory, it is possible to overlap the time spent in data transfer to and from the device with the time spent by the GPU in computing the results. However, since there is only 1 I/O channel between the host and the GPU, time spent writing to the GPU cannot be overlapped with the time spent reading from the GPU. There are at least two ways to accomplish the overlap of host-device I/O and GPU computation. In Strategy A (Figure 5), which is given in [7], we have three loops. The first loop asynchronously writes the input data to device memory in segments, the second processes each segment on the GPU, and the third reads the results for each segment back from device memory asynchronously. To ensure that the processing of a segment does not begin before the asynchronous transfer of that segment's data from host to device completes, and also that the reading of the results for a segment begins only after the completion of the processing of the segment, CUDA provides the concept of a stream. Within a stream, tasks are done in sequence. With reference to Figure 5, the number of streams equals the number of segments and the tasks in the ith stream are: write segment i to device memory, process segment i, read the results for segment i from device memory. To get correct results, each segment sent to the device memory must include the additional maxl − 1 characters needed to detect matches that cross segment boundaries. For Strategy A to work, we must have sufficient device memory to accommodate the input data for all segments as well as the results from all segments. Figure 6 gives an alternative strategy, Strategy B, that requires only sufficient device memory for 2 segments (2 input buffers IN0 and IN1 and 2 output buffers OUT0 and OUT1). We could, of course, couple Strategies A and B to obtain a hybrid strategy. We analyze the relative time performance of these two host-to-host strategies in the next subsection.
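In CUDA terms, Strategy A's three loops map onto streams roughly as below (a hedged sketch: S, B, T, segChars, the device/host pointers, and matchKernel are placeholders; hostIn and hostOut are assumed to be pinned memory allocated with cudaMallocHost, devOut is an unsigned short array, and the extra maxl − 1 boundary characters per segment are omitted):

cudaStream_t stream[S];                      // one stream per segment
for (int i = 0; i < S; i++)
    cudaStreamCreate(&stream[i]);
for (int i = 0; i < S; i++)                  // loop 1: async writes to device
    cudaMemcpyAsync(devIn + i * segChars, hostIn + i * segChars,
                    segChars, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < S; i++)                  // loop 2: process each segment
    matchKernel<<<B, T, 0, stream[i]>>>(devIn + i * segChars,
                                        devOut + i * segChars, segChars);
for (int i = 0; i < S; i++)                  // loop 3: async reads of results
    cudaMemcpyAsync(hostOut + i * segChars, devOut + i * segChars,
                    2 * segChars,            // results are 2 bytes per character
                    cudaMemcpyDeviceToHost, stream[i]);
cudaDeviceSynchronize();                     // wait for all streams to drain

Strategy B would instead allocate only the buffers IN0/IN1 and OUT0/OUT1 and swap their roles each iteration, as in Figure 6.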

for (int i = 0; i < numOfSegments; i++)
    Asynchronously write segment i from host to device using stream i;
for (int i = 0; i < numOfSegments; i++)
    Process segment i on the GPU using stream i;
for (int i = 0; i < numOfSegments; i++)
    Asynchronously read segment i results from device using stream i;

Fig. 5. Host-to-host Strategy A

Write segment 0 from host to device buffer IN0;
for (int i = 1; i < numOfSegments; i++) {
    Asynchronously write segment i from host to device buffer IN1;
    Process segment i - 1 on the GPU using IN0 and OUT0;
    Wait for all read/write/compute to complete;
    Asynchronously read segment i - 1 results from OUT0;
    Swap roles of IN0 and IN1;
    Swap roles of OUT0 and OUT1;
}
Process the last segment on the GPU using IN0 and OUT0;
Read the last segment's results from OUT0;

Fig. 6. Host-to-host Strategy B

6.2 Completion Time

Figure 7 summarizes the notation used in our analysis of the completion time of Strategies A and B. For our analysis, we make several simplifying assumptions as below.

1) The time, t_w, to write or copy a segment of input data from the host to the device memory is the same for all segments.
2) The time, t_p, the GPU takes to process a segment of input data and create its corresponding output segment is the same for all segments.
3) The time, t_r, to read or copy a segment of output data from the device to the host memory is the same for all segments.
4) The write, processing, and read for each segment begin at the earliest possible time for the chosen strategy and complete t_w, t_p, and t_r units later, respectively.
5) In every feasible strategy, the relative order of segment writes, processing, and reads is the same and is segment 0, followed by segment 1, ..., and ending with segment s − 1, where s is the number of segments.

s          number of segments
t_w        time to write an input data segment from host to device memory
t_r        time to read an output data segment from device to host memory
t_p        time taken by the GPU to process an input data segment and create the corresponding output segment
T_w        s·t_w (total write time)
T_r        s·t_r (total read time)
T_p        s·t_p (total processing time)
T^w_s(i)   time at which the writing of input segment i to device memory starts
T^p_s(i)   time at which the processing of segment i by the GPU starts
T^r_s(i)   time at which the reading of output segment i to host memory starts
T^w_f(i)   time at which the writing of input segment i to device memory finishes
T^p_f(i)   time at which the processing of segment i by the GPU finishes
T^r_f(i)   time at which the reading of output segment i to host memory finishes
T_A        completion time using Strategy A
T_B        completion time using Strategy B
L          lower bound on completion time

Fig. 7. Notation used in completion time analysis

Writing from the host memory to the device memory uses the same I/O channel/bus as used to read from the device memory to the host memory, and the GPU is necessarily idle when the first input segment is being written to the device memory and when the last output segment is being read from this memory. So, t_w + max{(s − 1)(t_w + t_r), s·t_p} + t_r is a lower bound on the completion time of any host-to-host computing strategy. It is easy to see that when the number of segments s is 1, the completion time for both Strategies A and B is t_w + t_p + t_r, which equals the lower bound. Actually, when s = 1, both strategies are identical and optimal. The analysis of the two strategies for s > 1 is more complex and is done below in Theorems 2 to 5. We note that assumption 4 implies that T^w_f(i) = T^w_s(i) + t_w, T^p_f(i) = T^p_s(i) + t_p, and T^r_f(i) = T^r_s(i) + t_r, 0 ≤ i < s. The completion time is T^r_f(s − 1).
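As a numeric cross-check of the analysis that follows (Theorems 2 through 6), the plain-C sketch below simulates both strategies directly from assumptions 1 through 5; it is illustrative only and assumes 2 ≤ s ≤ 64:

#include <stdio.h>

static double maxd(double a, double b) { return a > b ? a : b; }

/* Strategy A on the single-channel model: all s writes precede the
   first read; processing chains on the GPU; reads chain on the channel. */
double strategyA(int s, double tw, double tp, double tr)
{
    double chan = 0, gpu = 0, rf = 0, pf[64];
    for (int i = 0; i < s; i++) {
        chan += tw;                    /* T^w_f(i) = (i+1)*tw          */
        gpu = maxd(chan, gpu) + tp;    /* T^p_f(i)                     */
        pf[i] = gpu;
    }
    for (int i = 0; i < s; i++) {      /* reads start after all writes */
        rf = maxd(maxd(chan, pf[i]), rf) + tr;
        chan = rf;                     /* channel busy until read ends */
    }
    return rf;                         /* T_A = T^r_f(s-1)             */
}

/* Strategy B: double buffering with a barrier each iteration (Fig. 6);
   agrees with the closed form of Theorem 4 below.                     */
double strategyB(int s, double tw, double tp, double tr)
{
    double t = tw;                     /* write segment 0                     */
    t += maxd(tw, tp);                 /* write 1 || process 0                */
    for (int i = 2; i < s; i++)        /* (read i-2 + write i) || process i-1 */
        t += maxd(tw + tr, tp);
    t += maxd(tp, tr);                 /* process s-1 || read s-2             */
    return t + tr;                     /* read segment s-1                    */
}

int main(void)
{
    /* instance discussed after Theorem 6: s = 4, tr = tp = 3, tw = 1 */
    printf("TA = %g, TB = %g\n", strategyA(4, 1, 3, 3), strategyB(4, 1, 3, 3));
    return 0;  /* prints TA = 16, TB = 18 */
}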
Theorem 2: When s > 1, the completion time, T_A, of Strategy A is:

1) T_w + T_r whenever any of the following holds:
   a) t_w ≥ t_p and t_p ≤ T_r − t_r;
   b) t_w < t_p, T_w − t_w > t_p, and t_r ≥ t_p;
   c) t_w < t_p, T_w − t_w > t_p, t_r < t_p, and no i, 0 ≤ i < s, satisfies t_w + (i + 1)·t_p > T_w + i·t_r.
2) T_w + t_p + t_r when t_w ≥ t_p and t_p > T_r − t_r.
3) t_w + t_p + T_r when t_w < t_p, T_w − t_w ≤ t_p, and t_r > t_p.
4) t_w + T_p + t_r when either of the following holds:
   a) t_w < t_p, T_w − t_w ≤ t_p, and t_r ≤ t_p;
   b) t_w < t_p, T_w − t_w > t_p, t_r < t_p, and some i, 0 ≤ i < s, satisfies t_w + (i + 1)·t_p > T_w + i·t_r.

Proof: It should be easy to see that the conditions listed in the theorem exhaust all possibilities. When Strategy A is used, all the writes to device memory complete before the first read begins (i.e., T^r_s(0) ≥ T^w_f(s − 1)), T^w_s(i) = i·t_w and T^w_f(i) = (i + 1)·t_w for 0 ≤ i < s, and T^r_s(0) ≥ T^w_f(s − 1) = s·t_w = T_w.

When t_w ≥ t_p, T^p_s(i) = T^w_f(i) = (i + 1)·t_w (Figure 8 illustrates this for s = 4). Hence, T^p_f(i) = (i + 1)·t_w + t_p ≤ T^w_f(i + 1), 0 ≤ i < s, where T^w_f(s) is defined to be (s + 1)·t_w. So, T^r_s(0) = max{T_w, T^p_f(0)} = T_w and T^r_s(i) = max{T^r_f(i − 1), T^p_f(i)}, 1 ≤ i < s. Since T^p_f(i) ≤ T_w for i < s − 1, we get T^r_s(i) = T^r_f(i − 1), 1 ≤ i < s − 1. So, T^r_f(s − 2) = T^r_s(s − 2) + t_r = T_w + (s − 1)·t_r = T_w + T_r − t_r. Hence, T^r_s(s − 1) = max{T_w + T_r − t_r, T_w + t_p} and T_A = T^r_f(s − 1) = T^r_s(s − 1) + t_r = max{T_w + T_r, T_w + t_p + t_r}. So, when t_p ≤ T_r − t_r, T_A = T_w + T_r (Figure 8(a)), and when t_p > T_r − t_r, T_A = T_w + t_p + t_r (Figure 8(b)). This proves cases 1a and 2 of the theorem.

Fig. 8. Strategy A, t_w ≥ t_p, s = 4 (cases 1a and 2)

Fig. 9. Strategy A, t_w < t_p, T_w − t_w ≤ t_p, s = 4 (cases 4a and 3)

Fig. 10. Strategy A, t_w < t_p, T_w − t_w > t_p, s = 4 (cases 1b, 1c, and 4b)

When t_w < t_p, T^p_f(i) = t_w + (i + 1)·t_p, 0 ≤ i < s. We consider the two subcases T_w − t_w ≤ t_p and T_w − t_w > t_p. The first of these has two subsubcases of its own: t_r ≤ t_p (theorem case 4a) and t_r > t_p (theorem case 3). These subsubcases are shown in Figure 9 for the case of 4 segments. It is easy to see that T_A = t_w + T_p + t_r when t_r ≤ t_p and T_A = t_w + t_p + T_r when t_r > t_p. The second subcase (t_w < t_p and T_w − t_w > t_p) has two subsubcases as well: t_r ≥ t_p and t_r < t_p. When t_r ≥ t_p (Figure 10(a)), T^r_s(i) = T_w + i·t_r, 0 ≤ i < s, and T_A = T^r_f(s − 1) = T_w + T_r (theorem case 1b). When t_r < t_p and no i, 0 ≤ i < s, satisfies t_w + (i + 1)·t_p > T_w + i·t_r (theorem case 1c), no i, 0 ≤ i < s, satisfies T^p_f(i) > T^r_f(i − 1), where T^r_f(−1) is defined to be T_w. So, T_A = T^r_f(s − 1) = T_w + T_r (Figure 10(b)). When t_r < t_p and some i, 0 ≤ i < s, satisfies t_w + (i + 1)·t_p > T_w + i·t_r (theorem case 4b), some i satisfies T^p_f(i) > T^r_f(i − 1) and T_A = t_w + T_p + t_r (Figure 10(c)).

Theorem 3: The completion time using Strategy A is the minimum possible completion time for every combination of t_w, t_p, and t_r.

Proof: First, we obtain a tighter lower bound L than the bound t_w + max{(s − 1)(t_w + t_r), s·t_p} + t_r provided at the beginning of this section. Since writes and reads are done serially on the same I/O channel, L ≥ T_w + T_r. Since T^w_f(s − 1) ≥ T_w for every strategy, the processing of the last segment cannot begin before T_w. Hence, L ≥ T_w + t_p + t_r. Since the processing of the first segment cannot begin before t_w, the read of the first segment's output cannot complete before t_w + t_p + t_r. The remaining reads require (s − 1)·t_r time and are done after the first read completes. So, the last read cannot complete before t_w + t_p + T_r. Hence, L ≥ t_w + t_p + T_r. Also, since the processing of the first segment cannot begin before t_w, T^p_f(s − 1) ≥ t_w + T_p. Hence, L ≥ t_w + T_p + t_r. Combining all of these bounds on L, we get L ≥ max{T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r}. From Theorem 2, we see that, in all cases, T_A equals one of the expressions T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, and t_w + T_p + t_r. Hence a tight lower bound on the completion time of every strategy is L = max{T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r}.
Strategy A achieves this tight lower bound (Theorem 2) and so obtains the minimum possible completion time.

Theorem 4: When s > 1, the completion time, T_B, of Strategy B is

T_B = t_w + max{t_w, t_p} + (s − 2)·max{t_w + t_r, t_p} + max{t_p, t_r} + t_r

Proof: When the for-loop index i = 1, the read within the loop begins at t_w + max{t_w, t_p}. For 2 ≤ i < s, this read begins at t_w + max{t_w, t_p} + (i − 1)·max{t_w + t_r, t_p}. So, the final read, which is outside the loop, begins at t_w + max{t_w, t_p} + (s − 2)·max{t_w + t_r, t_p} + max{t_p, t_r} and completes t_r units later. Hence, T_B = t_w + max{t_w, t_p} + (s − 2)·max{t_w + t_r, t_p} + max{t_p, t_r} + t_r.

Theorem 5: Strategy B does not guarantee minimum completion time.

Proof: First, we consider two cases in which T_B equals the tight lower bound L of Theorem 3. The first is when t_p = min{t_w, t_p, t_r}. Now, T_B = T_w + T_r = L. The second case is when t_p ≥ t_w + t_r. Now, from Theorem 4, we obtain T_B = t_w + T_p + t_r = L. When s = 3 and t_w > t_p > t_r > 0, T_B = T_w + (s − 1)·t_r + t_p = T_w + T_r + t_p − t_r = T_w + t_p + 2t_r. When Strategy A is used with this data, we are either in case 1a of Theorem 2 and T_A = T_w + T_r = L < T_w + T_r + t_p − t_r = T_B, or we are in case 2 of Theorem 2 and T_A = T_w + t_p + t_r = L < T_w + t_p + 2t_r = T_B.

The next theorem shows that the completion time when using Strategy B is less than 13.33% more than when Strategy A is used.

Theorem 6: (T_B − T_A)/T_A = 0 when s = 2 and (T_B − T_A)/T_A < (s − 2)/(s² − 1) ≤ 2/15 when s > 2.

Proof: We consider 5 cases that together exhaust all possibilities for the relationship among t_w, t_p, and t_r.

1) t_w ≥ t_p and t_r ≥ t_p. From Theorems 2 (case 1a applies as t_p ≤ t_r ≤ (s − 1)·t_r = T_r − t_r) and 4, it follows that T_A = T_B = T_w + T_r. So, (T_B − T_A)/T_A = 0.

2) t_w ≥ t_p and t_r < t_p. From Theorem 4, T_B = T_w + T_r + t_p − t_r. We may be in either case 1a or case 2 of Theorem 2. If we are in case 1a, then T_A = T_w + T_r and T_r − t_r = (s − 1)·t_r ≥ t_p. So, t_r ≥ t_p/(s − 1). (Note that since we also have t_r < t_p, s must be at least 3 for case 1a to apply.) Therefore, t_w + t_r ≥ (1 + 1/(s − 1))·t_p = (s/(s − 1))·t_p. Hence,

(T_B − T_A)/T_A = (t_p − t_r)/(s·(t_w + t_r)) ≤ (t_p·(1 − 1/(s − 1)))/((s²/(s − 1))·t_p) = (s − 2)/s² < (s − 2)/(s² − 1)

If we are in case 2 of Theorem 2, then T_A = T_w + t_p + t_r and t_w ≥ t_p > T_r − t_r = (s − 1)·t_r. So,

(T_B − T_A)/T_A = ((s − 2)·t_r)/(s·t_w + t_p + t_r)

The right hand side equals 0 when s = 2 and, for s > 2, the right hand side is

< ((s − 2)·t_r)/(s·(s − 1)·t_r + (s − 1)·t_r + t_r) = (s − 2)/s² < (s − 2)/(s² − 1)

3) t_w < t_p and t_r ≥ t_p. Now, T_B = T_w + T_r + t_p − t_w. For Strategy A, we are in case 1b, 3, or 4a of Theorem 2. If we are in case 1b, T_A = T_w + T_r = s·(t_w + t_r), t_r ≥ t_p, and (s − 1)·t_w = T_w − t_w > t_p. (Note that since we also have t_w < t_p, s must be at least 3 for case 1b to apply.) So,

(T_B − T_A)/T_A = (t_p − t_w)/(s·(t_w + t_r)) < (t_p·(1 − 1/(s − 1)))/(s·(t_p/(s − 1) + t_p)) = (s − 2)/s² < (s − 2)/(s² − 1)

When case 3 of Theorem 2 applies, T_A = t_w + t_p + T_r and (s − 1)·t_w ≤ t_p < t_r. So,

(T_B − T_A)/T_A = ((s − 2)·t_w)/(t_w + t_p + s·t_r)

The right hand side equals 0 when s = 2 and, for s > 2, the right hand side is < ((s − 2)·t_w)/(s²·t_w) = (s − 2)/s² < (s − 2)/(s² − 1). When case 4a of Theorem 2 applies, T_A = t_w + T_p + t_r, (s − 1)·t_w = T_w − t_w ≤ t_p, and t_r = t_p. So, T_A = t_w + (s + 1)·t_p ≥ t_w + (s + 1)(s − 1)·t_w = s²·t_w and T_B − T_A = (s − 1)·t_w + (s + 1)·t_r − t_w − (s + 1)·t_p = (s − 2)·t_w. Therefore,

(T_B − T_A)/T_A ≤ ((s − 2)·t_w)/(s²·t_w) = (s − 2)/s² < (s − 2)/(s² − 1)

4) t_w < t_p, t_r < t_p, and t_w + t_r ≥ t_p. Now, T_B = T_w + T_r + 2t_p − t_w − t_r. For Strategy A, we are in case 1c, 4a, or 4b of Theorem 2. If we are in case 1c, T_A = T_w + T_r and t_w + s·t_p ≤ s·t_w + (s − 1)·t_r. So, (s − 1)(t_w + t_r) ≥ s·t_p, or t_w + t_r ≥ (s/(s − 1))·t_p. Hence,

(T_B − T_A)/T_A = (2t_p − t_w − t_r)/(s·(t_w + t_r)) ≤ (2t_p − (s/(s − 1))·t_p)/((s²/(s − 1))·t_p) = (s − 2)/s²

The right hand side is 0 when s = 2 and is < (s − 2)/(s² − 1) when s > 2. For both cases 4a and 4b, T_A = t_w + T_p + t_r. When case 4a applies, (s − 1)·t_w ≤ t_p.
Since t_w + t_r ≥ t_p and t_r ≤ t_p, the numerator of (T_B − T_A)/T_A = ((s − 2)(t_w + t_r − t_p))/(t_w + s·t_p + t_r) is at most (s − 2)·t_w. Also, t_r ≥ t_p − t_w ≥ t_p·(1 − 1/(s − 1)) = ((s − 2)/(s − 1))·t_p ≥ (s − 2)·t_w. Hence,

(T_B − T_A)/T_A ≤ ((s − 2)·t_w)/(t_w + s·t_p + t_r)

The right hand side equals 0 when s = 2 and, for s > 2, the right hand side is

< ((s − 2)·t_w)/(t_w + s·(s − 1)·t_w + (s − 2)·t_w) = (s − 2)/(s² − 1)

When case 4b applies, it follows from t_p > t_r and the conditions for case 4b that t_w + s·t_p > T_w + (s − 1)·t_r = s·t_w + (s − 1)·t_r. So, s·t_p > (s − 1)(t_w + t_r) and t_w + t_r < (s/(s − 1))·t_p. Hence, t_w + t_r − t_p < t_p/(s − 1). From this inequality and t_w + t_r ≥ t_p, we get

(T_B − T_A)/T_A = ((s − 2)(t_w + t_r − t_p))/(t_w + s·t_p + t_r) < (((s − 2)/(s − 1))·t_p)/((s + 1)·t_p) = (s − 2)/(s² − 1), s > 2

The right hand side equals 0 when s = 2.

5) t_w < t_p, t_r < t_p, and t_w + t_r < t_p. Now, T_B = t_w + T_p + t_r = T_A. Only cases 1c, 4a, and 4b of Theorem 2 are possible when t_w < t_p and t_r < t_p. However, since we also have t_w + t_r < t_p, (s − 1)·t_p > (s − 1)·t_w + (s − 1)·t_r. So, t_w + (s − 1)·t_p > T_w + (s − 1)·t_r and hence t_w + s·t_p > T_w + (s − 1)·t_r. Therefore, case 1c does not apply and T_A = t_w + T_p + t_r = T_B. So, (T_B − T_A)/T_A = 0.

To see that the bound of Theorem 6 is quite tight, consider the instance s = 4, t_r = t_p = 3, and t_w = 1. For this instance, T_A = t_w + T_p + t_r = 16 (case 4a) and T_B = 18. So, (T_B − T_A)/T_A = 1/8. The instance falls into case 3 (subcase 4a of Theorem 2) of Theorem 6, for which we have shown (T_B − T_A)/T_A ≤ (s − 2)/s² = 1/8.

Strategy B becomes more competitive with Strategy A as the number of segments s increases. For example, when s = 20, (T_B − T_A)/T_A < 18/399 ≈ 0.045.

6.3 Completion Time Using Enhanced GPUs

In this section, we analyze the completion times of Strategies A and B under the assumption that we are using a GPU system that is enhanced so that there are two I/O channels/buses between the host CPU and the GPU and the CPU has a dual-port memory that supports simultaneous reads and writes. In this case, the writing of an input data segment to device memory can be overlapped with the reading of an output data segment from device memory. When s = 1, the enhanced GPU is unable to perform any better than the original GPU and T_A = T_B = t_w + t_p + t_r. Theorems 7 through 11 are the enhanced-GPU analogs of Theorems 2 through 5 for the case s > 1.

Theorem 7: When s > 1, the completion time, T_A, of Strategy A for the enhanced GPU model is

T_A = T_w + t_p + t_r   when t_w ≥ t_p and t_w ≥ t_r
T_A = t_w + T_p + t_r   when t_w < t_p and t_p ≥ t_r
T_A = t_w + t_p + T_r   otherwise

Proof: As for the original GPU model, T^w_s(i) = i·t_w, 0 ≤ i < s. When t_w ≥ t_p, T^p_s(i) = T^w_f(i) = (i + 1)·t_w and T^p_f(i) = T^w_f(i) + t_p = (i + 1)·t_w + t_p. Figure 11(a) shows the schedule of events when s = 4, t_w ≥ t_p, and t_w ≥ t_r. Since t_w ≥ t_r, T^r_s(i) = T^p_f(i) = (i + 1)·t_w + t_p, 0 ≤ i < s. So, T^r_s(s − 1) = s·t_w + t_p and T_A = T^r_f(s − 1) = T_w + t_p + t_r. Figure 11(b) shows the schedule of events when s = 4, t_w ≥ t_p, and t_w < t_r. Since t_w < t_r, T^r_s(i) = T^r_f(i − 1) = T^r_s(i − 1) + t_r, 0 < i < s. Since T^r_f(0) = t_w + t_p + t_r, T^r_s(s − 1) = t_w + t_p + (s − 1)·t_r and T_A = T^r_f(s − 1) = t_w + t_p + T_r.

Fig. 11. Strategy A, enhanced GPU, s = 4: (a) t_w ≥ t_p and t_w ≥ t_r; (b) t_w ≥ t_p and t_w < t_r; (c) t_w < t_p and t_p ≥ t_r; (d) t_w < t_p < t_r

When t_w < t_p and t_p ≥ t_r, T^p_s(i) = t_w + i·t_p, 0 ≤ i < s, and T^r_s(i) = T^p_f(i), 0 ≤ i < s (Figure 11(c)). So, T_A = t_w + s·t_p + t_r = t_w + T_p + t_r. The final case to consider is t_w < t_p < t_r (Figure 11(d)). Now, T^p_s(i) = t_w + i·t_p, 0 ≤ i < s, and T^r_s(i) = t_w + t_p + i·t_r. So, T_A = t_w + t_p + T_r.

Theorem 8: For the enhanced GPU model, the completion time using Strategy A is the minimum possible completion time for every combination of t_w, t_p, and t_r.

Proof: The earliest time the processing of the last segment can begin is T_w. So, T^p_f(s − 1) ≥ T_w + t_p and the completion time L of every strategy is at least T_w + t_p + t_r. Further, the earliest time the read of the first output segment can begin is t_w + t_p. Following this earliest time, the read channel is needed for at least s·t_r time to complete the reads. So, L ≥ t_w + t_p + T_r. Also, since T^p_s(0) ≥ t_w, the earliest time the processing of the last segment can begin is t_w + (s − 1)·t_p. Hence, L ≥ t_w + T_p + t_r.
Combining these lower bounds, we get L ≥ max{T_w + t_p + t_r, t_w + T_p + t_r, t_w + t_p + T_r}. Since T_A equals the derived lower bound, the lower bound is tight, L = max{T_w + t_p + t_r, t_w + T_p + t_r, t_w + t_p + T_r}, and T_A is the minimum possible completion time.

The next theorem shows that the optimal completion time using the original GPU model of Section 6.2 is at most twice that for the enhanced model of this section.

Theorem 9: Let t_w, t_p, t_r, and s define a host-to-host instance. Let C1 and C2, respectively, be the completion times for an optimal host-to-host execution using the original and enhanced GPU models. Then C1 ≤ 2·C2 and this bound is tight.

Proof: From the proofs of Theorems 3 and 8, it follows that C1 = max{T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r} and C2 = max{T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r}. Hence, when C1 ≠ T_w + T_r, C1 = C2 ≤ 2·C2. Assume


CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017 CMSC CMSC : Lecture Greedy Algorithms for Scheduling Tuesday, Sep 9, 0 Reading: Sects.. and. of KT. (Not covered in DPV.) Interval Scheduling: We continue our discussion of greedy algorithms with a number

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Performance Analysis of Lattice QCD Application with APGAS Programming Model

Performance Analysis of Lattice QCD Application with APGAS Programming Model Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models

More information

Research into GPU accelerated pattern matching for applications in computer security

Research into GPU accelerated pattern matching for applications in computer security Research into GPU accelerated pattern matching for applications in computer security November 4, 2009 Alexander Gee age19@student.canterbury.ac.nz Department of Computer Science and Software Engineering

More information

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance

More information

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach Process Scheduling for RTS Dr. Hugh Melvin, Dept. of IT, NUI,G RTS Scheduling Approach RTS typically control multiple parameters concurrently Eg. Flight Control System Speed, altitude, inclination etc..

More information

EDF Feasibility and Hardware Accelerators

EDF Feasibility and Hardware Accelerators EDF Feasibility and Hardware Accelerators Andrew Morton University of Waterloo, Waterloo, Canada, arrmorton@uwaterloo.ca Wayne M. Loucks University of Waterloo, Waterloo, Canada, wmloucks@pads.uwaterloo.ca

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

A CUDA Solver for Helmholtz Equation

A CUDA Solver for Helmholtz Equation Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College

More information

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering TIMING ANALYSIS Overview Circuits do not respond instantaneously to input changes

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

Julian Merten. GPU Computing and Alternative Architecture

Julian Merten. GPU Computing and Alternative Architecture Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg

More information

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain

More information

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Vector Layout in Virtual-Memory Systems for Data-Parallel Computing

Vector Layout in Virtual-Memory Systems for Data-Parallel Computing Vector Layout in Virtual-Memory Systems for Data-Parallel Computing Thomas H. Cormen Department of Mathematics and Computer Science Dartmouth College Abstract In a data-parallel computer with virtual memory,

More information

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,

More information

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Supplemental material Guido Klingbeil, Radek Erban, Mike Giles, and Philip K. Maini This document

More information

COMPUTER SCIENCE TRIPOS

COMPUTER SCIENCE TRIPOS CST.2016.2.1 COMPUTER SCIENCE TRIPOS Part IA Tuesday 31 May 2016 1.30 to 4.30 COMPUTER SCIENCE Paper 2 Answer one question from each of Sections A, B and C, and two questions from Section D. Submit the

More information

POLITECNICO DI MILANO DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS

POLITECNICO DI MILANO DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS POLITECNICO DI MILANO Facoltà di Ingegneria dell Informazione Corso di Laurea in Ingegneria Informatica DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS Relatore: Prof.

More information

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm CSCI 1760 - Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm Shay Mozes Brown University shay@cs.brown.edu Abstract. This report describes parallel Java implementations of

More information

Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers

Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers Jonathan Turner Washington University jon.turner@wustl.edu January 30, 2008 Abstract Crossbar-based switches are commonly used

More information

Sparse solver 64 bit and out-of-core addition

Sparse solver 64 bit and out-of-core addition Sparse solver 64 bit and out-of-core addition Prepared By: Richard Link Brian Yuen Martec Limited 1888 Brunswick Street, Suite 400 Halifax, Nova Scotia B3J 3J8 PWGSC Contract Number: W7707-145679 Contract

More information

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD XVIII International Conference on Water Resources CMWR 2010 J. Carrera (Ed) c CIMNE, Barcelona, 2010 COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD James.E. McClure, Jan F. Prins

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School

More information

Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs

Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Christopher P. Stone, Ph.D. Computational Science and Engineering, LLC Kyle Niemeyer, Ph.D. Oregon State University 2 Outline

More information

GPU Acceleration of BCP Procedure for SAT Algorithms

GPU Acceleration of BCP Procedure for SAT Algorithms GPU Acceleration of BCP Procedure for SAT Algorithms Hironori Fujii 1 and Noriyuki Fujimoto 1 1 Graduate School of Science Osaka Prefecture University 1-1 Gakuencho, Nakaku, Sakai, Osaka 599-8531, Japan

More information

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

ESE 570: Digital Integrated Circuits and VLSI Fundamentals ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 21: April 4, 2017 Memory Overview, Memory Core Cells Penn ESE 570 Spring 2017 Khanna Today! Memory " Classification " ROM Memories " RAM Memory

More information

Scheduling Slack Time in Fixed Priority Pre-emptive Systems

Scheduling Slack Time in Fixed Priority Pre-emptive Systems Scheduling Slack Time in Fixed Priority Pre-emptive Systems R.I.Davis Real-Time Systems Research Group, Department of Computer Science, University of York, England. ABSTRACT This report addresses the problem

More information

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic

More information

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory Randomized Selection on the GPU Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory High Performance Graphics 2011 August 6, 2011 Top k Selection on GPU Output the top k keys

More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

CPU SCHEDULING RONG ZHENG

CPU SCHEDULING RONG ZHENG CPU SCHEDULING RONG ZHENG OVERVIEW Why scheduling? Non-preemptive vs Preemptive policies FCFS, SJF, Round robin, multilevel queues with feedback, guaranteed scheduling 2 SHORT-TERM, MID-TERM, LONG- TERM

More information

Fall 2008 CSE Qualifying Exam. September 13, 2008

Fall 2008 CSE Qualifying Exam. September 13, 2008 Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Average Complexity of Exact and Approximate Multiple String Matching

Average Complexity of Exact and Approximate Multiple String Matching Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science

More information

CIS 4930/6930: Principles of Cyber-Physical Systems

CIS 4930/6930: Principles of Cyber-Physical Systems CIS 4930/6930: Principles of Cyber-Physical Systems Chapter 11 Scheduling Hao Zheng Department of Computer Science and Engineering University of South Florida H. Zheng (CSE USF) CIS 4930/6930: Principles

More information

ALU A functional unit

ALU A functional unit ALU A functional unit that performs arithmetic operations such as ADD, SUB, MPY logical operations such as AND, OR, XOR, NOT on given data types: 8-,16-,32-, or 64-bit values A n-1 A n-2... A 1 A 0 B n-1

More information

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

MICROPROCESSOR REPORT.   THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE MICROPROCESSOR www.mpronline.com REPORT THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE ENERGY COROLLARIES TO AMDAHL S LAW Analyzing the Interactions Between Parallel Execution and Energy Consumption By

More information

Multiplying Products of Prime Powers

Multiplying Products of Prime Powers Problem 1: Multiplying Products of Prime Powers Each positive integer can be expressed (in a unique way, according to the Fundamental Theorem of Arithmetic) as a product of powers of the prime numbers.

More information

Stream Ciphers. Çetin Kaya Koç Winter / 20

Stream Ciphers. Çetin Kaya Koç   Winter / 20 Çetin Kaya Koç http://koclab.cs.ucsb.edu Winter 2016 1 / 20 Linear Congruential Generators A linear congruential generator produces a sequence of integers x i for i = 1,2,... starting with the given initial

More information

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Time. To do. q Physical clocks q Logical clocks

Time. To do. q Physical clocks q Logical clocks Time To do q Physical clocks q Logical clocks Events, process states and clocks A distributed system A collection P of N single-threaded processes (p i, i = 1,, N) without shared memory The processes in

More information

COMPUTER SCIENCE TRIPOS

COMPUTER SCIENCE TRIPOS CST0.2017.2.1 COMPUTER SCIENCE TRIPOS Part IA Thursday 8 June 2017 1.30 to 4.30 COMPUTER SCIENCE Paper 2 Answer one question from each of Sections A, B and C, and two questions from Section D. Submit the

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information

Chapter 3. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 3 <1>

Chapter 3. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 3 <1> Chapter 3 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 3 Chapter 3 :: Topics Introduction Latches and Flip-Flops Synchronous Logic Design Finite

More information

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Finite State Machines Introduction Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Such devices form

More information

Analytical Modeling of Parallel Systems

Analytical Modeling of Parallel Systems Analytical Modeling of Parallel Systems Chieh-Sen (Jason) Huang Department of Applied Mathematics National Sun Yat-sen University Thank Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing

More information

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

ESE 570: Digital Integrated Circuits and VLSI Fundamentals ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 19: March 29, 2018 Memory Overview, Memory Core Cells Today! Charge Leakage/Charge Sharing " Domino Logic Design Considerations! Logic Comparisons!

More information

From Sequential Circuits to Real Computers

From Sequential Circuits to Real Computers 1 / 36 From Sequential Circuits to Real Computers Lecturer: Guillaume Beslon Original Author: Lionel Morel Computer Science and Information Technologies - INSA Lyon Fall 2017 2 / 36 Introduction What we

More information

PSEUDORANDOM numbers are very important in practice

PSEUDORANDOM numbers are very important in practice Proceedings of the Federated Conference on Computer Science and Information Systems pp 571 578 ISBN 978-83-681-51-4 Parallel GPU-accelerated Recursion-based Generators of Pseudorandom Numbers Przemysław

More information

ww.padasalai.net

ww.padasalai.net t w w ADHITHYA TRB- TET COACHING CENTRE KANCHIPURAM SUNDER MATRIC SCHOOL - 9786851468 TEST - 2 COMPUTER SCIENC PG - TRB DATE : 17. 03. 2019 t et t et t t t t UNIT 1 COMPUTER SYSTEM ARCHITECTURE t t t t

More information

Introduction to Languages and Computation

Introduction to Languages and Computation Introduction to Languages and Computation George Voutsadakis 1 1 Mathematics and Computer Science Lake Superior State University LSSU Math 400 George Voutsadakis (LSSU) Languages and Computation July 2014

More information

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8))

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) Tung Chou January 5, 2012 QUAD Stream cipher. Security relies on MQ (Multivariate Quadratics). QUAD The Provably-secure QUAD(q, n, r) Stream Cipher

More information

String Search. 6th September 2018

String Search. 6th September 2018 String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)

More information

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. 1 2 Introduction Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. Defines the precise instants when the circuit is allowed to change

More information

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine First, a look at using OpenACC on WRF subroutine advance_w dynamics routine Second, an estimate of WRF multi-node performance on Cray XK6 with GPU accelerators Based on performance of WRF kernels, what

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/

More information

Calculating Algebraic Signatures Thomas Schwarz, S.J.

Calculating Algebraic Signatures Thomas Schwarz, S.J. Calculating Algebraic Signatures Thomas Schwarz, S.J. 1 Introduction A signature is a small string calculated from a large object. The primary use of signatures is the identification of objects: equal

More information

CMP 338: Third Class

CMP 338: Third Class CMP 338: Third Class HW 2 solution Conversion between bases The TINY processor Abstraction and separation of concerns Circuit design big picture Moore s law and chip fabrication cost Performance What does

More information

Exam Spring Embedded Systems. Prof. L. Thiele

Exam Spring Embedded Systems. Prof. L. Thiele Exam Spring 20 Embedded Systems Prof. L. Thiele NOTE: The given solution is only a proposal. For correctness, completeness, or understandability no responsibility is taken. Sommer 20 Eingebettete Systeme

More information

Remainders. We learned how to multiply and divide in elementary

Remainders. We learned how to multiply and divide in elementary Remainders We learned how to multiply and divide in elementary school. As adults we perform division mostly by pressing the key on a calculator. This key supplies the quotient. In numerical analysis and

More information

! Charge Leakage/Charge Sharing. " Domino Logic Design Considerations. ! Logic Comparisons. ! Memory. " Classification. " ROM Memories.

! Charge Leakage/Charge Sharing.  Domino Logic Design Considerations. ! Logic Comparisons. ! Memory.  Classification.  ROM Memories. ESE 57: Digital Integrated Circuits and VLSI Fundamentals Lec 9: March 9, 8 Memory Overview, Memory Core Cells Today! Charge Leakage/ " Domino Logic Design Considerations! Logic Comparisons! Memory " Classification

More information

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on

More information

Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks

Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks Stefan Metzlaff, Sebastian Weis, and Theo Ungerer Department of Computer Science,

More information

Nikolaj Leischner, Vitaly Osipov, Peter Sanders

Nikolaj Leischner, Vitaly Osipov, Peter Sanders Nikolaj Leichner, Vitaly Oipov, Peter Sander Intitut für Theoretiche Informatik - Algorithmik II Oipov: 1 KITVitaly Univerität de Lande Baden-Württemerg und nationale Groforchungzentrum in der Helmholtz-Gemeinchaft

More information

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose 上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU

More information

Binary Decision Diagrams and Symbolic Model Checking

Binary Decision Diagrams and Symbolic Model Checking Binary Decision Diagrams and Symbolic Model Checking Randy Bryant Ed Clarke Ken McMillan Allen Emerson CMU CMU Cadence U Texas http://www.cs.cmu.edu/~bryant Binary Decision Diagrams Restricted Form of

More information

INF 4130 / /8-2017

INF 4130 / /8-2017 INF 4130 / 9135 28/8-2017 Algorithms, efficiency, and complexity Problem classes Problems can be divided into sets (classes). Problem classes are defined by the type of algorithm that can (or cannot) solve

More information

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0 NEC PerforCache Influence on M-Series Disk Array Behavior and Performance. Version 1.0 Preface This document describes L2 (Level 2) Cache Technology which is a feature of NEC M-Series Disk Array implemented

More information

of 19 23/02/ :23

of 19 23/02/ :23 LOS ONE: cutauleaping: A GPU-Powered Tau-Leaping Stochastic... of 19 23/02/2015 14:23 Marco S. Nobile, Paolo Cazzaniga, Daniela Besozzi, Dario Pescini, Giancarlo Mauri Published: March 24, 2014 DOI: 10.1371/journal.pone.0091963

More information

80386DX. 32-Bit Microprocessor FEATURES: DESCRIPTION: Logic Diagram

80386DX. 32-Bit Microprocessor FEATURES: DESCRIPTION: Logic Diagram 32-Bit Microprocessor 21 1 22 1 2 10 3 103 FEATURES: 32-Bit microprocessor RAD-PAK radiation-hardened agait natural space radiation Total dose hardness: - >100 Krad (Si), dependent upon space mission Single

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

On-line String Matching in Highly Similar DNA Sequences

On-line String Matching in Highly Similar DNA Sequences On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2,ThierryLecroq 1,,MouradElloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Lab Course: distributed data analytics

Lab Course: distributed data analytics Lab Course: distributed data analytics 01. Threading and Parallelism Nghia Duong-Trung, Mohsan Jameel Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim, Germany International

More information