GPU-to-GPU and Host-to-Host Multipattern String Matching On A GPU


GPU-to-GPU and Host-to-Host Multipattern String Matching On A GPU

Xinyan Zha and Sartaj Sahni
Computer and Information Science and Engineering
University of Florida
Gainesville, FL
{xzha, sahni}@cise.ufl.edu

Abstract — We develop GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. For the GPU-to-GPU case, we consider several refinements to a base GPU implementation and measure the performance gain from each refinement. For the host-to-host case, we analyze two strategies to communicate between the host and the GPU and show that one is optimal with respect to run time while the other requires less device memory. Experiments conducted on an NVIDIA Tesla GT200 GPU that has 240 cores running off of a Xeon 2.8GHz quad-core host CPU show that, for the GPU-to-GPU case, our Aho-Corasick GPU adaptation achieves a speedup between 8.5 and 9.5 relative to a single-thread CPU implementation and between 2.4 and 3.2 relative to the best multithreaded implementation. For the host-to-host case, the GPU AC code achieves a speedup of 3.1 relative to a single-threaded CPU implementation. However, the GPU is unable to deliver any speedup relative to the best multithreaded code running on the quad-core host; the measured speedups for the latter case ranged between 0.74 and 1. Early versions of our multipattern Boyer-Moore adaptations ran 7% to 10% slower than corresponding versions of the AC adaptations and we did not refine the multipattern Boyer-Moore codes further.

Keywords: Multipattern string matching, Aho-Corasick, multipattern Boyer-Moore, GPU, CUDA.

1 INTRODUCTION

In multipattern string matching, we are to report all occurrences of a given set or dictionary of patterns in a target string. Multipattern string matching arises in a number of applications including network intrusion detection, digital forensics, business analytics, and natural language processing. For example, the popular open-source network intrusion detection system Snort [22] has a dictionary of several thousand patterns that are matched against the contents of Internet packets, and the open-source file carver Scalpel [18] searches for all occurrences of headers and footers from a dictionary of about 40 header/footer pairs in disks that are many gigabytes in size. In both applications, the performance of the multipattern matching engine is paramount. In the case of Snort, it is necessary to search for thousands of patterns in relatively small packets at Internet speed, while in the case of Scalpel we need to search for tens of patterns in hundreds of gigabytes of disk data. (This research was supported, in part, by the National Science Foundation.)

Snort [22] employs the Aho-Corasick [1] multipattern search method while Scalpel [18] uses the Boyer-Moore single-pattern search algorithm [5]. Since Scalpel uses a single-pattern search algorithm, its run time is linear in the product of the number of patterns in the pattern dictionary and the length of the target string in which the search is being done. Snort, on the other hand, because of its use of an efficient multipattern search algorithm, has a run time that is independent of the number of patterns in the dictionary and linear in the length of the target string.

Several researchers have attempted to improve the performance of multistring matching applications through the use of parallelism. For example, Scarpazza et al. [19], [20] port the deterministic finite automata version of the Aho-Corasick method to the IBM Cell Broadband Engine (CBE) while Zha et al. [28] port a compressed form of the non-deterministic finite automata version of the Aho-Corasick method to the CBE. Jacob et al. [12] port Snort to a GPU. However, in their port, they replace the Aho-Corasick search method employed by Snort with the Knuth-Morris-Pratt [13] single-pattern matching algorithm. Specifically, they search for 16 different patterns in a packet in parallel employing 16 GPU cores. Huang et al. [11] do network intrusion detection on a GPU based on the multipattern search algorithm of Wu and Manber [27]. Smith et al. [21] use deterministic finite automata and extended deterministic finite automata to do regular expression matching on a GPU for intrusion detection applications. Marziale et al. [14] propose the use of GPUs and massive parallelism for in-place file carving. However, Zha and Sahni [29] show that the performance of an in-place file carver is limited by the time required to read data from the disk rather than the time required to search for headers and footers (when a fast multipattern matching algorithm is used). Hence, by doing asynchronous disk reads, the pattern matching time is effectively overlapped by the disk read time and the total time for the in-place carving operation equals that of the disk read time. Therefore, this application cannot benefit from the use of a GPU to accelerate pattern matching.

Our focus in this paper is accelerating the Aho-Corasick and Boyer-Moore multipattern string matching algorithms through the use of a GPU. A GPU operates in traditional master-slave fashion (see [17], for example) in which the GPU is a slave that is attached to a master or host processor under whose direction it operates. Algorithm development for master-slave systems is affected by the location of the input data and where the results are to be left. Generally, four cases arise [24], [25], [26], as below.

1) Slave-to-slave. In this case, the inputs and outputs for the algorithm are in the slave memory. This case arises, for example, when an earlier computation produced results that were left in slave memory and these results are the inputs to the algorithm being developed; further, the results from the algorithm being developed are to be used for subsequent computation by the slave.
2) Host-to-host. Here, the inputs to the algorithm are on the host and the results are to be left on the host. So, the algorithm needs to account for the time it takes to move the inputs to the slave and that to bring the results back to the host.
3) Host-to-slave. The inputs are in the host but the results are to be left in the slave.
4) Slave-to-host. The inputs are in the slave and the results are to be left in the host.

In this paper, we address the first two cases only. In our context, we refer to the first case (slave-to-slave) as GPU-to-GPU.

The remainder of this paper is organized as follows. Section 2 introduces the NVIDIA Tesla architecture. In Sections 3 and 4, respectively, we describe the Aho-Corasick and Boyer-Moore multipattern matching algorithms. Sections 5 and 6 describe our GPU adaptations of these matching algorithms for the GPU-to-GPU and host-to-host cases. Section 7 discusses our experimental results and we conclude in Section 8.

2 THE NVIDIA TESLA ARCHITECTURE

Figure 1 gives the architecture of the NVIDIA GT200 Tesla GPU, which is an example of NVIDIA's general-purpose parallel computing architecture CUDA (Compute Unified Device Architecture) [7]. This GPU comprises 240 scalar processors (SPs) or cores that are organized into 30 streaming multiprocessors (SMs), each comprised of 8 SPs. Each SM has 16KB of on-chip shared memory, 16,384 32-bit registers, and constant and texture caches. Each SM supports up to 1024 active threads. There is also 4GB of global or device memory that is accessible to all 240 SPs.

Fig. 1. NVIDIA GT200 Architecture [23]

The Tesla, like other GPUs, operates as a slave processor to an attached host. In our experimental setup, the host is a 2.8GHz Xeon quad-core processor with 16GB of memory. A CUDA program typically is a C program written for the host. C extensions supported by the CUDA programming environment allow the host to send and receive data to/from the GPU's device memory as well as to invoke C functions (called kernels) that run on the GPU cores. The GPU programming model is Single Instruction Multiple Thread (SIMT). When a kernel is invoked, the user must specify the number of threads to be invoked. This is done by specifying explicitly the number of thread blocks and the number of threads per block.
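For concreteness, a kernel definition and launch look as below (a minimal sketch; matchKernel, devInput, devOutput, and the sizes are placeholders, not code from the paper):

// hypothetical kernel: each thread will compute its share of the output
__global__ void matchKernel(unsigned char *input, unsigned short *output, int n)
{
    int b = blockIdx.x;   // index of this thread's block
    int t = threadIdx.x;  // index of this thread within its block
    // ... per-thread matching work goes here ...
}

// host side: invoke the kernel with B thread blocks of T threads each,
// e.g., B = 30 (one block per SM) and T = 64
matchKernel<<<B, T>>>(devInput, devOutput, n);
cudaDeviceSynchronize();  // wait for the kernel to complete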
CUDA further organizes the threads of a block into warps of 32 threads each; each block of threads is assigned to a single SM, and thread warps are executed synchronously on SMs. While thread divergence within a warp is permitted, when the threads of a warp diverge, the divergent paths are executed serially until they converge.

A CUDA kernel may access different types of memory, each with different capacity, latency, and caching properties. We summarize the memory hierarchy below.

Device memory: 4GB of device memory is available. This memory can be read and written directly by all threads. However, device-memory access entails a high latency (400 to 600 clock cycles). The thread scheduler attempts to hide this latency by scheduling arithmetic operations that are ready to be performed while waiting for the access to device memory to complete [7]. Device memory is not cached.

Constant memory: Constant memory is read-only memory space that is shared by all threads. Constant memory is cached and is limited to 64KB.

Shared memory: Each SM has 16KB of shared memory. Shared memory is divided into 16 banks of 32-bit words. When there are no bank conflicts, the threads of a warp can access shared memory as fast as they can access registers [7].

Texture memory: Texture memory, like constant memory, is read-only memory space that is accessible to all threads using device functions called texture fetches. Texture memory is initialized at the host side and is read at the device side. Texture memory is cached.

Pinned memory (also known as page-locked host memory): This is part of the host memory. Data transfer between pinned and device memory is faster than between pageable host memory and device memory. Also, this data transfer can be done concurrently with kernel execution. However, since allocating part of the host memory as pinned reduces the amount of physical memory available to the operating system for paging, allocating too much page-locked memory reduces overall system performance [7].

3 THE AHO-CORASICK ALGORITHM

There are two versions, nondeterministic and deterministic, of the Aho-Corasick (AC) [1] multipattern matching algorithm. Both versions use a finite state machine/automaton to represent the dictionary of patterns. In the deterministic version (DFA), which is the version we use in this paper, each state of the finite automaton has a well-defined transition for every character in the alphabet as well as a list of matched patterns. The search starts with the automaton start state designated as the current state and the first character in the text string, S, that is being searched designated as the current character. At each step, a state transition is made by examining the current character of S: a transition is made to the state corresponding to the current character, and the next character of S becomes the current character. Whenever a state transition is made, the patterns in the list of matched patterns for the reached state are output along with the position in S of the current character. This output is sufficient to identify all occurrences, in S, of all dictionary patterns. Aho and Corasick [1] show how to compute the DFA for a set of patterns. The number of state transitions made by the DFA when searching for matches in a string of length n is n.

4 MULTIPATTERN BOYER-MOORE ALGORITHM

The Boyer-Moore pattern matching algorithm [5] was developed to find all occurrences of a pattern P in a string S. This algorithm begins by positioning the first character of P at the first character of S. This results in a pairing of the first |P| characters of S with the characters of P. The characters in each pair are compared beginning with those in the rightmost pair. If all pairs of characters match, we have found an occurrence of P in S, and P is shifted right by 1 character (or by |P| if only nonoverlapping matches are to be found). Otherwise, we stop at the rightmost pair (or first pair, since we compare right to left) where there is a mismatch and use the bad character function for P to determine how many characters to shift P right before re-examining pairs of characters from P and S for a match. More specifically, the bad character function for P gives the distance from the end of P of the last occurrence of each possible character that may appear in S. In practice, many of the shifts in the bad character function of a pattern are close to the length, |P|, of the pattern P, making the Boyer-Moore algorithm a very fast search algorithm. In fact, when the alphabet size is large, the average run time of the Boyer-Moore algorithm is O(|S|/|P|). Galil [9] has proposed a variation for which the worst-case run time is O(|S|). Horspool [10] proposes a simplification to the Boyer-Moore algorithm whose performance is about the same as that of the Boyer-Moore algorithm. Several multipattern extensions to the Boyer-Moore search algorithm have been proposed [2], [4], [8], [27].
All of these multipattern search algorithms extend the basic bad character function employed by the Boyer-Moore algorithm to a bad character function for a set of patterns. This is done by combining the bad character functions for the individual patterns to be searched into a single bad character function for the entire set of patterns. The combined bad character function B for a set of p patterns has B(c) = min{B_i(c), 1 ≤ i ≤ p} for each character c in the alphabet. Here, B_i is the bad character function for the ith pattern. The Set-wise Boyer-Moore algorithm of [8] performs multipattern matching using this combined bad character function. The multipattern search algorithms of [2], [4], [27] employ additional techniques to speed the search further. The average run time of the algorithms of [2], [4], [27] is O(|S|/minL), where minL is the length of the shortest pattern. The multipattern Boyer-Moore algorithm used by us is due to [4]. This algorithm employs two additional functions, shift1 and shift2, and a trie, called the reverse trie, which contains the reverse of the patterns.

5 GPU-TO-GPU

5.1 Strategy

The input to the multipattern matcher is a character array input and the output is an array output of states or reverse-trie node indexes (or pointers). Both arrays reside in device memory. When the Aho-Corasick (AC) algorithm is used, output[i] gives the state of the AC DFA following the processing of input[i]. Since every state of the AC DFA contains a list of patterns that are matched when this state is reached, output[i] enables us to determine all matching patterns that end at input character i. When the multipattern Boyer-Moore (mbm) algorithm is used, output[i] is the last reverse-trie node visited over all examinations of input[i]. Using this information and the pattern list stored in the trie node, we may determine all pattern matches that begin at input[i]. If we assume that the number of states in the AC DFA as well as the number of nodes in the mbm reverse trie is no more than 65536, a state/node index can be encoded using two bytes and the size of the output array is twice that of the input array.

n          number of characters in the string to be searched
maxl       length of the longest pattern
S_block    number of input characters for which a thread block computes output
B          number of blocks = n/S_block
T          number of threads in a thread block
S_thread   number of input characters for which a thread computes output = S_block/T
tword      S_thread/4
TW         total work = effective string length processed

Fig. 2. GPU-to-GPU notation

Our computational strategy is to partition the output array into blocks of size S_block (Figure 2 summarizes the notation used in this section). The blocks are numbered (indexed) 0 through n/S_block − 1, where n is the number of output values to be computed. Note that n equals the number of input characters as well. output[b·S_block : (b+1)·S_block − 1] comprises the bth output block. To compute the bth output block, it is sufficient for us to use AC on input[b·S_block − maxl + 1 : (b+1)·S_block − 1], where maxl is the length of the longest pattern (for simplicity, we assume that there is a character that is not the first character of any pattern and set input[−maxl+1 : −1] equal to this character), or mbm on input[b·S_block : (b+1)·S_block + maxl − 2] (we may assume that input[n : n+maxl−2] equals a character that is not the last character of any pattern). So, a block actually processes a string whose length is S_block + maxl − 1 and produces S_block elements of the output. The number of blocks is B = n/S_block.

Suppose that an output block is computed using T threads. Then, each thread could compute S_thread = S_block/T of the output values to be computed by the block. So, thread t (thread indexes begin at 0) of block b could compute output[b·S_block + t·S_thread : b·S_block + t·S_thread + S_thread − 1]. For this, thread t of block b would need to process the substring input[b·S_block + t·S_thread − maxl + 1 : b·S_block + t·S_thread + S_thread − 1] when AC is used and input[b·S_block + t·S_thread : b·S_block + t·S_thread + S_thread + maxl − 2] when mbm is used.

Figure 3 gives the pseudocode for a T-thread computation of block b of the output using the AC DFA. The variables used are self-explanatory and the correctness of the pseudocode follows from the preceding discussion.

Algorithm basic
// compute block b of the output array using T threads and AC
// following is the code for a single thread, thread t, 0 <= t < T
t = thread index;
b = block index;
state = 0; // initial DFA state
outputStartIndex = b * S_block + t * S_thread;
inputStartIndex = outputStartIndex - maxl + 1;
// process input[inputStartIndex : outputStartIndex - 1]
for (int i = inputStartIndex; i < outputStartIndex; i++)
    state = nextState(state, input[i]);
// compute output
for (int i = outputStartIndex; i < outputStartIndex + S_thread; i++)
    output[i] = state = nextState(state, input[i]);
end;

Fig. 3. Overall GPU-to-GPU strategy using AC

As discussed earlier, the arrays input and output reside in device memory. The AC DFA (or the mbm reverse trie, in case the mbm algorithm is used) resides in texture memory because texture memory is cached and is sufficiently large to accommodate the DFA (reverse trie). While shared and constant memories would result in better performance, neither is large enough to accommodate the DFA (reverse trie). Note that each state of a DFA has A transitions, where A is the alphabet size. For ASCII, A = 256. Assuming that the total number of states is fewer than 65536, each state transition of a DFA takes 2 bytes. So, a DFA with d states requires 512d bytes. In the 16KB shared memory that our Tesla has, we can store at best a 32-state DFA.
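As an illustration of this layout, the sketch below stores the d·256 transitions as a flat array of 2-byte state indexes bound to a texture (an assumed implementation of the nextState step of Figure 3, not the paper's code; it uses the legacy texture-reference API of Tesla-era CUDA):

// transition table in texture memory: 256 two-byte transitions per state,
// so a d-state DFA occupies 512d bytes, as computed above
texture<unsigned short, 1, cudaReadModeElementType> dfaTex;

__device__ unsigned short nextState(unsigned short state, unsigned char c)
{
    return tex1Dfetch(dfaTex, state * 256 + c); // cached lookup
}

// host-side setup (error checking omitted)
void bindDFA(const unsigned short *hostDFA, int d)
{
    unsigned short *devDFA;
    cudaMalloc((void **)&devDFA, 512 * d);
    cudaMemcpy(devDFA, hostDFA, 512 * d, cudaMemcpyHostToDevice);
    cudaBindTexture(0, dfaTex, devDFA, 512 * d);
}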
The constant memory on the Tesla is 64KB. So, this can handle, at best, a 128-state DFA. Since the nodes of the mbm reverse trie are as large as a DFA state, it isn't possible to store the reverse trie for any reasonable pattern dictionary in shared or constant memory either. Each of the mbm shift functions, shift1 and shift2, needs 2 bytes per reverse-trie node. So, our shared memory can store these functions when the number of nodes doesn't exceed 4K; constant memory may be used for tries with fewer than 16K nodes. The bad character function B() has 256 entries when the alphabet size is 256. This function may be stored in shared memory.

A nice feature of Algorithm basic is that all T threads that work on a single block can execute in lock-step fashion as there is no divergence in the execution paths of these T threads. This makes it possible for an SM of a GPU to efficiently compute an output block using T threads. With 30 SMs, we can compute 30 output blocks at a time. The pseudocode of Figure 3 does, however, have deficiencies that are expected to result in nonoptimal performance on a GPU. These deficiencies are described below.

Deficiency D1: Since the input array resides in device memory, every reference to the array input requires a device-memory transaction (in this case a read). There are two sources of inefficiency when the read accesses to input are actually made on the Tesla GPU: (1) Our Tesla GPU performs device-memory transactions for a half-warp (16 threads) at a time. The available bandwidth for a single transaction is 128 bytes. Each thread of our code reads 1 byte. So, a half-warp reads 16 bytes. Hence, barring any other limitation of our GPU, our code will utilize 1/8th of the available bandwidth between device memory and an SM. (2) The Tesla is able to coalesce the device-memory transactions from several threads of a half-warp into a single transaction. However, coalescing occurs only when the device-memory accesses of two or more threads in a half-warp lie in the same 128-byte segment of device memory. When S_thread > 128, the values of inputStartIndex for consecutive threads in a half-warp (note that two threads t1 and t2 are in the same half-warp iff ⌊t1/16⌋ = ⌊t2/16⌋) are more than 128 bytes apart. Consequently, for any given value of the loop index i, the read accesses made to the array input by the threads of a half-warp lie in different 128-byte segments and so no coalescing occurs. Although the pseudocode is written to enable all threads to simultaneously access the needed input character from device memory, an actual implementation on the Tesla GPU will serialize these accesses and, in fact, every read from device memory will transmit exactly 1 byte to an SM, resulting in a 1/128 utilization of the available bandwidth.

Deficiency D2: The writes to the array output suffer from deficiencies similar to those identified for the reads from the array input. Assuming that our DFA has no more than 2^16 = 65536 states, each state can be encoded using 2 bytes. So, a half-warp writes 32 bytes when the available bandwidth for a half-warp is 128 bytes. Further, no coalescing takes place as no two threads of a half-warp write to the same 128-byte segment. Hence, the writes get serialized and the utilized bandwidth is 2 bytes per transaction, which is 1/64th of the available bandwidth.

Analysis of Total Work: Using the GPU-to-GPU strategy of Figure 3, we essentially do multipattern searches on B·T strings of length S_thread + maxl − 1 each. With a linear complexity for multipattern search, the total work, TW, is roughly equivalent to that done by a sequential algorithm working on an input string of length

TW = B·T·(S_thread + maxl − 1) = n·(1 + (maxl − 1)/S_thread)

So, our GPU-to-GPU strategy incurs an overhead of ((maxl − 1)/S_thread) × 100% in terms of the effective length of the string that is to be searched. Clearly, this overhead varies substantially with the parameters maxl and S_thread. Suppose that maxl = 17, S_block = 14592, and T = 64 (as in our experiments of Section 7). Then, S_thread = 228 and TW = 1.07n. The overhead is 7%.

5.2 Addressing the Deficiencies

5.2.1 Deficiency D1 — Reading from device memory

A simple way to improve the utilization of the available bandwidth between the device memory and an SM is to have each thread input 16 characters at a time, process these 16 characters, and write the output values for these 16 characters to device memory. For this, we will need to cast the input array from its native data type unsigned char to the data type uint4 as below:

uint4 *inputUint4 = (uint4 *) input;

A variable var of type uint4 is comprised of 4 unsigned 4-byte integers var.x, var.y, var.z, and var.w. The statement

uint4 in4 = inputUint4[i];

reads the 16 bytes input[16i : 16i+15] and stores these in the variable in4, which is assigned space in shared memory. Since the Tesla is able to read up to 128 bits (16 bytes) at a time for each thread, this simple change increases bandwidth utilization for the reading of the input data from 1/128 of capacity to 1/8 of capacity! However, this increase in bandwidth utilization comes with some cost.
To extract the characters from in4 so they may be processed one at a time by our algorithm, we need to do a shift and mask operation on the 4 components of in4. We shall see later that this cost may be avoided by doing a recast to unsigned char.

Since a Tesla thread cannot read more than 128 bits at a time, the only way to improve bandwidth utilization further is to coalesce the accesses of multiple threads in a half-warp. To get full bandwidth utilization, at least 8 threads in a half-warp will need to read uint4s that lie in the same 128-byte segment. However, the data to be processed by different threads do not lie in the same segment. To get around this problem, threads cooperatively read all the data needed to process a block, store this data in shared memory, and finally read and process the data from shared memory. In the pseudocode of Figure 4, T threads cooperatively read the input data for block b. This pseudocode, which is for thread t operating on block b, assumes that S_block and maxl − 1 are divisible by 16 so that a whole number of uint4s is to be read and each read begins at a uint4 boundary (assuming that input[−maxl+1] begins at a uint4 boundary). In each iteration (except possibly the last one), T threads read a consecutive set of T uint4s from device memory to shared memory and each uint4 is 16 input characters. In each iteration (except possibly the last one) of the for loop, a half-warp reads 16 adjacent uint4s for a total of 256 adjacent bytes. If input[−maxl+1] is at a 128-byte boundary of device memory, S_block is a multiple of 128, and T is a multiple of 8, then these 256 bytes fall in two 128-byte segments and can be read with two memory transactions. So, bandwidth utilization is 100%. Although 100% utilization is also obtained using uint2s (now each thread reads 8 bytes at a time rather than 16 and a half-warp reads 128 bytes in a single memory transaction), the observed performance is slightly better when a half-warp reads 256 bytes in 2 memory transactions.
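The shift-and-mask extraction mentioned above may be written as follows (a minimal sketch, assuming the little-endian byte order the Tesla uses and the nextState step of Figure 3):

// process the 16 characters packed in a uint4, least-significant byte first
__device__ void process16(uint4 in4, unsigned short *state)
{
    unsigned int word[4] = {in4.x, in4.y, in4.z, in4.w};
    for (int w = 0; w < 4; w++)
        for (int b = 0; b < 4; b++) {
            unsigned char c = (word[w] >> (8 * b)) & 0xFF; // shift and mask
            *state = nextState(*state, c);                 // DFA step as in Fig. 3
        }
}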

// define space in shared memory to store the input data
__shared__ unsigned char sInput[S_block + maxl - 1];
// typecast to uint4
uint4 *sInputUint4 = (uint4 *) sInput;
// read as uint4s; assume S_block and maxl - 1 are divisible by 16
int numToRead = (S_block + maxl - 1) / 16;
int next = b * S_block / 16 - (maxl - 1) / 16 + t;
// T threads collectively input a block
for (int i = t; i < numToRead; i += T, next += T)
    sInputUint4[i] = inputUint4[next];

Fig. 4. T threads collectively read a block and save it in shared memory

Once we have read the data needed to process a block into shared memory, each thread may generate its share of the output array as in Algorithm basic but with the reads being done from shared memory. Thread t will need sInput[t·S_thread : (t+1)·S_thread + maxl − 2] or sInputUint4[t·S_thread/16 : (t+1)·S_thread/16 + (maxl−1)/16 − 1], depending on whether a thread reads the input data from shared memory as characters or as uint4s. When the latter is done, we need to do shifts and masks to extract the characters from the 4 unsigned integer components of a uint4.

Although the input scheme of Figure 4 succeeds in reading in the data utilizing 100% of the bandwidth between device memory and an SM, there is potential for shared-memory bank conflicts when the threads read the data from shared memory. Shared memory is partitioned into 16 banks. The ith 32-bit word of shared memory is in bank i mod 16. For maximum performance, the threads of a half-warp should access data from different banks. Suppose that S_thread = 224 and sInput begins at a 32-bit word boundary. Let tword = S_thread/4 (tword = 224/4 = 56 for our example) denote the number of 32-bit words processed by a thread exclusive of the additional maxl − 1 characters needed to properly handle the block boundary. In the first iteration of the data processing loop, thread t needs sInput[t·S_thread], 0 ≤ t < T. So, the words accessed by the threads in the half-warp 0 ≤ t < 16 are t·tword, 0 ≤ t < 16, and these fall into banks (t·tword) mod 16, 0 ≤ t < 16. For our example, tword = 56 and (t·56) mod 16 = 0 when t is even and (t·56) mod 16 = 8 when t is odd. Since each bank is accessed 8 times by the half-warp, the reads by a half-warp are serialized to 8 shared-memory accesses. Further, since on each iteration each thread steps right by one character, the bank conflicts remain on every iteration of the process loop. We observe that whenever tword is even, at least threads 0 and 8 access the same bank (bank 0) on each iteration of the process loop. Theorem 1 shows that when tword is odd, there are no shared-memory bank conflicts.

Theorem 1: When tword is odd, (i·tword) mod 16 ≠ (j·tword) mod 16, 0 ≤ i < j < 16.

Proof: The proof is by contradiction. Assume there exist i and j such that 0 ≤ i < j < 16 and (i·tword) mod 16 = (j·tword) mod 16. For this to be true, there must exist nonnegative integers a, b, and c, a < c, 0 ≤ b < 16, such that i·tword = 16a + b and j·tword = 16c + b. So, (j − i)·tword = 16(c − a). Since tword is odd and c − a > 0, j − i must be divisible by 16. However, j − i < 16 and so cannot be divisible by 16. This contradiction implies that our assumption is invalid and the theorem is proved.

It should be noted that even when tword is odd, the input for every block begins at a 128-byte segment of device memory (assuming that the input for the first block begins at a 128-byte segment) provided T is a multiple of 32. To see this, observe that S_block = 4·T·tword, which is a multiple of 128 whenever T is a multiple of 32.
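Theorem 1 is easy to confirm numerically; the small host-side check below (illustrative only, not from the paper) prints the bank hit by each thread of a half-warp in the first iteration:

#include <stdio.h>

// banks touched by threads 0..15 when thread t starts at word t*tword
void printBanks(int tword)
{
    printf("tword = %d:", tword);
    for (int t = 0; t < 16; t++)
        printf(" %2d", (t * tword) % 16);
    printf("\n");
}

int main(void)
{
    printBanks(56); // even: only banks 0 and 8 (8-way conflicts), as above
    printBanks(57); // odd: all 16 banks distinct (Theorem 1)
    return 0;
}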
As noted earlier, since the Tesla schedules threads in warps of size 32, we normally would choose T to be a multiple of 32.

5.2.2 Deficiency D2 — Writing to device memory

We could use the same strategy used to overcome deficiency D1 to improve bandwidth utilization when writing the results to device memory. This would require us to first have each thread write the results it computes to shared memory and then have all threads collectively write the computed results from shared memory to device memory using uint4s. Since the results take twice the space taken by the input, such a strategy would necessitate a reduction in S_block by two-thirds. For example, when maxl = 17 and S_block = 14592, we need 14608 bytes of shared memory for the array sInput. This leaves us with only a small amount of the 16KB shared memory to store any other data that we may need. If we wish to store the results in shared memory as well, we must use a smaller value for S_block: we must reduce S_block to about 14592/3, or 4864, to keep the amount of shared memory used the same. When T = 64, this reduction in block size increases the total work overhead from approximately 7% to approximately 22%.

We can avoid this increase in total work overhead by doing the following: (1) First, each thread processes the first maxl − 1 characters it is to process. This generates no output and so we need no memory to store output. (2) Next, each thread reads the remaining S_thread characters of input data it needs from shared memory to registers. For this, we declare a register array of unsigned integers and typecast sInput to unsigned integer. Since the T threads have a total of 16,384 registers, we have sufficient registers provided S_block ≤ 64K (in reality, S_block would need to be slightly smaller than 64K as registers are needed to store other values such as loop variables). Since total register memory exceeds the size of shared memory, we always have enough register space to save the input data that is in shared memory.

Unless S_block ≤ 4864, we cannot store all the results in shared memory. However, to do 128-byte write transactions to device memory, we need only sets of 64 adjacent results (recall that each result is 2 bytes). So, the shared memory needed to store the results is 128T bytes. Since we are contemplating T = 64, we need only 8K of shared memory to store the results from the processing of 64 characters per thread. Once each thread has processed 64 characters and stored these results in shared memory, we may write the results to device memory.

The total number of outputs generated by a thread is S_thread = 4·tword. These outputs take a total of 8·tword bytes. So, when tword is odd (as required by Theorem 1), the output generated by a thread is a non-integral number of uint4s (recall that each uint4 is 16 bytes). Hence, the output for some of the threads does not begin at a uint4 boundary of the device array output and we cannot write the results to device memory as uint4s. Rather, we need to write as uint2s (a thread generates an integral number, tword, of uint2s). With each thread writing a uint2, it takes 16 threads to write the 128 bytes of output from one thread. So, T threads can write the output generated from the processing of 64 characters/thread in 16 rounds of uint2 writes. One difficulty is that, as noted earlier, when tword is odd, even though the segment of device memory to which the output from a thread is to be written begins at a uint2 boundary, it does not begin at a uint4 boundary. This means also that this segment does not begin at a 128-byte boundary (note that every 128-byte boundary is also a uint4 boundary). So, even though a half-warp of 16 threads is writing to 128 bytes of contiguous device memory, these 128 bytes may not fall within a single 128-byte segment. When this happens, the write is done as two memory transactions.
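The 16-round write scheme just described can be sketched as the following kernel fragment (illustrative, assuming T = 64 and that t is the thread index as in Figure 3; outUint2 is this round's region of the output array viewed as uint2s, and the alignment caveat above still applies):

// each thread has deposited its 64 two-byte results in shared memory
__shared__ unsigned short sResults[64 * 64];      // T*64 results, 8KB total
uint2 *sResUint2 = (uint2 *) sResults;            // 16 uint2s per thread
__syncthreads();                                  // results must be visible to all
int h = t / 16, k = t % 16;                       // half-warp index, lane
for (int r = 0; r < 16; r++) {
    int owner = 16 * h + r;  // thread whose 128-byte output is written this round
    // lane k writes the kth uint2 of owner's results, so the half-warp covers
    // 128 contiguous bytes (1 or 2 transactions, per the alignment discussion)
    outUint2[owner * 16 + k] = sResUint2[owner * 16 + k];
}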
The described procedure to handle 64 characters of input per thread is repeated S_thread/64 times to complete the processing of the entire input block. In case S_thread is not divisible by 64, each thread produces fewer than 64 results in the last round. For example, when S_thread = 228, we have a total of 4 rounds. In each of the first three rounds, each thread processes 64 input characters and produces 64 results. In the last round, each thread processes 36 characters and produces 36 results. In the last round, groups of threads write to contiguous device-memory segments of size 64 or 8 bytes, and some of these segments may span two 128-byte segments of device memory.

As we can see, using an odd tword is required to avoid shared-memory bank conflicts, but using an odd tword (actually, using any tword value that is not a multiple of 16) results in suboptimal writes of the results to device memory. To optimize writes to device memory, we need to use a tword value that is a multiple of 16. Since the Tesla executes threads on an SM in warps of size 32, T would normally be a multiple of 32. Further, to hide memory latency, it is recommended that T be at least 64. With T = 64 and a 16KB shared memory, S_thread can be at most 16384/64 = 256 and so tword can be at most 64. However, since a small amount of shared memory is needed for other purposes, tword < 64. The largest value possible for tword that is a multiple of 16 is therefore 48. The total work, TW, when tword = 48 and maxl = 17 is n·(1 + 16/192) = 1.083n. Compared to the case tword = 57, the total work overhead increases from 7% to 8.3%. Whether we are better off using tword = 48, which results in optimized writes to device memory but shared-memory bank conflicts and larger work overhead, or with tword = 57, which has no shared-memory bank conflicts and lower work overhead but suboptimal writes to device memory, can be determined experimentally.

6 HOST-TO-HOST

6.1 Strategies

Since the Tesla GPU supports asynchronous transfer of data between device memory and pinned host memory, it is possible to overlap the time spent in data transfer to and from the device with the time spent by the GPU in computing the results. However, since there is only 1 I/O channel between the host and the GPU, time spent writing to the GPU cannot be overlapped with the time spent reading from the GPU. There are at least two ways to accomplish the overlap of host-device I/O and GPU computation. In Strategy A (Figure 5), which is given in [7], we have three loops. The first loop asynchronously writes the input data to device memory in segments, the second processes each segment on the GPU, and the third reads the results for each segment back from device memory asynchronously. To ensure that the processing of a segment does not begin before the asynchronous transfer of that segment's data from host to device completes, and also that the reading of the results for a segment begins only after the completion of the processing of the segment, CUDA provides the concept of a stream. Within a stream, tasks are done in sequence. With reference to Figure 5, the number of streams equals the number of segments and the tasks in the ith stream are: write segment i to device memory, process segment i, read the results for segment i from device memory. To get correct results, each segment sent to the device memory must include the additional maxl − 1 characters needed to detect matches that cross segment boundaries. For Strategy A to work, we must have sufficient device memory to accommodate the input data for all segments as well as the results from all segments. Figure 6 gives an alternative strategy, Strategy B, that requires only sufficient device memory for 2 segments (2 input buffers IN0 and IN1 and 2 output buffers OUT0 and OUT1). We could, of course, couple Strategies A and B to obtain a hybrid strategy. We analyze the relative time performance of these two host-to-host strategies in the next subsection.
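In CUDA terms, Strategy A's three loops map onto streams roughly as below (a hedged sketch: S, B, T, segChars, the device/host pointers, and matchKernel are placeholders; hostIn and hostOut are assumed to be pinned memory allocated with cudaMallocHost, devOut is an unsigned short array, and the extra maxl − 1 boundary characters per segment are omitted):

cudaStream_t stream[S];                      // one stream per segment
for (int i = 0; i < S; i++)
    cudaStreamCreate(&stream[i]);
for (int i = 0; i < S; i++)                  // loop 1: async writes to device
    cudaMemcpyAsync(devIn + i * segChars, hostIn + i * segChars,
                    segChars, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < S; i++)                  // loop 2: process each segment
    matchKernel<<<B, T, 0, stream[i]>>>(devIn + i * segChars,
                                        devOut + i * segChars, segChars);
for (int i = 0; i < S; i++)                  // loop 3: async reads of results
    cudaMemcpyAsync(hostOut + i * segChars, devOut + i * segChars,
                    2 * segChars,            // results are 2 bytes per character
                    cudaMemcpyDeviceToHost, stream[i]);
cudaDeviceSynchronize();                     // wait for all streams to drain

Strategy B would instead allocate only the buffers IN0/IN1 and OUT0/OUT1 and swap their roles each iteration, as in Figure 6.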

for (int i = 0; i < numOfSegments; i++)
    Asynchronously write segment i from host to device using stream i;
for (int i = 0; i < numOfSegments; i++)
    Process segment i on the GPU using stream i;
for (int i = 0; i < numOfSegments; i++)
    Asynchronously read segment i results from device using stream i;

Fig. 5. Host-to-host Strategy A

Write segment 0 from host to device buffer IN0;
for (int i = 1; i < numOfSegments; i++) {
    Asynchronously write segment i from host to device buffer IN1;
    Process segment i - 1 on the GPU using IN0 and OUT0;
    Wait for all read/write/compute to complete;
    Asynchronously read segment i - 1 results from OUT0;
    Swap roles of IN0 and IN1;
    Swap roles of OUT0 and OUT1;
}
Process the last segment on the GPU using IN0 and OUT0;
Read the last segment's results from OUT0;

Fig. 6. Host-to-host Strategy B

6.2 Completion Time

Figure 7 summarizes the notation used in our analysis of the completion time of Strategies A and B. For our analysis, we make several simplifying assumptions as below.

1) The time, t_w, to write or copy a segment of input data from the host to the device memory is the same for all segments.
2) The time, t_p, the GPU takes to process a segment of input data and create its corresponding output segment is the same for all segments.
3) The time, t_r, to read or copy a segment of output data from the device to the host memory is the same for all segments.
4) The write, processing, and read for each segment begin at the earliest possible time for the chosen strategy and complete t_w, t_p, and t_r units later, respectively.
5) In every feasible strategy, the relative order of segment writes, processing, and reads is the same and is segment 0, followed by segment 1, ..., and ending with segment s − 1, where s is the number of segments.

s          number of segments
t_w        time to write an input data segment from host to device memory
t_r        time to read an output data segment from device to host memory
t_p        time taken by the GPU to process an input data segment and create the corresponding output segment
T_w        s·t_w (total write time)
T_r        s·t_r (total read time)
T_p        s·t_p (total processing time)
T^w_s(i)   time at which the writing of input segment i to device memory starts
T^p_s(i)   time at which the processing of segment i by the GPU starts
T^r_s(i)   time at which the reading of output segment i to host memory starts
T^w_f(i)   time at which the writing of input segment i to device memory finishes
T^p_f(i)   time at which the processing of segment i by the GPU finishes
T^r_f(i)   time at which the reading of output segment i to host memory finishes
T_A        completion time using Strategy A
T_B        completion time using Strategy B
L          lower bound on completion time

Fig. 7. Notation used in completion time analysis

Writing from the host memory to the device memory uses the same I/O channel/bus as used to read from the device memory to the host memory, and the GPU is necessarily idle when the first input segment is being written to the device memory and when the last output segment is being read from this memory. So, t_w + max{(s − 1)(t_w + t_r), s·t_p} + t_r is a lower bound on the completion time of any host-to-host computing strategy. It is easy to see that when the number of segments s is 1, the completion time for both Strategies A and B is t_w + t_p + t_r, which equals the lower bound. Actually, when s = 1, both strategies are identical and optimal. The analysis of the two strategies for s > 1 is more complex and is done below in Theorems 2 to 5. We note that assumption 4 implies that T^w_f(i) = T^w_s(i) + t_w, T^p_f(i) = T^p_s(i) + t_p, and T^r_f(i) = T^r_s(i) + t_r, 0 ≤ i < s. The completion time is T^r_f(s − 1).
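As a numeric cross-check of the analysis that follows (Theorems 2 through 6), the plain-C sketch below simulates both strategies directly from assumptions 1 through 5; it is illustrative only and assumes 2 ≤ s ≤ 64:

#include <stdio.h>

static double maxd(double a, double b) { return a > b ? a : b; }

/* Strategy A on the single-channel model: all s writes precede the
   first read; processing chains on the GPU; reads chain on the channel. */
double strategyA(int s, double tw, double tp, double tr)
{
    double chan = 0, gpu = 0, rf = 0, pf[64];
    for (int i = 0; i < s; i++) {
        chan += tw;                    /* T^w_f(i) = (i+1)*tw          */
        gpu = maxd(chan, gpu) + tp;    /* T^p_f(i)                     */
        pf[i] = gpu;
    }
    for (int i = 0; i < s; i++) {      /* reads start after all writes */
        rf = maxd(maxd(chan, pf[i]), rf) + tr;
        chan = rf;                     /* channel busy until read ends */
    }
    return rf;                         /* T_A = T^r_f(s-1)             */
}

/* Strategy B: double buffering with a barrier each iteration (Fig. 6);
   agrees with the closed form of Theorem 4 below.                     */
double strategyB(int s, double tw, double tp, double tr)
{
    double t = tw;                     /* write segment 0                     */
    t += maxd(tw, tp);                 /* write 1 || process 0                */
    for (int i = 2; i < s; i++)        /* (read i-2 + write i) || process i-1 */
        t += maxd(tw + tr, tp);
    t += maxd(tp, tr);                 /* process s-1 || read s-2             */
    return t + tr;                     /* read segment s-1                    */
}

int main(void)
{
    /* instance discussed after Theorem 6: s = 4, tr = tp = 3, tw = 1 */
    printf("TA = %g, TB = %g\n", strategyA(4, 1, 3, 3), strategyB(4, 1, 3, 3));
    return 0;  /* prints TA = 16, TB = 18 */
}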
Theorem 2: When s > 1, the completion time, T_A, of Strategy A is:

1) T_w + T_r whenever any of the following holds:
   a) t_w ≥ t_p and t_p ≤ T_r − t_r;
   b) t_w < t_p, T_w − t_w > t_p, and t_r ≥ t_p;
   c) t_w < t_p, T_w − t_w > t_p, t_r < t_p, and no i, 0 ≤ i < s, satisfies t_w + (i + 1)·t_p > T_w + i·t_r.
2) T_w + t_p + t_r when t_w ≥ t_p and t_p > T_r − t_r.
3) t_w + t_p + T_r when t_w < t_p, T_w − t_w ≤ t_p, and t_r > t_p.
4) t_w + T_p + t_r when either of the following holds:
   a) t_w < t_p, T_w − t_w ≤ t_p, and t_r ≤ t_p;
   b) t_w < t_p, T_w − t_w > t_p, t_r < t_p, and some i, 0 ≤ i < s, satisfies t_w + (i + 1)·t_p > T_w + i·t_r.

Proof: It should be easy to see that the conditions listed in the theorem exhaust all possibilities. When Strategy A is used, all the writes to device memory complete before the first read begins (i.e., T^r_s(0) ≥ T^w_f(s − 1)), T^w_s(i) = i·t_w and T^w_f(i) = (i + 1)·t_w for 0 ≤ i < s, and T^r_s(0) ≥ T^w_f(s − 1) = s·t_w = T_w.

When t_w ≥ t_p, T^p_s(i) = T^w_f(i) = (i + 1)·t_w (Figure 8 illustrates this for s = 4). Hence, T^p_f(i) = (i + 1)·t_w + t_p ≤ T^w_f(i + 1), 0 ≤ i < s, where T^w_f(s) is defined to be (s + 1)·t_w. So, T^r_s(0) = max{T_w, T^p_f(0)} = T_w and T^r_s(i) = max{T^r_f(i − 1), T^p_f(i)}, 1 ≤ i < s. Since T^p_f(i) ≤ T_w for i < s − 1, we get T^r_s(i) = T^r_f(i − 1), 1 ≤ i < s − 1. So, T^r_f(s − 2) = T^r_s(s − 2) + t_r = T_w + (s − 1)·t_r = T_w + T_r − t_r. Hence, T^r_s(s − 1) = max{T_w + T_r − t_r, T_w + t_p} and T_A = T^r_f(s − 1) = T^r_s(s − 1) + t_r = max{T_w + T_r, T_w + t_p + t_r}. So, when t_p ≤ T_r − t_r, T_A = T_w + T_r (Figure 8(a)), and when t_p > T_r − t_r, T_A = T_w + t_p + t_r (Figure 8(b)). This proves cases 1a and 2 of the theorem.

Fig. 8. Strategy A, t_w ≥ t_p, s = 4 (cases 1a and 2)

Fig. 9. Strategy A, t_w < t_p, T_w − t_w ≤ t_p, s = 4 (cases 4a and 3)

Fig. 10. Strategy A, t_w < t_p, T_w − t_w > t_p, s = 4 (cases 1b, 1c, and 4b)

When t_w < t_p, T^p_f(i) = t_w + (i + 1)·t_p, 0 ≤ i < s. We consider the two subcases T_w − t_w ≤ t_p and T_w − t_w > t_p. The first of these has two subsubcases of its own: t_r ≤ t_p (theorem case 4a) and t_r > t_p (theorem case 3). These subsubcases are shown in Figure 9 for the case of 4 segments. It is easy to see that T_A = t_w + T_p + t_r when t_r ≤ t_p and T_A = t_w + t_p + T_r when t_r > t_p. The second subcase (t_w < t_p and T_w − t_w > t_p) has two subsubcases as well: t_r ≥ t_p and t_r < t_p. When t_r ≥ t_p (Figure 10(a)), T^r_s(i) = T_w + i·t_r, 0 ≤ i < s, and T_A = T^r_f(s − 1) = T_w + T_r (theorem case 1b). When t_r < t_p and no i, 0 ≤ i < s, satisfies t_w + (i + 1)·t_p > T_w + i·t_r (theorem case 1c), no i, 0 ≤ i < s, satisfies T^p_f(i) > T^r_f(i − 1), where T^r_f(−1) is defined to be T_w. So, T_A = T^r_f(s − 1) = T_w + T_r (Figure 10(b)). When t_r < t_p and some i, 0 ≤ i < s, satisfies t_w + (i + 1)·t_p > T_w + i·t_r (theorem case 4b), some i satisfies T^p_f(i) > T^r_f(i − 1) and T_A = t_w + T_p + t_r (Figure 10(c)).

Theorem 3: The completion time using Strategy A is the minimum possible completion time for every combination of t_w, t_p, and t_r.

Proof: First, we obtain a tighter lower bound L than the bound t_w + max{(s − 1)(t_w + t_r), s·t_p} + t_r provided at the beginning of this section. Since writes and reads are done serially on the same I/O channel, L ≥ T_w + T_r. Since T^w_f(s − 1) ≥ T_w for every strategy, the processing of the last segment cannot begin before T_w. Hence, L ≥ T_w + t_p + t_r. Since the processing of the first segment cannot begin before t_w, the read of the first segment's output cannot complete before t_w + t_p + t_r. The remaining reads require (s − 1)·t_r time and are done after the first read completes. So, the last read cannot complete before t_w + t_p + T_r. Hence, L ≥ t_w + t_p + T_r. Also, since the processing of the first segment cannot begin before t_w, T^p_f(s − 1) ≥ t_w + T_p. Hence, L ≥ t_w + T_p + t_r. Combining all of these bounds on L, we get L ≥ max{T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r}. From Theorem 2, we see that, in all cases, T_A equals one of the expressions T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, and t_w + T_p + t_r. Hence a tight lower bound on the completion time of every strategy is L = max{T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r}.
Strategy A achieves this tight lower bound (Theorem 2) and so obtains the minimum possible completion time.

Theorem 4: When s > 1, the completion time, T_B, of Strategy B is

T_B = t_w + max{t_w, t_p} + (s − 2)·max{t_w + t_r, t_p} + max{t_p, t_r} + t_r

Proof: When the for-loop index i = 1, the read within the loop begins at t_w + max{t_w, t_p}. For 2 ≤ i < s, this read begins at t_w + max{t_w, t_p} + (i − 1)·max{t_w + t_r, t_p}. So, the final read, which is outside the loop, begins at t_w + max{t_w, t_p} + (s − 2)·max{t_w + t_r, t_p} + max{t_p, t_r} and completes t_r units later. Hence, T_B = t_w + max{t_w, t_p} + (s − 2)·max{t_w + t_r, t_p} + max{t_p, t_r} + t_r.

Theorem 5: Strategy B does not guarantee minimum completion time.

Proof: First, we consider two cases in which T_B equals the tight lower bound L of Theorem 3. The first is when t_p = min{t_w, t_p, t_r}. Now, T_B = T_w + T_r = L. The second case is when t_p ≥ t_w + t_r. Now, from Theorem 4, we obtain T_B = t_w + T_p + t_r = L. When s = 3 and t_w > t_p > t_r > 0, T_B = T_w + (s − 1)·t_r + t_p = T_w + T_r + t_p − t_r = T_w + t_p + 2t_r. When Strategy A is used with this data, we are either in case 1a of Theorem 2 and T_A = T_w + T_r = L < T_w + T_r + t_p − t_r = T_B, or we are in case 2 of Theorem 2 and T_A = T_w + t_p + t_r = L < T_w + t_p + 2t_r = T_B.

The next theorem shows that the completion time when using Strategy B is less than 13.33% more than when Strategy A is used.

Theorem 6: (T_B − T_A)/T_A = 0 when s = 2 and (T_B − T_A)/T_A < (s − 2)/(s² − 1) ≤ 2/15 when s > 2.

Proof: We consider 5 cases that together exhaust all possibilities for the relationship among t_w, t_p, and t_r.

1) t_w ≥ t_p and t_r ≥ t_p. From Theorems 2 (case 1a applies as t_p ≤ t_r ≤ (s − 1)·t_r = T_r − t_r) and 4, it follows that T_A = T_B = T_w + T_r. So, (T_B − T_A)/T_A = 0.

2) t_w ≥ t_p and t_r < t_p. From Theorem 4, T_B = T_w + T_r + t_p − t_r. We may be in either case 1a or case 2 of Theorem 2. If we are in case 1a, then T_A = T_w + T_r and T_r − t_r = (s − 1)·t_r ≥ t_p. So, t_r ≥ t_p/(s − 1). (Note that since we also have t_r < t_p, s must be at least 3 for case 1a to apply.) Therefore, t_w + t_r ≥ (1 + 1/(s − 1))·t_p = (s/(s − 1))·t_p. Hence,

(T_B − T_A)/T_A = (t_p − t_r)/(s·(t_w + t_r)) ≤ (t_p·(1 − 1/(s − 1)))/((s²/(s − 1))·t_p) = (s − 2)/s² < (s − 2)/(s² − 1)

If we are in case 2 of Theorem 2, then T_A = T_w + t_p + t_r and t_w ≥ t_p > T_r − t_r = (s − 1)·t_r. So,

(T_B − T_A)/T_A = ((s − 2)·t_r)/(s·t_w + t_p + t_r)

The right hand side equals 0 when s = 2 and, for s > 2, the right hand side is

< ((s − 2)·t_r)/(s·(s − 1)·t_r + (s − 1)·t_r + t_r) = (s − 2)/s² < (s − 2)/(s² − 1)

3) t_w < t_p and t_r ≥ t_p. Now, T_B = T_w + T_r + t_p − t_w. For Strategy A, we are in case 1b, 3, or 4a of Theorem 2. If we are in case 1b, T_A = T_w + T_r = s·(t_w + t_r), t_r ≥ t_p, and (s − 1)·t_w = T_w − t_w > t_p. (Note that since we also have t_w < t_p, s must be at least 3 for case 1b to apply.) So,

(T_B − T_A)/T_A = (t_p − t_w)/(s·(t_w + t_r)) < (t_p·(1 − 1/(s − 1)))/(s·(t_p/(s − 1) + t_p)) = (s − 2)/s² < (s − 2)/(s² − 1)

When case 3 of Theorem 2 applies, T_A = t_w + t_p + T_r and (s − 1)·t_w ≤ t_p < t_r. So,

(T_B − T_A)/T_A = ((s − 2)·t_w)/(t_w + t_p + s·t_r)

The right hand side equals 0 when s = 2 and, for s > 2, the right hand side is < ((s − 2)·t_w)/(s²·t_w) = (s − 2)/s² < (s − 2)/(s² − 1). When case 4a of Theorem 2 applies, T_A = t_w + T_p + t_r, (s − 1)·t_w = T_w − t_w ≤ t_p, and t_r = t_p. So, T_A = t_w + (s + 1)·t_p ≥ t_w + (s + 1)(s − 1)·t_w = s²·t_w and T_B − T_A = (s − 1)·t_w + (s + 1)·t_r − t_w − (s + 1)·t_p = (s − 2)·t_w. Therefore,

(T_B − T_A)/T_A ≤ ((s − 2)·t_w)/(s²·t_w) = (s − 2)/s² < (s − 2)/(s² − 1)

4) t_w < t_p, t_r < t_p, and t_w + t_r ≥ t_p. Now, T_B = T_w + T_r + 2t_p − t_w − t_r. For Strategy A, we are in case 1c, 4a, or 4b of Theorem 2. If we are in case 1c, T_A = T_w + T_r and t_w + s·t_p ≤ s·t_w + (s − 1)·t_r. So, (s − 1)(t_w + t_r) ≥ s·t_p, or t_w + t_r ≥ (s/(s − 1))·t_p. Hence,

(T_B − T_A)/T_A = (2t_p − t_w − t_r)/(s·(t_w + t_r)) ≤ (2t_p − (s/(s − 1))·t_p)/((s²/(s − 1))·t_p) = (s − 2)/s²

The right hand side is 0 when s = 2 and is < (s − 2)/(s² − 1) when s > 2. For both cases 4a and 4b, T_A = t_w + T_p + t_r. When case 4a applies, (s − 1)·t_w ≤ t_p.
Since t_w + t_r ≥ t_p and t_r ≤ t_p, the numerator of (T_B − T_A)/T_A = ((s − 2)(t_w + t_r − t_p))/(t_w + s·t_p + t_r) is at most (s − 2)·t_w. Also, t_r ≥ t_p − t_w ≥ t_p·(1 − 1/(s − 1)) = ((s − 2)/(s − 1))·t_p ≥ (s − 2)·t_w. Hence,

(T_B − T_A)/T_A ≤ ((s − 2)·t_w)/(t_w + s·t_p + t_r)

The right hand side equals 0 when s = 2 and, for s > 2, the right hand side is

< ((s − 2)·t_w)/(t_w + s·(s − 1)·t_w + (s − 2)·t_w) = (s − 2)/(s² − 1)

When case 4b applies, it follows from t_p > t_r and the conditions for case 4b that t_w + s·t_p > T_w + (s − 1)·t_r = s·t_w + (s − 1)·t_r. So, s·t_p > (s − 1)(t_w + t_r) and t_w + t_r < (s/(s − 1))·t_p. Hence, t_w + t_r − t_p < t_p/(s − 1). From this inequality and t_w + t_r ≥ t_p, we get

(T_B − T_A)/T_A = ((s − 2)(t_w + t_r − t_p))/(t_w + s·t_p + t_r) < (((s − 2)/(s − 1))·t_p)/((s + 1)·t_p) = (s − 2)/(s² − 1), s > 2

The right hand side equals 0 when s = 2.

5) t_w < t_p, t_r < t_p, and t_w + t_r < t_p. Now, T_B = t_w + T_p + t_r = T_A. Only cases 1c, 4a, and 4b of Theorem 2 are possible when t_w < t_p and t_r < t_p. However, since we also have t_w + t_r < t_p, (s − 1)·t_p > (s − 1)·t_w + (s − 1)·t_r. So, t_w + (s − 1)·t_p > T_w + (s − 1)·t_r and hence t_w + s·t_p > T_w + (s − 1)·t_r. Therefore, case 1c does not apply and T_A = t_w + T_p + t_r = T_B. So, (T_B − T_A)/T_A = 0.

To see that the bound of Theorem 6 is quite tight, consider the instance s = 4, t_r = t_p = 3, and t_w = 1. For this instance, T_A = t_w + T_p + t_r = 16 (case 4a) and T_B = 18. So, (T_B − T_A)/T_A = 1/8. The instance falls into case 3 (subcase 4a of Theorem 2) of Theorem 6, for which we have shown (T_B − T_A)/T_A ≤ (s − 2)/s² = 1/8.

Strategy B becomes more competitive with Strategy A as the number of segments s increases. For example, when s = 20, (T_B − T_A)/T_A < 18/399 ≈ 0.045.

6.3 Completion Time Using Enhanced GPUs

In this section, we analyze the completion times of Strategies A and B under the assumption that we are using a GPU system that is enhanced so that there are two I/O channels/buses between the host CPU and the GPU and the CPU has a dual-port memory that supports simultaneous reads and writes. In this case, the writing of an input data segment to device memory can be overlapped with the reading of an output data segment from device memory. When s = 1, the enhanced GPU is unable to perform any better than the original GPU and T_A = T_B = t_w + t_p + t_r. Theorems 7 through 11 are the enhanced-GPU analogs of Theorems 2 through 5 for the case s > 1.

Theorem 7: When s > 1, the completion time, T_A, of Strategy A for the enhanced GPU model is

T_A = T_w + t_p + t_r   when t_w ≥ t_p and t_w ≥ t_r
T_A = t_w + T_p + t_r   when t_w < t_p and t_p ≥ t_r
T_A = t_w + t_p + T_r   otherwise

Proof: As for the original GPU model, T^w_s(i) = i·t_w, 0 ≤ i < s. When t_w ≥ t_p, T^p_s(i) = T^w_f(i) = (i + 1)·t_w and T^p_f(i) = T^w_f(i) + t_p = (i + 1)·t_w + t_p. Figure 11(a) shows the schedule of events when s = 4, t_w ≥ t_p, and t_w ≥ t_r. Since t_w ≥ t_r, T^r_s(i) = T^p_f(i) = (i + 1)·t_w + t_p, 0 ≤ i < s. So, T^r_s(s − 1) = s·t_w + t_p and T_A = T^r_f(s − 1) = T_w + t_p + t_r. Figure 11(b) shows the schedule of events when s = 4, t_w ≥ t_p, and t_w < t_r. Since t_w < t_r, T^r_s(i) = T^r_f(i − 1) = T^r_s(i − 1) + t_r, 0 < i < s. Since T^r_f(0) = t_w + t_p + t_r, T^r_s(s − 1) = t_w + t_p + (s − 1)·t_r and T_A = T^r_f(s − 1) = t_w + t_p + T_r.

Fig. 11. Strategy A, enhanced GPU, s = 4: (a) t_w ≥ t_p and t_w ≥ t_r; (b) t_w ≥ t_p and t_w < t_r; (c) t_w < t_p and t_p ≥ t_r; (d) t_w < t_p < t_r

When t_w < t_p and t_p ≥ t_r, T^p_s(i) = t_w + i·t_p, 0 ≤ i < s, and T^r_s(i) = T^p_f(i), 0 ≤ i < s (Figure 11(c)). So, T_A = t_w + s·t_p + t_r = t_w + T_p + t_r. The final case to consider is t_w < t_p < t_r (Figure 11(d)). Now, T^p_s(i) = t_w + i·t_p, 0 ≤ i < s, and T^r_s(i) = t_w + t_p + i·t_r. So, T_A = t_w + t_p + T_r.

Theorem 8: For the enhanced GPU model, the completion time using Strategy A is the minimum possible completion time for every combination of t_w, t_p, and t_r.

Proof: The earliest time the processing of the last segment can begin is T_w. So, T^p_f(s − 1) ≥ T_w + t_p and the completion time L of every strategy is at least T_w + t_p + t_r. Further, the earliest time the read of the first output segment can begin is t_w + t_p. Following this earliest time, the read channel is needed for at least s·t_r time to complete the reads. So, L ≥ t_w + t_p + T_r. Also, since T^p_s(0) ≥ t_w, the earliest time the processing of the last segment can begin is t_w + (s − 1)·t_p. Hence, L ≥ t_w + T_p + t_r.
Combining these lower bounds, we get L ≥ max{T_w + t_p + t_r, t_w + T_p + t_r, t_w + t_p + T_r}. Since T_A equals the derived lower bound, the lower bound is tight, L = max{T_w + t_p + t_r, t_w + T_p + t_r, t_w + t_p + T_r}, and T_A is the minimum possible completion time.

The next theorem shows that the optimal completion time using the original GPU model of Section 6.2 is at most twice that for the enhanced model of this section.

Theorem 9: Let t_w, t_p, t_r, and s define a host-to-host instance. Let C1 and C2, respectively, be the completion times for an optimal host-to-host execution using the original and enhanced GPU models. Then C1 ≤ 2·C2 and this bound is tight.

Proof: From the proofs of Theorems 3 and 8, it follows that C1 = max{T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r} and C2 = max{T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r}. Hence, when C1 ≠ T_w + T_r, C1 = C2 ≤ 2·C2. Assume


CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017 CMSC CMSC : Lecture Greedy Algorithms for Scheduling Tuesday, Sep 9, 0 Reading: Sects.. and. of KT. (Not covered in DPV.) Interval Scheduling: We continue our discussion of greedy algorithms with a number

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Performance Analysis of Lattice QCD Application with APGAS Programming Model

Performance Analysis of Lattice QCD Application with APGAS Programming Model Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models

More information

Research into GPU accelerated pattern matching for applications in computer security

Research into GPU accelerated pattern matching for applications in computer security Research into GPU accelerated pattern matching for applications in computer security November 4, 2009 Alexander Gee age19@student.canterbury.ac.nz Department of Computer Science and Software Engineering

More information

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance

More information

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach Process Scheduling for RTS Dr. Hugh Melvin, Dept. of IT, NUI,G RTS Scheduling Approach RTS typically control multiple parameters concurrently Eg. Flight Control System Speed, altitude, inclination etc..

More information

EDF Feasibility and Hardware Accelerators

EDF Feasibility and Hardware Accelerators EDF Feasibility and Hardware Accelerators Andrew Morton University of Waterloo, Waterloo, Canada, arrmorton@uwaterloo.ca Wayne M. Loucks University of Waterloo, Waterloo, Canada, wmloucks@pads.uwaterloo.ca

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

A CUDA Solver for Helmholtz Equation

A CUDA Solver for Helmholtz Equation Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College

More information

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering TIMING ANALYSIS Overview Circuits do not respond instantaneously to input changes

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

Julian Merten. GPU Computing and Alternative Architecture

Julian Merten. GPU Computing and Alternative Architecture Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg

More information

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain

More information

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Vector Layout in Virtual-Memory Systems for Data-Parallel Computing

Vector Layout in Virtual-Memory Systems for Data-Parallel Computing Vector Layout in Virtual-Memory Systems for Data-Parallel Computing Thomas H. Cormen Department of Mathematics and Computer Science Dartmouth College Abstract In a data-parallel computer with virtual memory,

More information

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,

More information

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Supplemental material Guido Klingbeil, Radek Erban, Mike Giles, and Philip K. Maini This document

More information

COMPUTER SCIENCE TRIPOS

COMPUTER SCIENCE TRIPOS CST.2016.2.1 COMPUTER SCIENCE TRIPOS Part IA Tuesday 31 May 2016 1.30 to 4.30 COMPUTER SCIENCE Paper 2 Answer one question from each of Sections A, B and C, and two questions from Section D. Submit the

More information

POLITECNICO DI MILANO DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS

POLITECNICO DI MILANO DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS POLITECNICO DI MILANO Facoltà di Ingegneria dell Informazione Corso di Laurea in Ingegneria Informatica DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS Relatore: Prof.

More information

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm CSCI 1760 - Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm Shay Mozes Brown University shay@cs.brown.edu Abstract. This report describes parallel Java implementations of

More information

Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers

Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers Jonathan Turner Washington University jon.turner@wustl.edu January 30, 2008 Abstract Crossbar-based switches are commonly used

More information

Sparse solver 64 bit and out-of-core addition

Sparse solver 64 bit and out-of-core addition Sparse solver 64 bit and out-of-core addition Prepared By: Richard Link Brian Yuen Martec Limited 1888 Brunswick Street, Suite 400 Halifax, Nova Scotia B3J 3J8 PWGSC Contract Number: W7707-145679 Contract

More information

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD XVIII International Conference on Water Resources CMWR 2010 J. Carrera (Ed) c CIMNE, Barcelona, 2010 COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD James.E. McClure, Jan F. Prins

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School

More information

Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs

Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Christopher P. Stone, Ph.D. Computational Science and Engineering, LLC Kyle Niemeyer, Ph.D. Oregon State University 2 Outline

More information

GPU Acceleration of BCP Procedure for SAT Algorithms

GPU Acceleration of BCP Procedure for SAT Algorithms GPU Acceleration of BCP Procedure for SAT Algorithms Hironori Fujii 1 and Noriyuki Fujimoto 1 1 Graduate School of Science Osaka Prefecture University 1-1 Gakuencho, Nakaku, Sakai, Osaka 599-8531, Japan

More information

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

ESE 570: Digital Integrated Circuits and VLSI Fundamentals ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 21: April 4, 2017 Memory Overview, Memory Core Cells Penn ESE 570 Spring 2017 Khanna Today! Memory " Classification " ROM Memories " RAM Memory

More information

Scheduling Slack Time in Fixed Priority Pre-emptive Systems

Scheduling Slack Time in Fixed Priority Pre-emptive Systems Scheduling Slack Time in Fixed Priority Pre-emptive Systems R.I.Davis Real-Time Systems Research Group, Department of Computer Science, University of York, England. ABSTRACT This report addresses the problem

More information

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic

More information

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory Randomized Selection on the GPU Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory High Performance Graphics 2011 August 6, 2011 Top k Selection on GPU Output the top k keys

More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

CPU SCHEDULING RONG ZHENG

CPU SCHEDULING RONG ZHENG CPU SCHEDULING RONG ZHENG OVERVIEW Why scheduling? Non-preemptive vs Preemptive policies FCFS, SJF, Round robin, multilevel queues with feedback, guaranteed scheduling 2 SHORT-TERM, MID-TERM, LONG- TERM

More information

Fall 2008 CSE Qualifying Exam. September 13, 2008

Fall 2008 CSE Qualifying Exam. September 13, 2008 Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Average Complexity of Exact and Approximate Multiple String Matching

Average Complexity of Exact and Approximate Multiple String Matching Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science

More information

CIS 4930/6930: Principles of Cyber-Physical Systems

CIS 4930/6930: Principles of Cyber-Physical Systems CIS 4930/6930: Principles of Cyber-Physical Systems Chapter 11 Scheduling Hao Zheng Department of Computer Science and Engineering University of South Florida H. Zheng (CSE USF) CIS 4930/6930: Principles

More information

ALU A functional unit

ALU A functional unit ALU A functional unit that performs arithmetic operations such as ADD, SUB, MPY logical operations such as AND, OR, XOR, NOT on given data types: 8-,16-,32-, or 64-bit values A n-1 A n-2... A 1 A 0 B n-1

More information

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

MICROPROCESSOR REPORT.   THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE MICROPROCESSOR www.mpronline.com REPORT THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE ENERGY COROLLARIES TO AMDAHL S LAW Analyzing the Interactions Between Parallel Execution and Energy Consumption By

More information

Multiplying Products of Prime Powers

Multiplying Products of Prime Powers Problem 1: Multiplying Products of Prime Powers Each positive integer can be expressed (in a unique way, according to the Fundamental Theorem of Arithmetic) as a product of powers of the prime numbers.

More information

Stream Ciphers. Çetin Kaya Koç Winter / 20

Stream Ciphers. Çetin Kaya Koç   Winter / 20 Çetin Kaya Koç http://koclab.cs.ucsb.edu Winter 2016 1 / 20 Linear Congruential Generators A linear congruential generator produces a sequence of integers x i for i = 1,2,... starting with the given initial

More information

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Time. To do. q Physical clocks q Logical clocks

Time. To do. q Physical clocks q Logical clocks Time To do q Physical clocks q Logical clocks Events, process states and clocks A distributed system A collection P of N single-threaded processes (p i, i = 1,, N) without shared memory The processes in

More information

COMPUTER SCIENCE TRIPOS

COMPUTER SCIENCE TRIPOS CST0.2017.2.1 COMPUTER SCIENCE TRIPOS Part IA Thursday 8 June 2017 1.30 to 4.30 COMPUTER SCIENCE Paper 2 Answer one question from each of Sections A, B and C, and two questions from Section D. Submit the

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information

Chapter 3. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 3 <1>

Chapter 3. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 3 <1> Chapter 3 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 3 Chapter 3 :: Topics Introduction Latches and Flip-Flops Synchronous Logic Design Finite

More information

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Finite State Machines Introduction Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Such devices form

More information

Analytical Modeling of Parallel Systems

Analytical Modeling of Parallel Systems Analytical Modeling of Parallel Systems Chieh-Sen (Jason) Huang Department of Applied Mathematics National Sun Yat-sen University Thank Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing

More information

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

ESE 570: Digital Integrated Circuits and VLSI Fundamentals ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 19: March 29, 2018 Memory Overview, Memory Core Cells Today! Charge Leakage/Charge Sharing " Domino Logic Design Considerations! Logic Comparisons!

More information

From Sequential Circuits to Real Computers

From Sequential Circuits to Real Computers 1 / 36 From Sequential Circuits to Real Computers Lecturer: Guillaume Beslon Original Author: Lionel Morel Computer Science and Information Technologies - INSA Lyon Fall 2017 2 / 36 Introduction What we

More information

PSEUDORANDOM numbers are very important in practice

PSEUDORANDOM numbers are very important in practice Proceedings of the Federated Conference on Computer Science and Information Systems pp 571 578 ISBN 978-83-681-51-4 Parallel GPU-accelerated Recursion-based Generators of Pseudorandom Numbers Przemysław

More information

ww.padasalai.net

ww.padasalai.net t w w ADHITHYA TRB- TET COACHING CENTRE KANCHIPURAM SUNDER MATRIC SCHOOL - 9786851468 TEST - 2 COMPUTER SCIENC PG - TRB DATE : 17. 03. 2019 t et t et t t t t UNIT 1 COMPUTER SYSTEM ARCHITECTURE t t t t

More information

Introduction to Languages and Computation

Introduction to Languages and Computation Introduction to Languages and Computation George Voutsadakis 1 1 Mathematics and Computer Science Lake Superior State University LSSU Math 400 George Voutsadakis (LSSU) Languages and Computation July 2014

More information

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8))

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) Tung Chou January 5, 2012 QUAD Stream cipher. Security relies on MQ (Multivariate Quadratics). QUAD The Provably-secure QUAD(q, n, r) Stream Cipher

More information

String Search. 6th September 2018

String Search. 6th September 2018 String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)

More information

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. 1 2 Introduction Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. Defines the precise instants when the circuit is allowed to change

More information

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine First, a look at using OpenACC on WRF subroutine advance_w dynamics routine Second, an estimate of WRF multi-node performance on Cray XK6 with GPU accelerators Based on performance of WRF kernels, what

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/

More information

Calculating Algebraic Signatures Thomas Schwarz, S.J.

Calculating Algebraic Signatures Thomas Schwarz, S.J. Calculating Algebraic Signatures Thomas Schwarz, S.J. 1 Introduction A signature is a small string calculated from a large object. The primary use of signatures is the identification of objects: equal

More information

CMP 338: Third Class

CMP 338: Third Class CMP 338: Third Class HW 2 solution Conversion between bases The TINY processor Abstraction and separation of concerns Circuit design big picture Moore s law and chip fabrication cost Performance What does

More information

Exam Spring Embedded Systems. Prof. L. Thiele

Exam Spring Embedded Systems. Prof. L. Thiele Exam Spring 20 Embedded Systems Prof. L. Thiele NOTE: The given solution is only a proposal. For correctness, completeness, or understandability no responsibility is taken. Sommer 20 Eingebettete Systeme

More information

Remainders. We learned how to multiply and divide in elementary

Remainders. We learned how to multiply and divide in elementary Remainders We learned how to multiply and divide in elementary school. As adults we perform division mostly by pressing the key on a calculator. This key supplies the quotient. In numerical analysis and

More information

! Charge Leakage/Charge Sharing. " Domino Logic Design Considerations. ! Logic Comparisons. ! Memory. " Classification. " ROM Memories.

! Charge Leakage/Charge Sharing.  Domino Logic Design Considerations. ! Logic Comparisons. ! Memory.  Classification.  ROM Memories. ESE 57: Digital Integrated Circuits and VLSI Fundamentals Lec 9: March 9, 8 Memory Overview, Memory Core Cells Today! Charge Leakage/ " Domino Logic Design Considerations! Logic Comparisons! Memory " Classification

More information

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on

More information

Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks

Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks Stefan Metzlaff, Sebastian Weis, and Theo Ungerer Department of Computer Science,

More information

Nikolaj Leischner, Vitaly Osipov, Peter Sanders

Nikolaj Leischner, Vitaly Osipov, Peter Sanders Nikolaj Leichner, Vitaly Oipov, Peter Sander Intitut für Theoretiche Informatik - Algorithmik II Oipov: 1 KITVitaly Univerität de Lande Baden-Württemerg und nationale Groforchungzentrum in der Helmholtz-Gemeinchaft

More information

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose 上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU

More information

Binary Decision Diagrams and Symbolic Model Checking

Binary Decision Diagrams and Symbolic Model Checking Binary Decision Diagrams and Symbolic Model Checking Randy Bryant Ed Clarke Ken McMillan Allen Emerson CMU CMU Cadence U Texas http://www.cs.cmu.edu/~bryant Binary Decision Diagrams Restricted Form of

More information

INF 4130 / /8-2017

INF 4130 / /8-2017 INF 4130 / 9135 28/8-2017 Algorithms, efficiency, and complexity Problem classes Problems can be divided into sets (classes). Problem classes are defined by the type of algorithm that can (or cannot) solve

More information

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0 NEC PerforCache Influence on M-Series Disk Array Behavior and Performance. Version 1.0 Preface This document describes L2 (Level 2) Cache Technology which is a feature of NEC M-Series Disk Array implemented

More information

of 19 23/02/ :23

of 19 23/02/ :23 LOS ONE: cutauleaping: A GPU-Powered Tau-Leaping Stochastic... of 19 23/02/2015 14:23 Marco S. Nobile, Paolo Cazzaniga, Daniela Besozzi, Dario Pescini, Giancarlo Mauri Published: March 24, 2014 DOI: 10.1371/journal.pone.0091963

More information

80386DX. 32-Bit Microprocessor FEATURES: DESCRIPTION: Logic Diagram

80386DX. 32-Bit Microprocessor FEATURES: DESCRIPTION: Logic Diagram 32-Bit Microprocessor 21 1 22 1 2 10 3 103 FEATURES: 32-Bit microprocessor RAD-PAK radiation-hardened agait natural space radiation Total dose hardness: - >100 Krad (Si), dependent upon space mission Single

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

On-line String Matching in Highly Similar DNA Sequences

On-line String Matching in Highly Similar DNA Sequences On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2,ThierryLecroq 1,,MouradElloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Lab Course: distributed data analytics

Lab Course: distributed data analytics Lab Course: distributed data analytics 01. Threading and Parallelism Nghia Duong-Trung, Mohsan Jameel Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim, Germany International

More information