Parallel Longest Common Subsequence using Graphics Hardware

1 Parallel Longest Common Subsequence using Graphics Hardware. John Kloetzli, Brian Strege, Jonathan Decker, Dr. Marc Olano. Presented by: Brian Strege

2 Overview: Introduction; Problem Statement; Background and Related Work; The NVIDIA G80 Architecture; Algorithm Description; Results and Analysis; Conclusion

3 Introduction: We worked on GPU acceleration of dynamic programming, specifically problems in the Gaussian Elimination Paradigm (GEP), and more specifically Longest Common Subsequence (LCS) as a representative GEP problem.

4 Problem Statement: Design and implement an algorithm for finding the LCS of two arbitrary-length strings on a CPU + GPU machine. It must make efficient use of both the CPU and GPU architectures, and its design must have theoretical justification.

6 Related Work: General-purpose computation on graphics hardware: NVIDIA CUDA; Owens et al. (2005). Linear dynamic programming: Hirschberg (1975); Chowdhury et al. (2006). GPU sequence alignment: Liu et al. (2007); Schatz et al. (2007).

7 The NVIDIA G80 Architecture: 16 multiprocessors with 8 cores each (128 logical processors) at 1.35 GHz; 768 MB of RAM; 86.4 GB/sec transfer rate (8.5 GB/sec for a Core 2 Duo); 520 GFLOPS (22 GFLOPS for a Core 2 Duo). Source: NVIDIA CUDA Programming Guide, 1.0.

8 The NVIDIA G80 Architecture: Program workflow: the CPU (host) creates the kernel program; the GPU maps kernel blocks to processors; processors map kernel threads to processor cores; cores execute in parallel. Source: NVIDIA CUDA Programming Guide, 1.0.

10 Algorithm Description: The SIMPLE-LCS recurrence requires quadratic space, which limits scalability, but it is faster than the linear-space method of Chowdhury et al.
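The recurrence named on this slide is the textbook LCS table fill. A minimal Python sketch follows; the function name `simple_lcs` and the tie-breaking rule in the traceback are illustrative choices, not taken from the slides. It stores the full (n+1) x (m+1) table, which is the quadratic space cost the slide refers to.

```python
def simple_lcs(x: str, y: str) -> str:
    """Classic quadratic-space LCS: fill the whole table, then trace back."""
    n, m = len(x), len(y)
    # c[i][j] holds the length of an LCS of x[:i] and y[:j].
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    # Walk back from the bottom-right corner to recover one LCS.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For two strings of length n the table holds (n+1)^2 entries, so memory rather than time becomes the limiting factor as n grows.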

11-30 SIMPLE-LCS Example (figure sequence)

31 Algorithm Description: Chowdhury et al. run a CPU quadratic-space algorithm on small subproblems. CH-LCS is their linear-space algorithm. CUTOFF ranges from

32 Algorithm Description: Our approach is to add another base case that is solved quickly on the GPU. GPU-LCS is our new algorithm (not recursive). GPU-CUTOFF is 2^16. CUTOFF is

33 Algorithm Description: CH: CPU linear-space DP. GPU: GPU DP, with GPU level 1: GPU quadratic-space DP (block level) and GPU level 2: GPU linear-space DP (thread level). Simple: CPU quadratic-space DP.
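One way to read this hierarchy is as a size-based dispatch: large subproblems go to the recursive CPU method, subproblems up to GPU-CUTOFF go to the GPU, and the smallest ones fall back to the simple CPU algorithm. The Python sketch below only illustrates that reading. The function `choose_solver` and its `cutoff` parameter are hypothetical, the CPU cutoff value is not given here, and the exact comparison rules are an assumption.

```python
GPU_CUTOFF = 2 ** 16  # value stated on the slides


def choose_solver(n: int, cutoff: int, gpu_cutoff: int = GPU_CUTOFF) -> str:
    """Pick a solver name for a subproblem of side length n.

    `cutoff` is a hypothetical parameter standing in for the CPU cutoff,
    whose value is not stated in this transcript.
    """
    if n <= cutoff:
        return "SIMPLE-LCS"   # too small to be worth sending to the GPU
    if n <= gpu_cutoff:
        return "GPU-LCS"      # non-recursive GPU base case
    return "CH-LCS"           # recursive CPU linear-space method
```

For example, `choose_solver(2 ** 20, cutoff=256)` selects the recursive method, which would keep splitting until its subproblems are small enough for the GPU.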

34 CH: CPU Linear Space DP: Two recursive functions are used: output boundary and LCS reconstruction.

35 CH: CPU Linear Space DP: Output boundary: given the input boundary, computes the output boundary. Expects the subproblem to be square, with power-of-two side lengths.
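The input/output boundary idea can be shown with a short Python sketch. It is not the recursive procedure of Chowdhury et al.: it is a plain iterative row sweep, and the name `output_boundary` and the argument layout are assumptions made for illustration. Given the table values along the top and left edges of a subproblem, it returns the values along the bottom and right edges while keeping only one row in memory.

```python
def output_boundary(x, y, top, left):
    """Compute the bottom and right edges of an LCS subproblem.

    top  : table values along the top edge, corner included (len(y) + 1 values)
    left : table values along the left edge, corner included (len(x) + 1 values)
    Returns (bottom, right) in the same layout.
    """
    n, m = len(x), len(y)
    if len(top) != m + 1 or len(left) != n + 1 or top[0] != left[0]:
        raise ValueError("boundary sizes do not match the subproblem")

    row = list(top)        # rolling row, reused for every i
    right = [top[m]]
    for i in range(1, n + 1):
        diag = row[0]      # value at (i-1, 0)
        row[0] = left[i]   # value at (i, 0) comes from the input boundary
        for j in range(1, m + 1):
            up = row[j]    # value at (i-1, j), about to be overwritten
            if x[i - 1] == y[j - 1]:
                row[j] = diag + 1
            else:
                row[j] = max(up, row[j - 1])
            diag = up
        right.append(row[m])
    return row, right
```

Because the bottom edge of one subproblem is the top edge of the one below it, subproblems can be chained without ever storing their interiors.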

36-54 Pushing Example (figure sequence)

56 GPU Processing Overview: Two levels of parallelism. Blocks are executed on a processor. Threads are executed on a processor core. Each thread is computed by exactly one processor core.

57 GPU Level 1: Quadratic Space: Computes the length of the LCS for inputs of length at most 2^16. The DP matrix is divided into blocks, and each block is solved by one of the GPU processors. We must enforce the correct order of block execution. All blocks on the same diagonal can be computed in parallel.
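The ordering constraint can be made concrete with a small Python sketch. Each block depends on the blocks above it and to its left, so all blocks on one anti-diagonal are independent of each other. The generator name `block_wavefronts` is illustrative, and the sketch only enumerates the schedule; it does not launch any GPU work.

```python
def block_wavefronts(blocks_per_side: int):
    """Yield one list of (row, col) block indices per anti-diagonal.

    Blocks in the same list have no dependencies on each other, so they
    could be processed in parallel; the lists must be processed in order.
    """
    for d in range(2 * blocks_per_side - 1):
        lo = max(0, d - blocks_per_side + 1)
        hi = min(d, blocks_per_side - 1)
        yield [(r, d - r) for r in range(lo, hi + 1)]
```

With inputs of length 2^16 and a block width of 512 there are 128 blocks per side, giving 255 wavefronts, the widest of which contains 128 independent blocks.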

58 GPU Level 1: Quadratic Space: The basic quadratic-space DP algorithm would require 16 GB of memory. We fold the memory to store only the input/output boundary for each block, which reduces the storage required to 64 MB: from n^2 to 2(n^2/m), where m = 512. Some values are duplicated to avoid memory contention.
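The two memory figures can be checked with a few lines of Python. The slides do not state the size of a table cell; 4 bytes per cell is assumed here because it reproduces both numbers. The names `full_table_bytes` and `folded_bytes` are illustrative.

```python
N = 2 ** 16          # maximum input length handled on the GPU (GPU-CUTOFF)
M = 512              # block width
BYTES_PER_CELL = 4   # assumption: not stated on the slides


def full_table_bytes(n: int = N) -> int:
    """Storage for the full n x n DP table."""
    return n * n * BYTES_PER_CELL


def folded_bytes(n: int = N, m: int = M) -> int:
    """Storage when only block boundaries are kept: 2 * (n^2 / m) cells."""
    return 2 * (n * n // m) * BYTES_PER_CELL


full_gib = full_table_bytes() / 2 ** 30    # 16.0
folded_mib = folded_bytes() / 2 ** 20      # 64.0
```

The saving is a factor of m/2 = 256, since each block of m x m cells is replaced by roughly 2m boundary cells.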

60 GPU Level 2: Linear Space: Within each block there is further parallelism. Each block is divided into threads, and each processor core computes one thread at a time. Hardware-level synchronization ensures the correct diagonal ordering. Each core reuses the same space (white) and computes the entire logical matrix (grey).

61 GPU Level 2: Linear Space: Each thread is a 4x4 subproblem; the size was determined by experimentation. This memory is on chip, so we do not have to worry about memory conflicts. The linear-space algorithm lets us make each block as large as possible, which allows for very fast execution.

63 Simple: CPU Quadratic Space DP: Only gets called when a subproblem is too small for the GPU. Implements SIMPLE-LCS, the classic matrix-based LCS algorithm.

65 Results and Analysis: A GPU thread width of 4 proves optimal.

66 Results and Analysis: A GPU block width of 512 is slightly faster.

67 Results and Analysis: CPU/GPU cutoff sizes were determined experimentally.

68 Results and Analysis: Test DNA sequence data obtained from Mike Brudno. Over five-fold performance improvement over the results in Chowdhury et al. on all sequence comparisons.

Species   Length (millions)
Human     1.80
Chimp     1.32
Baboon    1.51
Chicken   0.42
Fugu      0.27
Cow       1.46
Mouse     1.49
Rat       1.50
Cat       1.16
Dog       1.05

69 Conclusion: We present a GPU-based dynamic programming algorithm to compute the LCS of very large sequences. The GPU implementation gives an over five-fold performance boost over a single-CPU implementation.

70 Future Work: We believe our algorithm can be accelerated further with careful optimization: memory management on the GPU; memory transfer between CPU and GPU; investigation of other computation models. Implementations using 8xCPU + 2xGPU?

71 Questions? Special thanks to Rezaul Chowdhury for his support and Mike Brudno for the DNA sequence data.

72 NVIDIA CUDA: Compute Unified Device Architecture. Available on the G80 series. An architecture for utilizing the GPU as a data-parallel computing device. Eliminates the need to map computation through a graphics API. The user writes a C-style function which is then run in parallel on the GPU.

73 CH: CPU Linear Space DP: LCS reconstruction: computes output boundaries in a specific order, then traces back through the boundaries to generate the LCS. Linear space.

74 CH: CPU Linear Space DP: LCS reconstruction omissions: non-power-of-two sequence lengths; non-equal sequence lengths.

75 Integration with Parallel CPUs: Chowdhury et al. implemented a parallel version of their algorithm. No data is available for LCS, but results from other algorithms suggest we should expect a ~6 times speedup for LCS using 8 server processors. Disadvantages: the number of processors which can be used effectively scales poorly with input size; server CPUs cost between $500 and $1600 each, while the GPU we used cost $550.
