Code Generation for GPU Accelerators in the Domain of Image Preprocessing


1 Code Generation for GPU Accelerators in the Domain of Image Preprocessing
Oliver Reiche, Richard Membarth, Frank Hannig, and Jürgen Teich
Hardware/Software Co-Design, University of Erlangen-Nuremberg
Dagstuhl, April 4, 2013


2 Motivation: Medical Image Preprocessing
- Keep X-ray dosage and contrast agent as low as possible, which yields noisy images.
- Improve the quality of medical images: remove noise, detect edges, compensate for detector defects, ...
- An implementation must be as efficient as possible.
- Most algorithms are well known.
Why do we need code generation?


3 Challenge: How To Target Multiple Architectures?
Efficient code generation for different target architectures.
Domain-specific Languages
- performance
  - portable: high performance on different target hardware
  - competitive: comparable performance to hand-written code
- productivity
  - algorithm description at a high level
  - hide low-level details from the programmer
- portability
  - support different target architectures from the same algorithm description
  - support different target languages from the same algorithm description
Domain-specific languages offer both functional and performance portability.

4 Agenda
- HIPAcc
- Results
- Summary

5 HIPAcc

6 HIPAcc: The Heterogeneous Image Processing Acceleration Framework
[Figure: compilation flow]
- Input: C++ embedded DSL
- Source-to-source compiler based on Clang/LLVM, driven by domain knowledge and architecture knowledge
- Generated code: CUDA (GPU), OpenCL (x86/GPU), C/C++ (x86), Renderscript (x86/ARM/GPU)
- CUDA/OpenCL/Renderscript runtime library

7 Domain Analysis: Image Processing Kernel Categorization
Identified three groups of kernels:
- Point operators [HPPC 11]
  - each pixel is updated independently of all other pixels
- Local operators [IPDPS 12]
  - centered at the pixel they are applied to, [0,0]
  - bounded to the neighborhood [-m,+m] x [-n,+n]
  - operator can be applied in parallel
- Global operators [ISPDC 12]
  - pixels of the whole image contribute to the result
  - for instance, reduction operators

[HPPC 11] Richard Membarth, Anton Lokhmotov, and Jürgen Teich. Generating GPU Code from a High-level Representation for Image Processing Kernels. In: Proceedings of the 5th Workshop on Highly Parallel Processing on a Chip (HPPC). Springer. Bordeaux, France, Aug. 30, 2011.
[IPDPS 12] Richard Membarth et al. Generating Device-specific GPU Code for Local Operators in Medical Imaging. In: Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE. Shanghai, China, May 21-25, 2012.
[ISPDC 12] Richard Membarth et al. Automatic Optimization of In-Flight Memory Transactions for GPU Accelerators based on a Domain-Specific Language for Medical Imaging. In: Proceedings of the 11th International Symposium on Parallel and Distributed Computing (ISPDC). IEEE. Munich, Germany, June 25-29, 2012.
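The three kernel categories can be illustrated with a small plain C++ sketch. It is not HIPAcc code: the `Img` type and the functions `scale`, `box_mean`, and `max_value` are hypothetical names chosen for illustration, and the local operator simply skips the border instead of applying a boundary mode.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal image type for illustration (not the HIPAcc Image class).
struct Img {
    int width, height;
    std::vector<float> data;
    Img(int w, int h, float v = 0.0f)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, v) {}
    float& at(int x, int y) { return data[static_cast<std::size_t>(y) * width + x]; }
    float at(int x, int y) const { return data[static_cast<std::size_t>(y) * width + x]; }
};

// Point operator: each output pixel depends only on the input pixel at the same position.
Img scale(const Img& in, float gain) {
    Img out(in.width, in.height);
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x)
            out.at(x, y) = gain * in.at(x, y);
    return out;
}

// Local operator: mean over the neighborhood [-m,+m] x [-n,+n] around each pixel.
// Only interior pixels are computed here; border pixels stay at zero.
Img box_mean(const Img& in, int m, int n) {
    assert(m >= 0 && n >= 0);
    Img out(in.width, in.height);
    const float norm = 1.0f / static_cast<float>((2 * m + 1) * (2 * n + 1));
    for (int y = n; y < in.height - n; ++y)
        for (int x = m; x < in.width - m; ++x) {
            float sum = 0.0f;
            for (int dy = -n; dy <= n; ++dy)
                for (int dx = -m; dx <= m; ++dx)
                    sum += in.at(x + dx, y + dy);
            out.at(x, y) = sum * norm;
        }
    return out;
}

// Global operator: a reduction in which every pixel contributes to one result.
float max_value(const Img& in) {
    assert(!in.data.empty());
    float best = in.data[0];
    for (float v : in.data)
        if (v > best) best = v;
    return best;
}
```

In the point and local cases every output pixel can be computed independently, which is what makes them map naturally onto one GPU thread per pixel; the reduction needs cooperation across pixels.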

8 HIPAcc: The Heterogeneous Image Processing Acceleration Framework
Domain-specific Extensions
- IterationSpace: defines ROI of the output image
- Accessor: input ROI with filtering (nearest, bilinear, bicubic, ...)
- BoundaryCondition: boundary handling modes
- Mask: convolution mask
[Figure (IterationSpace): output image; crop of output image; crop of output image with offset.]

9 HIPAcc: Domain-specific Extensions (Accessor)
[Figure (Accessor): image and boundary; image crop; image crop with offset; image offset.]

10 HIPAcc: Domain-specific Extensions (BoundaryCondition)
[Figure (BoundaryCondition): a 4x4 image with pixels A to P, extended beyond its border in four boundary handling modes: Repeat, Clamp, Mirror, and Constant (out-of-bounds pixels set to a constant value Q).]
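The four boundary modes on this slide (Repeat, Clamp, Mirror, Constant) amount to different ways of mapping an out-of-range coordinate back into the image. The following plain C++ sketch shows one way to write that mapping; the names `Boundary`, `clamp_index`, `repeat_index`, `mirror_index`, and `read_pixel` are hypothetical and not part of the framework. The mirror variant duplicates the edge pixel (index -1 maps to 0), which is what the letter grid in the slide's figure suggests.

```cpp
#include <cassert>
#include <vector>

enum class Boundary { Clamp, Repeat, Mirror, Constant };

// Clamp: out-of-range coordinates stick to the nearest edge pixel.
inline int clamp_index(int i, int n) {
    assert(n > 0);
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// Repeat: the image is tiled periodically (wrap-around).
inline int repeat_index(int i, int n) {
    assert(n > 0);
    const int r = i % n;
    return r < 0 ? r + n : r;
}

// Mirror: the image is reflected at its border, duplicating the edge pixel.
inline int mirror_index(int i, int n) {
    assert(n > 0);
    const int period = 2 * n;
    int r = i % period;
    if (r < 0) r += period;
    return r < n ? r : period - 1 - r;
}

// Read a pixel from a row-major image, applying the chosen boundary mode.
// For Boundary::Constant, out-of-range reads return `constant`.
inline float read_pixel(const std::vector<float>& img, int width, int height,
                        int x, int y, Boundary mode, float constant = 0.0f) {
    assert(static_cast<int>(img.size()) == width * height);
    switch (mode) {
    case Boundary::Clamp:
        x = clamp_index(x, width);  y = clamp_index(y, height);  break;
    case Boundary::Repeat:
        x = repeat_index(x, width); y = repeat_index(y, height); break;
    case Boundary::Mirror:
        x = mirror_index(x, width); y = mirror_index(y, height); break;
    case Boundary::Constant:
        if (x < 0 || x >= width || y < 0 || y >= height) return constant;
        break;
    }
    return img[y * width + x];
}
```

Evaluating such a mapping on every read is exactly the per-pixel conditional work that a code generator would want to avoid for interior pixels.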

11 HIPAcc: Domain-specific Extensions (Mask)
[Figure (Mask): convolution mask f(x,y) plotted over x and y.]

12 HIPAcc Example: Gaussian Blur

    /* ... */
    Image<uchar> in(width, height);
    Image<float> out(width, height);
    Mask<float> mask(size, size);

    in = in_image;
    out = out_image;
    mask = filter_mask;

    BoundaryCondition bound(in, mask, BOUNDARY_CLAMP);
    AccessorLF<uchar> acc(bound, width, height, 0, 0);
    IterationSpace<float> iter(out, width/2, height/2, width/4, height/4);
    GaussianBlur filter(iter, acc, mask, size/2);

    filter.execute();
    out_image = out;

13 HIPAcc Example: Gaussian Blur Kernel

    class GaussianBlur : public Kernel<float> {
        Mask<float> mask;
        Accessor<uchar> input;
        size_t range;

      public:
        GaussianBlur(IterationSpace<float> iter, Accessor<uchar> acc,
                     Mask<float> mask, size_t range)
            : Kernel(iter), input(acc), mask(mask), range(range) {
            addaccessor(acc);
        }

        void kernel() {
            float sum = .0f;
            for (int yf = -range; yf <= range; ++yf)
                for (int xf = -range; xf <= range; ++xf)
                    sum += input(xf, yf) * mask(xf, yf);
            output() = sum;
        }
    };

14 HIPAcc Example: Gaussian Blur Kernel + Lambda Function

    class GaussianBlur : public Kernel<float> {
        Mask<float> mask;
        Accessor<uchar> input;
        size_t range;

      public:
        GaussianBlur(IterationSpace<float> iter, Accessor<uchar> acc,
                     Mask<float> mask, size_t range)
            : Kernel(iter), input(acc), mask(mask), range(range) {
            addaccessor(acc);
        }

        // Lambda function for convolution
        void kernel() {
            output() = convolve(mask, HipaccSUM, [&]() {
                return input(mask) * mask();
            });
        }
    };
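To make the idea behind `convolve(mask, HipaccSUM, lambda)` concrete, here is a plain C++ sketch of a comparable helper. It is an illustration under stated assumptions, not the framework's implementation: `Mask2D`, `Reduce`, this `convolve`, and `blur_pixel` are hypothetical, and the lambda receives the offset and coefficient as explicit arguments instead of reading them implicitly from the mask as on the slide.

```cpp
#include <cassert>
#include <vector>

// Hypothetical reduction modes, loosely modelled on the HipaccSUM argument.
enum class Reduce { Sum, Max };

// Hypothetical square mask of odd size with row-major coefficients,
// addressed relative to its center: dx, dy in [-size/2, +size/2].
struct Mask2D {
    int size;
    std::vector<float> coeff;
    float operator()(int dx, int dy) const {
        const int r = size / 2;
        return coeff[(dy + r) * size + (dx + r)];
    }
};

// Calls body(dx, dy, coefficient) for every mask position and reduces the results.
template <typename F>
float convolve(const Mask2D& mask, Reduce mode, F&& body) {
    assert(mask.size > 0 && mask.size % 2 == 1);
    assert(static_cast<int>(mask.coeff.size()) == mask.size * mask.size);
    const int r = mask.size / 2;
    bool first = true;
    float acc = 0.0f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            const float v = body(dx, dy, mask(dx, dy));
            if (first) { acc = v; first = false; }
            else if (mode == Reduce::Sum) acc += v;
            else if (v > acc) acc = v;
        }
    return acc;
}

// One output pixel of a blur on a row-major image with clamp boundary handling.
float blur_pixel(const std::vector<float>& img, int width, int height,
                 int x, int y, const Mask2D& mask) {
    auto clamp = [](int i, int n) { return i < 0 ? 0 : (i >= n ? n - 1 : i); };
    return convolve(mask, Reduce::Sum, [&](int dx, int dy, float w) {
        return img[clamp(y + dy, height) * width + clamp(x + dx, width)] * w;
    });
}
```

Because the loop bounds come from the mask itself, the kernel author no longer writes the two nested loops by hand, and a compiler that knows the mask size is free to unroll them.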

15 Efficient Code Generation for Boundary Handling
[Figure: the image (pixels A to P, shown with clamped borders) divided into nine regions: BH_TL, BH_T, BH_TR, BH_L, BH_NO, BH_R, BH_BL, BH_B, BH_BR.]
- generates 10 different code variants
  - minimize executed conditionals
  - minimize divergence
- block index determines code variant
- limit necessary boundary handling with respect to mask size and image padding
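How a block index can select a region is sketched below in plain C++. The region names are taken from the slide; everything else (the `Launch` struct, `classify_block`, and the exact border test) is an assumption made for illustration and may differ from what the generated code does. The sketch covers only the nine regions named in the figure and assumes the image is large enough that no block touches two opposite borders.

```cpp
#include <cassert>

// Region names as on the slide: corners, edges, and the interior (BH_NO).
enum Region { BH_TL, BH_T, BH_TR, BH_L, BH_NO, BH_R, BH_BL, BH_B, BH_BR };

// Hypothetical launch description: image size, block size, and the
// half-extent of the mask in x and y.
struct Launch {
    int width, height;
    int block_w, block_h;
    int half_x, half_y;
};

// Decide which region the block (bx, by) belongs to. A block needs handling
// for a border only if some pixel in it can read across that border.
Region classify_block(const Launch& c, int bx, int by) {
    assert(c.block_w > 0 && c.block_h > 0);
    const int x0 = bx * c.block_w, x1 = x0 + c.block_w - 1;
    const int y0 = by * c.block_h, y1 = y0 + c.block_h - 1;
    const bool left   = x0 - c.half_x < 0;
    const bool right  = x1 + c.half_x >= c.width;
    const bool top    = y0 - c.half_y < 0;
    const bool bottom = y1 + c.half_y >= c.height;
    // Assumption of this sketch: no block touches two opposite borders.
    assert(!(left && right) && !(top && bottom));
    const int col = left ? 0 : (right ? 2 : 1);
    const int row = top ? 0 : (bottom ? 2 : 1);
    static const Region table[3][3] = {
        {BH_TL, BH_T,  BH_TR},
        {BH_L,  BH_NO, BH_R},
        {BH_BL, BH_B,  BH_BR},
    };
    return table[row][col];
}
```

Since all threads of a block take the same branch, the decision costs one test per block instead of several per pixel, and interior blocks (BH_NO) run without any boundary checks.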

16 Mapping of GPU Memory Accesses
Image preprocessing: mostly load, compute, store, and therefore memory bound.
[Figure: architecture model]
- Memory type: global memory, constant memory, texture memory, surface memory, local memory
- Optimizations: memory access alignment, unrolling
- Target (e.g., Kepler35, SouthernIsland, Midgard, ...)

17 Results

18 Results
Gaussian Blur 5x5 (separated): 12 vs. 238 lines of CUDA code
Table columns (boundary mode): Undef., Clamp, Repeat, Mirror, Const.
Table rows as extracted (most timings are missing):
- naïve: crash
- OpenCV: n/a
- RapidMind: crash, n/a
- Halide: n/a, 8.93, n/a, n/a, n/a
- NPP (8-bit): 6.86, n/a, n/a, n/a, n/a
- HIPAcc CUDA
Image of pixels on a Tesla C2050. Times in ms.

Bilateral Grid Filter: 62 vs. 386 lines of CUDA code
Table columns: GTX 680, i
Table rows as extracted (timings are missing):
- Handtuned CUDA: n/a
- Halide
- HIPAcc CUDA: n/a
- HIPAcc OpenCL
Image of pixels. Times in ms.

19 Summary

20 Summary
HIPAcc
- domain-specific language for image preprocessing
- optimizations tailored to the application domain
- architecture model for GPU accelerators
- target-specific code generator for CUDA and OpenCL
- transformations based on
  - domain knowledge
  - architecture information
- provides: performance, productivity, and portability
Recent Work
- three new target backends
  - Renderscript
  - Renderscript GPU
  - Filterscript
- support for the Midgard architecture (ARM Mali T604)

21 Questions?
HIPAcc framework sources released under Simplified BSD License.

22 Results: Gaussian Blur, 5x5 window
[Figure: execution time [ms] on CPU and GPU. Samsung Exynos 5: CL, RS, FS, CV, SI. Qualcomm Snapdragon S4 Pro: RS, FS, CV, SI.]
