
DATA PARALLEL COMPUTING

CUDA raytracing algorithm for visualizing discrete element model output

Anders Damsgaard Christensen

Contact: anders.damsgaard@geo.au.dk. Manuscript, last revision: October 18.

Abstract: A raytracing algorithm is constructed using the CUDA API for visualizing output from a CUDA discrete element model, which outputs spatial information in dynamic particle systems. The raytracing algorithm is optimized with constant memory and compilation flags, and performance is measured as a function of the number of particles and the number of pixels. The execution time is compared to equivalent CPU code, and the speedup under a variety of conditions is found to have a mean value of 55.6.

Index Terms: CUDA, discrete element method, raytracing

I. INTRODUCTION

Visualizing systems containing many spheres using traditional object-order graphics rendering can often result in very high computational requirements, as the usual automated approach is to construct a meshed surface with a specified resolution for each sphere. The memory requirements are thus quite high, as each surface will consist of many vertices. Raytracing [1] is a viable alternative, where spherical entities are saved as data structures with a centre coordinate and a radius. The rendering is performed on the basis of these values, which results in a perfectly smooth sphere surface.

To accelerate the rendering, the algorithm is constructed utilizing the CUDA API [2], where the problem is divided into n × m threads, corresponding to the desired output image resolution. Each thread iterates through all particles and applies a simple shading model to determine the final RGB values of the pixel.

Previous studies of GPU or CUDA implementations of ray tracing algorithms reported major speedups compared to corresponding CPU applications (e.g. [3], [4], [5], [6]). None of that software, however, was found to be open-source and GPL-licensed, so a simple raytracer was constructed, customized to render particles stored in a specific data format.

A. Discrete Element Method

The input particle data to the raytracer is the output of a custom CUDA-based Discrete Element Method (DEM) application currently in development. The DEM model is used to numerically simulate the response of a drained, soft, granular sediment bed to normal stresses and shearing velocities similar to those in subglacial environments under ice streams [7]. In contrast to laboratory experiments on granular material, the discrete element method [8] approach allows close monitoring of the progressive deformation, where all involved physical parameters of the particles and spatial boundaries are readily available for continuous inspection.

The discrete element method (DEM) is a subtype of molecular dynamics (MD). It discretizes time into sufficiently small timesteps, and treats the granular material as discrete grains interacting through contact forces. Between time steps, the particles are allowed to overlap slightly, and the magnitude of the overlap and the kinematic states of the particles are used to compute the normal and shear components of the contact force. The particles are treated as spherical entities, which simplifies the contact search. The spatial simulation domain is divided using a homogeneous, uniform, cubic grid, which greatly reduces the number of possible contacts that are checked during each timestep. The grid-particle list is sorted using Thrust, and updated each timestep. The new particle positions and kinematic values are updated by inserting the resulting force and torque into Newton's second law, and using a Taylor-based second-order integration scheme to calculate new linear and rotational accelerations, velocities and positions.

B. Application usage

The CUDA DEM application is a command line executable, and writes updated particle information to custom binary files at a specified interval.
This raytracing algorithm is also constructed to run from the command line, to be non-interactive, and to write output images in the PPM image format. This format is chosen to allow rendering to take place on cluster nodes with CUDA-compatible devices. Both the CUDA DEM and raytracing applications are open-source, although still under heavy development.

This document consists of a short introduction to the basic mathematics behind the ray tracing algorithm, an explanation of the implementation using the CUDA API [2], and a presentation of the results. The CUDA device source code and C++ host source code for the ray tracing algorithm can be found in the appendix, along with instructions for compilation and execution of the application.

II. RAY TRACING ALGORITHM

The goal of the ray tracing algorithm is to compute the shading of each pixel in the image [9]. This is performed by creating a viewing ray from the eye into the scene, finding the closest intersection with a scene object, and computing the resulting color. The general structure of the program is demonstrated in the following pseudo-code:

    for each pixel do
        compute viewing ray origin and direction
        iterate through objects and find the closest hit
        set pixel color to value computed from hit point, light, n

The implemented code does not utilize recursive rays, since the modeled material grains are matte in appearance.

A. Ray generation

The rays are in vector form defined as:

$$\mathbf{p}(t) = \mathbf{e} + t(\mathbf{s} - \mathbf{e}) \qquad (1)$$

The projection can be either orthographic, where all viewing rays have the same direction but different starting points, or perspective, where the starting point is the same but the direction is slightly different [9]. For the purposes of this application, a perspective projection was chosen, as it results in the most natural-looking image. The ray data structures were kept flexible enough to allow an easy implementation of orthographic projection, if this is desired at a later point.

The ray origin $\mathbf{e}$ is the position of the eye, and is constant. The direction is unique for each ray, and is computed using:

$$\mathbf{s} - \mathbf{e} = -d\,\mathbf{w} + u\,\mathbf{u} + v\,\mathbf{v} \qquad (2)$$

where $\{\mathbf{u}, \mathbf{v}, \mathbf{w}\}$ are the orthonormal basis vectors of the camera coordinate system, and $d$ is the focal length [9]. The camera coordinates of pixel $(i, j)$ in the image plane, $u$ and $v$, are calculated by:

$$u = l + (r - l)(i + 0.5)/n$$
$$v = b + (t - b)(j + 0.5)/m$$

where $l$, $r$, $t$ and $b$ are the positions of the image borders (left, right, top and bottom) in camera space. The values $n$ and $m$ are the number of pixels in each dimension.

B. Ray-sphere intersection

Given a sphere with a center $\mathbf{c}$ and radius $R$, an equation can be constructed, where $\mathbf{p}$ are all points placed on the sphere surface:

$$(\mathbf{p} - \mathbf{c}) \cdot (\mathbf{p} - \mathbf{c}) - R^2 = 0 \qquad (3)$$

By substituting the points $\mathbf{p}$ with the ray equation (1), writing the ray direction as $\mathbf{d} = \mathbf{s} - \mathbf{e}$, and rearranging the terms, a quadratic equation emerges:

$$(\mathbf{d} \cdot \mathbf{d})\,t^2 + 2\,\mathbf{d} \cdot (\mathbf{e} - \mathbf{c})\,t + (\mathbf{e} - \mathbf{c}) \cdot (\mathbf{e} - \mathbf{c}) - R^2 = 0 \qquad (4)$$

The number of ray steps $t$ is the only unknown, so the number of intersections is found by calculating the discriminant:

$$\Delta = \left(2\,(\mathbf{d} \cdot (\mathbf{e} - \mathbf{c}))\right)^2 - 4\,(\mathbf{d} \cdot \mathbf{d})\left((\mathbf{e} - \mathbf{c}) \cdot (\mathbf{e} - \mathbf{c}) - R^2\right) \qquad (5)$$

A negative value denotes no intersection between the sphere and the ray, a value of zero means that the ray touches the sphere at a single point (ignored in this implementation), and a positive value denotes that there are two intersections, one where the ray enters the sphere, and one where it exits. In the code, a conditional branch checks whether the discriminant is positive. If this is the case, the distance to the intersection in ray steps is calculated using:

$$t = \frac{-\mathbf{d} \cdot (\mathbf{e} - \mathbf{c}) \pm \sqrt{\eta}}{\mathbf{d} \cdot \mathbf{d}} \qquad (6)$$

where

$$\eta = (\mathbf{d} \cdot (\mathbf{e} - \mathbf{c}))^2 - (\mathbf{d} \cdot \mathbf{d})\left((\mathbf{e} - \mathbf{c}) \cdot (\mathbf{e} - \mathbf{c}) - R^2\right)$$

Only the smallest intersection distance ($t_\text{minus}$) is calculated, since this marks the point where the ray enters the sphere. If this value is smaller than previous intersection distances, the intersection point $\mathbf{p}$ and the surface normal $\mathbf{n}$ at the intersection point are calculated:

$$\mathbf{p} = \mathbf{e} + t_\text{minus}\,\mathbf{d} \qquad (7)$$
$$\mathbf{n} = 2(\mathbf{p} - \mathbf{c}) \qquad (8)$$

The intersection distance in vector steps ($t_\text{minus}$) is saved in order to allow comparison of the distance with later intersections.

C. Pixel shading

The pixel is shaded using Lambertian shading [9], where the pixel color is proportional to the cosine of the angle between the light vector ($\mathbf{l}$) and the surface normal. An ambient shading component is added to simulate global illumination, and to prevent the spheres from being completely black on the side facing away from the light:

$$L = k_a I_a + k_d I_d \max(0, \mathbf{n} \cdot \mathbf{l}) \qquad (9)$$

where the $a$ and $d$ subscripts denote the ambient and diffusive (Lambertian) components of the coefficients ($k$) and light intensities ($I$).
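As an illustration of equations (4) to (9), the intersection test and the shading term can be sketched in plain C++. This is a minimal sketch under stated assumptions, not the kernel from the appendix: the `Vec3` type and the function names `intersectSphere`, `surfaceNormal` and `shade` are hypothetical, the discriminant is evaluated in its reduced form η = Δ/4 (which has the same sign as Δ), and spheres behind the eye are not handled specially.

```cpp
#include <algorithm>
#include <cmath>

// Minimal 3-vector type for this sketch (hypothetical; the appendix code uses CUDA's float3).
struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(float s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 normalize(Vec3 a) { return (1.0f / std::sqrt(dot(a, a))) * a; }

// Returns true and writes the entry distance t_minus if the ray e + t*d
// crosses the sphere (centre c, radius R). Tangent rays count as misses.
inline bool intersectSphere(Vec3 e, Vec3 d, Vec3 c, float R, float& t_minus)
{
    const Vec3 ec = e - c;
    const float A = dot(d, d);
    const float half_B = dot(d, ec);            // B = 2 d.(e - c)
    const float C = dot(ec, ec) - R * R;
    const float eta = half_B * half_B - A * C;  // eta = Delta / 4
    if (eta <= 0.0f)
        return false;
    t_minus = (-half_B - std::sqrt(eta)) / A;   // smaller root of eq. (6)
    return true;
}

// Unit surface normal at the hit point p = e + t_minus*d, eqs. (7) and (8).
inline Vec3 surfaceNormal(Vec3 e, Vec3 d, float t_minus, Vec3 c)
{
    const Vec3 p = e + t_minus * d;
    return normalize(2.0f * (p - c));
}

// Ambient plus Lambertian term of eq. (9) for one colour channel.
// n and l are assumed to be unit vectors.
inline float shade(Vec3 n, Vec3 l, float k_a, float I_a, float k_d, float I_d)
{
    return k_a * I_a + k_d * I_d * std::max(0.0f, dot(n, l));
}
```

For a ray starting at the origin and pointing along -z towards a unit sphere centred at (0, 0, -5), `intersectSphere` reports a hit at t_minus = 4, and `surfaceNormal` returns (0, 0, 1).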
The pixel color L is calculated once per color channel.

D. Computational implementation

The above routines were first implemented in CUDA for device execution, and afterwards ported to a CPU C++ equivalent, used for comparing performance. The CPU raytracing algorithm was parallelized for shared-memory execution using OpenMP [10]. The execution method can be chosen when launching the raytracer from the command line; see the appendix for details. In the CPU implementation, all data was stored in linear arrays of the right size, ensuring 100% memory efficiency.

III. CUDA IMPLEMENTATION

When constructing the algorithm for execution on the GPGPU device, the data-parallel nature of the problem (SIMD: single instruction, multiple data) is used to decompose the rendering task into one thread per pixel. Each thread iterates through all particles, and ends up writing the resulting color to the image memory.

The application starts by reading the discrete element method data from a custom binary file. The particle data, consisting of position vectors in three-dimensional Euclidean space (R^3) and particle radii, is stored together in a float4 array, with the particle radius in the w position. This has large advantages over storing the data in separate float3 and float arrays: using float4 (instead of float3) data allows coalesced memory access [11] to the arrays of data in device memory, resulting in efficient memory requests and transfers [12], and the data access pattern is coherent and convenient. Other three-component vectors were also stored as float4 for the same reasons, even though this sometimes caused a slight memory redundancy. The image data is saved in a three-channel linear unsigned char array.

The algorithm starts by allocating memory on the device for the particle data, the ray parameters, and the image RGB values. Afterwards, all particle data is transferred from the host to the device memory. All pixel values are initialized to [R, G, B] = [0, 0, 0], which serves as the image background color. Afterwards, a kernel is executed with a thread for each pixel, testing for intersections between the pixel's viewing ray and all particles, and returning the closest particle. This information is used when computing the shading of the pixel. After all pixel values have been computed, the image data is transferred back to the host memory, and written to the disk. The application ends by freeing dynamically allocated memory on both the device and the host.

Fig. 1. Sample output of GPU raytracer rendering of 512 particles.

A. Thread and block layout

The thread/block layout passed during kernel launches is arranged in the following manner:

    dim3 threads(16, 16);
    dim3 blocks((width+15)/16, (height+15)/16);

The image pixel position of the thread can be determined from the thread and block index and dimensions. The layout corresponds to a thread tile size of 256, and a dynamic number of blocks, ensured to fit the image dimensions with only a small amount of redundancy [13]. Since this method will initialize extra threads in most situations, all kernels (with return type void) start by checking whether the thread/block index actually falls inside of the image dimensions:

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int mempos = i + j * blockDim.x * gridDim.x;
    if (mempos >= pixels)
        return;

The linear memory position (mempos) is used as the index when reading or writing to the linear arrays residing in global device memory.

Fig. 2. Sample output of GPU raytracer rendering of particles.

B. Image output

After completing all pixel shading computations on the device, the image data is transferred back to the host memory, and written to a PPM image file together with a header. This file is converted to the PNG format using ImageMagick.

C. Performance

Since this simple raytracing algorithm generates a single non-recursive ray for each pixel, which in turn checks all spheres for intersection, the application is expected to scale as O(n × m × N), where n and m are the output image dimensions in pixels, and N is the number of particles. The data transfer between the host and device is kept at a bare minimum, as the intercommunication is considered a bottleneck in relation to the potential device performance [11].
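To give a feeling for the numbers behind the 16 by 16 thread layout and the O(n × m × N) estimate, the launch size and the number of ray-sphere tests can be worked out with a few lines of C++. This is only a back-of-the-envelope sketch: the helper names `blocksAlong`, `launchedThreads` and `intersectionTests` are hypothetical and do not appear in the raytracer source.

```cpp
#include <cstdint>

// Number of thread blocks needed along one image dimension, rounded up,
// matching the (width+15)/16 expression used for the kernel launches.
constexpr std::uint64_t blocksAlong(std::uint64_t pixels_along, std::uint64_t tile = 16)
{
    return (pixels_along + tile - 1) / tile;
}

// Total threads launched by a grid of 16x16 blocks covering a width x height image.
// Any excess over width*height is redundant and exits early in the kernels.
constexpr std::uint64_t launchedThreads(std::uint64_t width, std::uint64_t height)
{
    return blocksAlong(width) * 16 * blocksAlong(height) * 16;
}

// Ray-sphere tests per frame under the O(n m N) model:
// one ray per pixel, each tested against every particle.
constexpr std::uint64_t intersectionTests(std::uint64_t n, std::uint64_t m, std::uint64_t N)
{
    return n * m * N;
}
```

For an 800 by 800 pixel image, the grid is 50 by 50 blocks with no redundant threads, whereas an 810 by 800 image would launch 816 × 800 = 652,800 threads for 648,000 pixels. With 5832 particles, an 800 by 800 image implies 800 × 800 × 5832 = 3,732,480,000 intersection tests per frame.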

Fig. 3. Performance scaling with varying particle numbers at image dimensions 800 by 800 pixels. (Axes: time [ms] against number of particles. Series: device kernels; host<->device transfer and allocation; host kernels.)

Fig. 4. Performance scaling with varying image dimensions (n × m) with 5832 particles. (Axes: time [ms] against number of pixels. Series: device kernels; host<->device transfer and allocation; host kernels.)

Thread synchronization points are only inserted where necessary, and the code is optimized by the compilers to the target architecture (see appendix).

The host execution time was profiled using a clock()-based CPU timer from time.h, which was normalized using the constant CLOCKS_PER_SEC. The device execution time was profiled using two cudaEvent_t timers. One measured the time spent in the entire device code section, including device memory allocation, data transfer to and from the host, execution of the kernels, and memory deallocation. The other timer only measured time spent in the kernels. The threads were synchronized before stopping the timers. A simple CPU timer using clock() will not work for the device code, since control is returned to the host thread before the device code has completed all tasks.

Figures 3 and 4 show the profiling results, where the number of particles and the image dimensions were varied. With the exception of executions with small image dimensions, the kernel execution time results agree with the O(n × m × N) scaling prediction.

The device memory allocation and data transfer was also profiled, and turns out to be only weakly dependent on the particle numbers (fig. 3), but more strongly correlated to image dimensions (fig. 4). As with kernel execution times, the execution time converges towards an overhead value at small image dimensions.

The CPU time spent in the host kernels proves to be linear with the particle numbers, and linear with the image dimensions. This is due to the non-existent overhead caused by initialization of the device code, and reduced memory transfer.
The speedup was calculated as the ratio between the CPU computation time and the sum of the device kernel execution time, the host-device memory transfer time and the additional memory allocation time. Across the 11 comparative measurements presented in the figures, it had a mean value of 55.6 and a variance of 739. Note that the smallest speedups were recorded when using very small image dimensions, which are probably unrealistic in real use.

As the number of particles is not known at compile time, it is not possible to store particle positions and radii in constant memory. Shared memory was also avoided on purpose, since the memory per thread block (64 kB) would not be sufficient to hold the float4 particle data when rendering larger simulations. The constant memory was however utilized for storing the camera-related parameters: the orthonormal base vectors, the observer position, the image dimensions, the focal length, and the light vector.

Previous GPU implementations often rely on k-d trees, constructed as a sorting method for static scene objects [3], [5]. A k-d tree implementation would drastically reduce the global memory access induced by each thread, so it is the next logical step with regard to optimizing the ray tracing algorithm presented here.

IV. CONCLUSION

This document presented the implementation of a basic ray tracing algorithm, utilizing the highly data-parallel nature of the problem when porting the work load to CUDA. Performance tests showed the expected linear correlation between image dimensions, particle numbers and execution time. Comparisons with an equivalent CPU algorithm showed large speedups, typically up to two orders of magnitude. This speedup did not come at the cost of less correct results.

The final product will come into good use during further development and completion of the CUDA DEM particle model, and is ideal since it can be used for offline rendering on dedicated, heterogeneous GPU-CPU computing nodes.
The included device code will be the preferred method of execution, whenever the host system allows it.

APPENDIX A
TEST ENVIRONMENT

The raytracing algorithm was developed, tested and profiled on a mid 2010 Mac Pro with a 2.8 GHz Quad-Core Intel Xeon CPU and an NVIDIA Quadro 4000 for Mac, dedicated to CUDA applications. The CUDA driver was version , the CUDA compilation tools release 4.0, V The GCC tools were version Each CPU core supports two hardware threads, for a total of 8 threads.

The CUDA source code was compiled with nvcc, and linked to g++ compiled C++ source code with g++. For all benchmark tests, the code was compiled with the following commands:

    g++ -c -Wall -O3 -arch x86_64 -fopenmp ...
    nvcc -c -use_fast_math -gencode arch=compute_20,code=sm_20 -m64 ...
    g++ -arch x86_64 -lcuda -lcudart -fopenmp *.o -o rt

When profiling device code performance, the application was executed two times, and the time of the second run was noted. This was performed to avoid latency caused by device driver initialization.

The host system was measured to have a memory bandwidth of MB/s when transferring data from the host to the device, and MB/s when transferring data from the device to the host.

REFERENCES

[1] T. Whitted, "An improved illumination model for shaded display," Communications of the ACM, vol. 23, no. 6.
[2] NVIDIA, CUDA C Programming Guide, 3rd ed. NVIDIA Corporation: Santa Clara, CA, USA, Oct.
[3] D. Horn, J. Sugerman, M. Houston, and P. Hanrahan, "Interactive k-d tree GPU raytracing," Association for Computing Machinery, Inc.
[4] M. Shih, Y. Chiu, Y. Chen, and C. Chang, "Real-time ray tracing with CUDA," Algorithms and Architectures for Parallel Processing.
[5] S. Popov, J. Günther, H. Seidel, and P. Slusallek, "Stackless kd-tree traversal for high performance GPU ray tracing," in Computer Graphics Forum, vol. 26, no. 3. Wiley Online Library, 2007.
[6] D. Luebke and S. Parker, "Interactive ray tracing with CUDA," NVIDIA Technical Presentation, SIGGRAPH.
[7] D. Evans, E. Phillips, J. Hiemstra, and C. Auton, "Subglacial till: formation, sedimentary characteristics and classification," Earth-Science Reviews, vol. 78, no. 1-2.
[8] P. Cundall and O. Strack, "A discrete numerical model for granular assemblies," Géotechnique, vol. 29.
[9] P. Shirley, M. Ashikhmin, M. Gleicher, S. Marschner, E. Reinhard, K. Sung, W. Thompson, and P. Willemsen, Fundamentals of Computer Graphics, 3rd ed. AK Peters.
[10] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2007, vol. 10.
[11] NVIDIA, CUDA C Best Practices Guide, 3rd ed. NVIDIA Corporation: Santa Clara, CA, USA, Aug 2010.
[12] L. Nyland, M. Harris, and J. Prins, "Fast n-body simulation with CUDA," GPU Gems, vol. 3.
[13] J. Sanders and E. Kandrot, CUDA by Example. Addison-Wesley.

APPENDIX B
SOURCE CODE

The entire source code, as well as input data files, can be found in the following archive: sphere-rt.tar.gz

The source code is built and run with the commands:

    make
    make run

With the make run command, the Makefile uses ImageMagick to convert the PPM file to PNG format, and the OS X command open to display the image. Other input data files, with other orders of magnitude of particle numbers, are included. The syntax for the raytracer application is the following:

    ./rt <CPU | GPU> <sphere-binary.bin> <width> <height> <output-image.ppm>

This appendix contains the following source code files:

LISTINGS
1 rt_kernel.h
2 rt_kernel.cu
3 rt_kernel_cpu.h
4 rt_kernel_cpu.cpp

A. CUDA raytracing source code

Listing 1. rt_kernel.h

    #ifndef RT_KERNEL_H
    #define RT_KERNEL_H

    #include <vector_functions.h>

    // Constants
    __constant__ float4 const_u;
    __constant__ float4 const_v;
    __constant__ float4 const_w;
    __constant__ float4 const_eye;
    __constant__ float4 const_imgplane;
    __constant__ float  const_d;
    __constant__ float4 const_light;

    // Host prototype functions

    extern "C"
    void cameraInit(float4 eye, float4 lookat, float imgw, float hw_ratio);

    extern "C"
    void checkForCudaErrors(const char* checkpoint_description);

    extern "C"
    int rt(float4* p, const unsigned int np,
           rgb* img, const unsigned int width, const unsigned int height,
           f3 origo, f3 L, f3 eye, f3 lookat, float imgw);

    #endif

Listing 2. rt_kernel.cu

    #include <iostream>
    #include <cutil_math.h>
    #include "header.h"
    #include "rt_kernel.h"

    inline __host__ __device__ float3 f4_to_f3(float4 in)
    {
        return make_float3(in.x, in.y, in.z);
    }

    inline __host__ __device__ float4 f3_to_f4(float3 in)
    {
        return make_float4(in.x, in.y, in.z, 0.0f);
    }

    // Kernel for initializing image data
    __global__ void imageInit(unsigned char* _img, unsigned int pixels)
    {
        // Compute pixel position from threadIdx/blockIdx
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        unsigned int mempos = x + y * blockDim.x * gridDim.x;
        if (mempos >= pixels)  // Index pixels-1 is the last valid position
            return;

        _img[mempos*4]     = 0;  // Red channel
        _img[mempos*4 + 1] = 0;  // Green channel
        _img[mempos*4 + 2] = 0;  // Blue channel
    }

    // Calculate ray origins and directions
    __global__ void rayInitPerspective(float4* _ray_origo,
                                       float4* _ray_direction,
                                       float4 eye,
                                       unsigned int width,
                                       unsigned int height)
    {
        // Compute pixel position from threadIdx/blockIdx
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int j = threadIdx.y + blockIdx.y * blockDim.y;
        unsigned int mempos = i + j * blockDim.x * gridDim.x;
        if (mempos >= width*height)
            return;

        // Calculate pixel coordinates in image plane
        float p_u = const_imgplane.x + (const_imgplane.y - const_imgplane.x)
                    * (i + 0.5f) / width;
        float p_v = const_imgplane.z + (const_imgplane.w - const_imgplane.z)
                    * (j + 0.5f) / height;

        // Write ray origo and direction to global memory
        _ray_origo[mempos]     = const_eye;
        _ray_direction[mempos] = -const_d*const_w + p_u*const_u + p_v*const_v;
    }

    // Check whether the pixel's viewing ray intersects with the spheres,
    // and shade the pixel correspondingly
    __global__ void rayIntersectSpheres(float4* _ray_origo,
                                        float4* _ray_direction,
                                        float4* _p,
                                        unsigned char* _img,
                                        unsigned int pixels,
                                        unsigned int np)
    {
        // Compute pixel position from threadIdx/blockIdx
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        unsigned int mempos = x + y * blockDim.x * gridDim.x;
        if (mempos >= pixels)
            return;

        // Read ray data from global memory
        float3 e = f4_to_f3(_ray_origo[mempos]);
        float3 d = f4_to_f3(_ray_direction[mempos]);
        //float step = length(d);

        // Distance, in ray steps, between object and eye,
        // initialized with a large value
        float tdist = 1e10f;

        // Surface normal at closest sphere intersection
        float3 n;

        // Intersection point coordinates
        float3 p;

        // Iterate through all particles
        for (unsigned int i=0; i<np; i++) {

            // Read sphere coordinate and radius
            float3 c = f4_to_f3(_p[i]);
            float  R = _p[i].w;

            // Calculate the discriminant: d = B^2 - 4AC
            float Delta = (2.0f*dot(d,(e-c)))*(2.0f*dot(d,(e-c)))  // B^2
                          - 4.0f*dot(d,d)                            // -4*A
                          * (dot((e-c),(e-c)) - R*R);                // C

            // If the discriminant is positive, there are two solutions:
            // one where the line enters the sphere, and one where it exits
            if (Delta > 0.0f) {

                // Calculate roots, Shirley 2009
                float t_minus = ((dot(-d,(e-c)) - sqrt( dot(d,(e-c))*dot(d,(e-c)) - dot(d,d)
                                  * (dot((e-c),(e-c)) - R*R) ) ) / dot(d,d));

                // Check whether intersection is closer than previous values
                if (fabs(t_minus) < tdist) {
                    p = e + t_minus*d;
                    tdist = fabs(t_minus);
                    n = normalize(2.0f * (p - c));  // Surface normal
                }

            } // End of solution branch

        } // End of particle loop

        // Write pixel color
        if (tdist < 1e10) {

            // Lambertian shading parameters
            float dotprod = fabs(dot(n, f4_to_f3(const_light)));
            float I_d = ...;   // Light intensity
            float k_d = 5.0f;  // Diffuse coefficient

            // Ambient shading
            float k_a = ...;
            float I_a = 5.0f;

            // Write shading model values to pixel color channels
            _img[mempos*4]     = (unsigned char) ((k_d * I_d * dotprod
                                  + k_a * I_a)*0.48f);
            _img[mempos*4 + 1] = (unsigned char) ((k_d * I_d * dotprod
                                  + k_a * I_a)*0.41f);
            _img[mempos*4 + 2] = (unsigned char) ((k_d * I_d * dotprod
                                  + k_a * I_a)*0.27f);
        }
    }

    extern "C"
    __host__ void cameraInit(float4 eye, float4 lookat, float imgw, float hw_ratio)
    {
        // Image dimensions in world space (l, r, b, t)
        float4 imgplane = make_float4(-0.5f*imgw, 0.5f*imgw,
                                      -0.5f*imgw*hw_ratio, 0.5f*imgw*hw_ratio);

        // The view vector
        float4 view = eye - lookat;

        // Construct the camera view orthonormal base
        float4 v = make_float4(0.0f, 1.0f, 0.0f, 0.0f);  // v: Pointing upward
        float4 w = view/length(view);                    // w: Pointing backwards
        float4 u = make_float4(cross(make_float3(v.x, v.y, v.z),
                                     make_float3(w.x, w.y, w.z)), 0.0f);  // u: Pointing right

        // Focal length 20% of eye vector length
        float d = length(view)*0.8f;

        // Light direction (points towards light source)
        float4 light = normalize(-1.0f*eye*make_float4(1.0f, 0.2f, 0.6f, 0.0f));

        std::cout << "Transferring camera values to constant memory\n";

        cudaMemcpyToSymbol(const_u, &u, sizeof(u));
        cudaMemcpyToSymbol(const_v, &v, sizeof(v));
        cudaMemcpyToSymbol(const_w, &w, sizeof(w));
        cudaMemcpyToSymbol(const_eye, &eye, sizeof(eye));
        cudaMemcpyToSymbol(const_imgplane, &imgplane, sizeof(imgplane));
        cudaMemcpyToSymbol(const_d, &d, sizeof(d));
        cudaMemcpyToSymbol(const_light, &light, sizeof(light));
    }

    // Check for CUDA errors
    extern "C"
    __host__ void checkForCudaErrors(const char* checkpoint_description)
    {
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            std::cout << "\nCuda error detected, checkpoint: " << checkpoint_description
                      << "\nError string: " << cudaGetErrorString(err) << "\n";
            exit(EXIT_FAILURE);
        }
    }

    // Wrapper for the rt kernel
    extern "C"
    __host__ int rt(float4* p, unsigned int np,
                    rgb* img, unsigned int width, unsigned int height,
                    f3 origo, f3 L, f3 eye, f3 lookat, float imgw) {

        using std::cout;

        cout << "Initializing CUDA:\n";

        // Initialize GPU timestamp recorders
        float t1, t2;
        cudaEvent_t t1_go, t2_go, t1_stop, t2_stop;
        cudaEventCreate(&t1_go);
        cudaEventCreate(&t2_go);
        cudaEventCreate(&t2_stop);
        cudaEventCreate(&t1_stop);

        // Start timer 1
        cudaEventRecord(t1_go, 0);

        // Allocate memory
        cout << "Allocating device memory\n";
        static float4 *_p;              // Particle positions (x,y,z) and radius (w)
        static unsigned char *_img;     // RGBw values in image
        static float4 *_ray_origo;      // Ray origo (x,y,z)
        static float4 *_ray_direction;  // Ray direction (x,y,z)
        cudaMalloc((void**)&_p, np*sizeof(float4));
        cudaMalloc((void**)&_img, width*height*4*sizeof(unsigned char));
        cudaMalloc((void**)&_ray_origo, width*height*sizeof(float4));
        cudaMalloc((void**)&_ray_direction, width*height*sizeof(float4));

        // Transfer particle data
        cout << "Transferring particle data: host -> device\n";
        cudaMemcpy(_p, p, np*sizeof(float4), cudaMemcpyHostToDevice);

        // Check for errors after memory allocation
        checkForCudaErrors("CUDA error after memory allocation");

        // Arrange thread/block structure
        unsigned int pixels = width*height;
        float hw_ratio = (float)height/(float)width;
        dim3 threads(16,16);
        dim3 blocks((width+15)/16, (height+15)/16);

        // Start timer 2
        cudaEventRecord(t2_go, 0);

        // Initialize image to background color
        imageInit<<< blocks, threads >>>(_img, pixels);

        // Initialize camera
        cameraInit(make_float4(eye.x, eye.y, eye.z, 0.0f),
                   make_float4(lookat.x, lookat.y, lookat.z, 0.0f),
                   imgw, hw_ratio);
        checkForCudaErrors("CUDA error after cameraInit");

        // Construct rays for perspective projection
        rayInitPerspective<<< blocks, threads >>>(
            _ray_origo, _ray_direction,
            make_float4(eye.x, eye.y, eye.z, 0.0f),
            width, height);

        // Find closest intersection between rays and spheres
        rayIntersectSpheres<<< blocks, threads >>>(
            _ray_origo, _ray_direction,
            _p, _img, pixels, np);

        // Make sure all threads are done before continuing CPU control sequence
        cudaThreadSynchronize();

        // Check for errors
        checkForCudaErrors("CUDA error after kernel execution");

        // Stop timer 2
        cudaEventRecord(t2_stop, 0);
        cudaEventSynchronize(t2_stop);

        // Transfer image data from device to host
        cout << "Transferring image data: device -> host\n";
        cudaMemcpy(img, _img, width*height*4*sizeof(unsigned char),
                   cudaMemcpyDeviceToHost);

        // Free dynamically allocated device memory
        cudaFree(_p);
        cudaFree(_img);
        cudaFree(_ray_origo);
        cudaFree(_ray_direction);

        // Stop timer 1
        cudaEventRecord(t1_stop, 0);
        cudaEventSynchronize(t1_stop);

        // Calculate time spent in t1 and t2
        cudaEventElapsedTime(&t1, t1_go, t1_stop);
        cudaEventElapsedTime(&t2, t2_go, t2_stop);

        // Report time spent
        cout << "Time spent on entire GPU routine: "
             << t1 << " ms\n";
        cout << "Kernels: " << t2 << " ms\n"
             << "Memory alloc. and transfer: " << t1-t2 << " ms\n";

        // Return successfully
        return 0;
    }

B. CPU raytracing source code

Listing 3. rt_kernel_cpu.h

    #ifndef RT_KERNEL_CPU_H
    #define RT_KERNEL_CPU_H

    #include <vector_functions.h>

    // Host prototype functions

    void cameraInit(float3 eye, float3 lookat, float imgw, float hw_ratio);

    int rt_cpu(float4* p, const unsigned int np,
               rgb* img, const unsigned int width, const unsigned int height,
               f3 origo, f3 L, f3 eye, f3 lookat, float imgw);

    #endif

Listing 4. rt_kernel_cpu.cpp

    #include <iostream>
    #include <math.h>
    #include <time.h>
    #include <vector_functions.h>
    #include <cutil_math.h>
    #include "header.h"
    #include "rt_kernel_cpu.h"

    // Constants

float3 constc_u;
float3 constc_v;
float3 constc_w;
float3 constc_eye;
float4 constc_imgplane;
float constc_d;
float3 constc_light;

inline float3 f4_to_f3(float4 in)
{
  return make_float3(in.x, in.y, in.z);
}

inline float4 f3_to_f4(float3 in)
{
  return make_float4(in.x, in.y, in.z, 0.0f);
}

inline float lengthf3(float3 in)
{
  return sqrt(in.x*in.x + in.y*in.y + in.z*in.z);
}

// Kernel for initializing image data
void imageInit_cpu(unsigned char* _img, unsigned int pixels)
{
  for (unsigned int mempos=0; mempos<pixels; mempos++) {
    _img[mempos*4]     = 0; // Red channel
    _img[mempos*4 + 1] = 0; // Green channel
    _img[mempos*4 + 2] = 0; // Blue channel
  }
}

// Calculate ray origins and directions
void rayInitPerspective_cpu(float3* _ray_origo,
                            float3* _ray_direction,
                            float3 eye,
                            unsigned int width,
                            unsigned int height)
{
  int i;
  #pragma omp parallel for
  for (i=0; i<width; i++) {
    for (unsigned int j=0; j<height; j++) {

      // Row-major pixel index (the listing used j*height,
      // which is only correct for square images)
      unsigned int mempos = i + j*width;

      // Calculate pixel coordinates in image plane
      float p_u = constc_imgplane.x + (constc_imgplane.y - constc_imgplane.x)
                  * (i [...]f) / width;
      float p_v = constc_imgplane.z + (constc_imgplane.w - constc_imgplane.z)
                  * (j [...]f) / height;

      // Write ray origo and direction to global memory
      _ray_origo[mempos]     = constc_eye;
      _ray_direction[mempos] = -constc_d*constc_w + p_u*constc_u + p_v*constc_v;
    }
  }
}

// Check whether the pixel's viewing ray intersects with the spheres,
// and shade the pixel correspondingly
void rayIntersectSpheres_cpu(float3* _ray_origo,
                             float3* _ray_direction,
                             float4* _p,
                             unsigned char* _img,
                             unsigned int pixels,
                             unsigned int np)
{
  int mempos;
  #pragma omp parallel for
  for (mempos=0; mempos<pixels; mempos++) {

    // Read ray data from global memory
    float3 e = _ray_origo[mempos];
    float3 d = _ray_direction[mempos];
    //float step = lengthf3(d);

    // Distance, in ray steps, between object and eye initialized with a large value
    float tdist = 1e10f;

    // Surface normal at closest sphere intersection
    float3 n;

    // Intersection point coordinates
    float3 p;

    // Iterate through all particles
    for (unsigned int i=0; i<np; i++) {

      // Read sphere coordinate and radius
      float3 c = f4_to_f3(_p[i]);
      float  R = _p[i].w;

      // Calculate the discriminant: Delta = B^2 - 4AC
      float Delta = (2.0f*dot(d,(e-c)))*(2.0f*dot(d,(e-c)))  // B^2
                    - 4.0f*dot(d,d)                          // -4*A
                    * (dot((e-c),(e-c)) - R*R);              // C

      // If the discriminant is positive, there are two solutions
      // One where the line enters the sphere, and one where it exits
      if (Delta > 0.0f) {

        // Calculate roots, Shirley 2009 p. [...]
        float t_minus = ((dot(-d,(e-c)) - sqrt( dot(d,(e-c))*dot(d,(e-c)) - dot(d,d)
                        * (dot((e-c),(e-c)) - R*R) ) ) / dot(d,d));

        // Check whether intersection is closer than previous values
        if (fabs(t_minus) < tdist) {
          p = e + t_minus*d;
          tdist = fabs(t_minus);
          n = normalize(2.0f * (p - c)); // Surface normal
        }

      } // End of solution branch

    } // End of particle loop

    // Write pixel color
    if (tdist < 1e10f) {

      // Lambertian shading parameters
      float dotprod = fabs(dot(n, constc_light));
      float I_d = [...]f;  // Light intensity
      float k_d = 5.0f;    // Diffuse coefficient

      // Ambient shading
      float k_a = [...]f;
      float I_a = 5.0f;

      // Write shading model values to pixel color channels
      _img[mempos*4]     = (unsigned char) ((k_d * I_d * dotprod
                           + k_a * I_a)*0.48f);
      _img[mempos*4 + 1] = (unsigned char) ((k_d * I_d * dotprod
                           + k_a * I_a)*0.41f);
      _img[mempos*4 + 2] = (unsigned char) ((k_d * I_d * dotprod
                           + k_a * I_a)*0.27f);
    }
  }
}


void cameraInit_cpu(float3 eye, float3 lookat, float imgw, float hw_ratio)
{
  // Image dimensions in world space (l, r, b, t)
  float4 imgplane = make_float4(-0.5f*imgw, 0.5f*imgw, -0.5f*imgw*hw_ratio, 0.5f*imgw*hw_ratio);

  // The view vector
  float3 view = eye - lookat;

  // Construct the camera view orthonormal base
  float3 v = make_float3(0.0f, 1.0f, 0.0f);  // v: Pointing upward
  float3 w = view/lengthf3(view);            // w: Pointing backwards
  float3 u = cross(make_float3(v.x, v.y, v.z), make_float3(w.x, w.y, w.z)); // u: Pointing right

  // Focal length 20% of eye vector length
  float d = lengthf3(view)*0.8f;

  // Light direction (points towards light source)
  float3 light = normalize(-1.0f*eye*make_float3(1.0f, 0.2f, 0.6f));

  std::cout << "Transferring camera values to constant memory\n";

  constc_u = u;
  constc_v = v;
  constc_w = w;
  constc_eye = eye;
  constc_imgplane = imgplane;
  constc_d = d;
  constc_light = light;
}


// Wrapper for the rt algorithm
int rt_cpu(float4* p, unsigned int np,
           rgb* img, unsigned int width, unsigned int height,
           f3 origo, f3 L, f3 eye, f3 lookat, float imgw) {

  using std::cout;

  cout << "Initializing CPU raytracer:\n";

  // Initialize CPU timestamp recorders
  float t1_go, t2_go, t1_stop, t2_stop;

  // Start timer 1
  t1_go = clock();

  // Allocate memory
  cout << "Allocating host memory\n";
  static unsigned char *_img;     // RGBw values in image
  static float3 *_ray_origo;      // Ray origo (x,y,z)
  static float3 *_ray_direction;  // Ray direction (x,y,z)
  _img           = new unsigned char[width*height*4];
  _ray_origo     = new float3[width*height];
  _ray_direction = new float3[width*height];

  // Image size and aspect ratio
  unsigned int pixels = width*height;
  float hw_ratio = (float)height/(float)width;

  // Start timer 2
  t2_go = clock();

  // Initialize image to background color
  imageInit_cpu(_img, pixels);

  // Initialize camera
  cameraInit_cpu(make_float3(eye.x, eye.y, eye.z),
                 make_float3(lookat.x, lookat.y, lookat.z),
                 imgw, hw_ratio);

  // Construct rays for perspective projection
  rayInitPerspective_cpu(
      _ray_origo, _ray_direction,
      make_float3(eye.x, eye.y, eye.z),
      width, height);

  // Find closest intersection between rays and spheres
  rayIntersectSpheres_cpu(
      _ray_origo, _ray_direction,
      p, _img, pixels, np);

  // Stop timer 2
  t2_stop = clock();

  memcpy(img, _img, sizeof(unsigned char)*pixels*4);

  // Free dynamically allocated host memory
  delete [] _img;
  delete [] _ray_origo;
  delete [] _ray_direction;

  // Stop timer 1
  t1_stop = clock();

  // Report time spent (clock ticks converted to milliseconds)
  cout << "Time spent on entire CPU raytracing routine: "
       << 1000.0f*(t1_stop-t1_go)/CLOCKS_PER_SEC << " ms\n";
  cout << "Functions: " << 1000.0f*(t2_stop-t2_go)/CLOCKS_PER_SEC << " ms\n";

  // Return successfully
  return 0;
}
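The intersection test at the core of rayIntersectSpheres_cpu can be isolated as a small standalone function. The following is a minimal sketch, not the author's code: the Vec3 type and the names dot, sub and nearestHit are hypothetical stand-ins for float3 and the cutil_math.h helpers. It solves the quadratic (d.d) t^2 + 2 d.(e-c) t + (e-c).(e-c) - R^2 = 0 for the smaller root, which is algebraically the same expression as t_minus in Listing 4.

```cpp
#include <cmath>

// Hypothetical stand-in for CUDA's float3
struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  sub(Vec3 a, Vec3 b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }

// Ray: e + t*d. Sphere: centre c, radius R.
// Returns true and writes the smaller root to *t when the discriminant
// is positive (two intersections), mirroring the Delta > 0 branch.
bool nearestHit(Vec3 e, Vec3 d, Vec3 c, float R, float* t)
{
    Vec3  ec = sub(e, c);
    float A  = dot(d, d);
    float B  = 2.0f * dot(d, ec);
    float C  = dot(ec, ec) - R*R;

    float Delta = B*B - 4.0f*A*C;   // Discriminant
    if (Delta <= 0.0f)
        return false;               // Miss or tangent ray

    *t = (-B - std::sqrt(Delta)) / (2.0f*A);  // Entry point of the ray
    return true;
}
```

For a ray starting at the origin with direction (0, 0, 1) and a unit sphere centred at (0, 0, 5), nearestHit yields t = 4, the point where the ray enters the sphere. A ray that passes the sphere at a distance larger than R gives a negative discriminant and is reported as a miss.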

APPENDIX C
BLOOPERS

This section contains pictures of the unfinished raytracer in a malfunctioning state, which can give interesting, though unusable, results.

Fig. 5. An end result encountered far too many times, e.g. after inadvertently dividing the focal length by zero.

Fig. 6. Colorful result caused by forgetting to normalize the normal vector calculation.

Fig. 7. A problem with the distance determination caused the particles to be displayed in the wrong order.

Fig. 8. Twisted appearance caused by wrong thread/block layout.

Fig. 9. Another colorful result with wrong order of particles, caused by using abs() (meant for integers) instead of fabs() (meant for floats).
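The abs()/fabs() mix-up can be reproduced outside the raytracer. The sketch below shows one plausible mechanism, assuming the call resolved to the integer-only abs(): the float argument is first truncated to int, so distinct intersection distances collapse to the same value and the closest-hit comparison can no longer order them. The helper names absViaInt and absViaFloat are hypothetical and exist only for this illustration.

```cpp
#include <cmath>
#include <cstdlib>

// Mimics calling the integer abs() on a float:
// the argument is truncated toward zero before the absolute value is taken.
float absViaInt(float x)
{
    return static_cast<float>(std::abs(static_cast<int>(x)));
}

// The intended floating-point absolute value.
float absViaFloat(float x)
{
    return std::fabs(x);
}
```

With ray distances of -0.25 and -0.75, absViaInt returns 0 for both, whereas absViaFloat returns 0.25 and 0.75 and preserves their order.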


More information

GPU-to-GPU and Host-to-Host Multipattern String Matching On A GPU

GPU-to-GPU and Host-to-Host Multipattern String Matching On A GPU GPU-to-GPU and Host-to-Host Multipattern Sing Matching On A GPU Xinyan Zha and Sartaj Sahni Computer and Information Science and Engineering University of Florida Gainesville, FL 32611 Email: {xzha, sahni}@cise.ufl.edu

More information

Sparse solver 64 bit and out-of-core addition

Sparse solver 64 bit and out-of-core addition Sparse solver 64 bit and out-of-core addition Prepared By: Richard Link Brian Yuen Martec Limited 1888 Brunswick Street, Suite 400 Halifax, Nova Scotia B3J 3J8 PWGSC Contract Number: W7707-145679 Contract

More information

A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS

A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS GTC 20130319 A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS Erik Lindahl erik.lindahl@scilifelab.se Molecular Dynamics Understand biology We re comfortably on

More information

RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures

RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures Guangyan Zhang, Zican Huang, Xiaosong Ma SonglinYang, Zhufan Wang, Weimin Zheng Tsinghua University Qatar Computing Research

More information

POLITECNICO DI MILANO DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS

POLITECNICO DI MILANO DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS POLITECNICO DI MILANO Facoltà di Ingegneria dell Informazione Corso di Laurea in Ingegneria Informatica DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS Relatore: Prof.

More information

PSEUDORANDOM numbers are very important in practice

PSEUDORANDOM numbers are very important in practice Proceedings of the Federated Conference on Computer Science and Information Systems pp 571 578 ISBN 978-83-681-51-4 Parallel GPU-accelerated Recursion-based Generators of Pseudorandom Numbers Przemysław

More information

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,

More information

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Accelerated Neutrino Oscillation Probability Calculations and Reweighting on GPUs. Richard Calland

Accelerated Neutrino Oscillation Probability Calculations and Reweighting on GPUs. Richard Calland Accelerated Neutrino Oscillation Probability Calculations and Reweighting on GPUs Richard Calland University of Liverpool GPU Computing in High Energy Physics University of Pisa, 11th September 2014 Introduction

More information

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine First, a look at using OpenACC on WRF subroutine advance_w dynamics routine Second, an estimate of WRF multi-node performance on Cray XK6 with GPU accelerators Based on performance of WRF kernels, what

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information

SIMULATION OF ISING SPIN MODEL USING CUDA

SIMULATION OF ISING SPIN MODEL USING CUDA SIMULATION OF ISING SPIN MODEL USING CUDA MIRO JURIŠIĆ Supervisor: dr.sc. Dejan Vinković Split, November 2011 Master Thesis in Physics Department of Physics Faculty of Natural Sciences and Mathematics

More information

Accelerating Quantum Chromodynamics Calculations with GPUs

Accelerating Quantum Chromodynamics Calculations with GPUs Accelerating Quantum Chromodynamics Calculations with GPUs Guochun Shi, Steven Gottlieb, Aaron Torok, Volodymyr Kindratenko NCSA & Indiana University National Center for Supercomputing Applications University

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

The conceptual view. by Gerrit Muller University of Southeast Norway-NISE

The conceptual view. by Gerrit Muller University of Southeast Norway-NISE by Gerrit Muller University of Southeast Norway-NISE e-mail: gaudisite@gmail.com www.gaudisite.nl Abstract The purpose of the conceptual view is described. A number of methods or models is given to use

More information

An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor

An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor Christian Lessig Abstract The Algorithm of Multiple Relatively Robust Representations (MRRR) is one of the most efficient and accurate

More information

CRYPTOGRAPHIC COMPUTING

CRYPTOGRAPHIC COMPUTING CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,

More information

AN2044 APPLICATION NOTE

AN2044 APPLICATION NOTE AN2044 APPLICATION NOTE OPERATING PRINCIPALS FOR PRACTISPIN STEPPER MOTOR MOTION CONTROL 1 REQUIREMENTS In operating a stepper motor system one of the most common requirements will be to execute a relative

More information

High-Performance Computing, Planet Formation & Searching for Extrasolar Planets

High-Performance Computing, Planet Formation & Searching for Extrasolar Planets High-Performance Computing, Planet Formation & Searching for Extrasolar Planets Eric B. Ford (UF Astronomy) Research Computing Day September 29, 2011 Postdocs: A. Boley, S. Chatterjee, A. Moorhead, M.

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

All-electron density functional theory on Intel MIC: Elk

All-electron density functional theory on Intel MIC: Elk All-electron density functional theory on Intel MIC: Elk W. Scott Thornton, R.J. Harrison Abstract We present the results of the porting of the full potential linear augmented plane-wave solver, Elk [1],

More information

GPU Accelerated Reweighting Calculations for Neutrino Oscillation Analyses with the T2K Experiment

GPU Accelerated Reweighting Calculations for Neutrino Oscillation Analyses with the T2K Experiment GPU Accelerated Reweighting Calculations for Neutrino Oscillation Analyses with the T2K Experiment Oxford Many Core Seminar - 19th February Richard Calland rcalland@hep.ph.liv.ac.uk Outline of Talk Neutrino

More information

Multiscale simulations of complex fluid rheology

Multiscale simulations of complex fluid rheology Multiscale simulations of complex fluid rheology Michael P. Howard, Athanassios Z. Panagiotopoulos Department of Chemical and Biological Engineering, Princeton University Arash Nikoubashman Institute of

More information

Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters. Mike Showerman, Guochun Shi Steven Gottlieb

Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters. Mike Showerman, Guochun Shi Steven Gottlieb Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters Mike Showerman, Guochun Shi Steven Gottlieb Outline Background Lattice QCD and MILC GPU and Cray XK6 node architecture Implementation

More information

ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS

ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS Alan Gray, Developer Technology Engineer, NVIDIA ECMWF 18th Workshop on high performance computing in meteorology, 28 th September 2018 ESCAPE NVIDIA s

More information

Prof. Brant Robertson Department of Astronomy and Astrophysics University of California, Santa

Prof. Brant Robertson Department of Astronomy and Astrophysics University of California, Santa Accelerated Astrophysics: Using NVIDIA GPUs to Simulate and Understand the Universe Prof. Brant Robertson Department of Astronomy and Astrophysics University of California, Santa Cruz brant@ucsc.edu, UC

More information

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and

More information

Available online at ScienceDirect. Procedia Engineering 61 (2013 ) 94 99

Available online at  ScienceDirect. Procedia Engineering 61 (2013 ) 94 99 Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 6 (203 ) 94 99 Parallel Computational Fluid Dynamics Conference (ParCFD203) Simulations of three-dimensional cavity flows with

More information

The following syntax is used to describe a typical irreducible continuum element:

The following syntax is used to describe a typical irreducible continuum element: ELEMENT IRREDUCIBLE T7P0 command.. Synopsis The ELEMENT IRREDUCIBLE T7P0 command is used to describe all irreducible 7-node enhanced quadratic triangular continuum elements that are to be used in mechanical

More information

S Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems

S Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems S4283 - Subdivide, : Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems Elmar Westphal - Forschungszentrum Jülich GmbH 1 Contents Micromagnetism TetraMag, a FEM/BEM Micromagnetism Simulator

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information

Practical Combustion Kinetics with CUDA

Practical Combustion Kinetics with CUDA Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton Practical Combustion Kinetics with CUDA GPU Technology Conference March 20, 2015 Russell Whitesides

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

The Fast Multipole Method in molecular dynamics

The Fast Multipole Method in molecular dynamics The Fast Multipole Method in molecular dynamics Berk Hess KTH Royal Institute of Technology, Stockholm, Sweden ADAC6 workshop Zurich, 20-06-2018 Slide BioExcel Slide Molecular Dynamics of biomolecules

More information

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8))

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) Tung Chou January 5, 2012 QUAD Stream cipher. Security relies on MQ (Multivariate Quadratics). QUAD The Provably-secure QUAD(q, n, r) Stream Cipher

More information

CMP N 301 Computer Architecture. Appendix C

CMP N 301 Computer Architecture. Appendix C CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)

More information

Performance Evaluation of MPI on Weather and Hydrological Models

Performance Evaluation of MPI on Weather and Hydrological Models NCAR/RAL Performance Evaluation of MPI on Weather and Hydrological Models Alessandro Fanfarillo elfanfa@ucar.edu August 8th 2018 Cheyenne - NCAR Supercomputer Cheyenne is a 5.34-petaflops, high-performance

More information

An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor

An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor Christian Lessig Abstract The Algorithm of Multiple Relatively Robust Representations (MRRRR) is one of the most efficient and most

More information

McBits: Fast code-based cryptography

McBits: Fast code-based cryptography McBits: Fast code-based cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands Joint work with Daniel Bernstein, Tung Chou December 17, 2013 IMA International Conference on Cryptography

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose 上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU

More information

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Jorge González-Domínguez Parallel and Distributed Architectures Group Johannes Gutenberg University of Mainz, Germany j.gonzalez@uni-mainz.de

More information