VISUALIZING systems containing many spheres using

Size: px

Start display at page:

Download "VISUALIZING systems containing many spheres using"

Ilene Craig
6 years ago
Views:

1 DATA PARALLEL COMPUTING CUDA raytracing algorithm for visualizing discrete element model output Anders Damsgaard Christensen, Abstract A raytracing algorithm is constructed using the CUDA API for visualizing output from a CUDA discrete element model, which outputs spatial information in dynamic particle systems. The raytracing algorithm is optimized with constant memory and compilation flags, and performance is measured as a function of the number of particles and the number of pixels. The execution time is compared to equivalent CPU code, and the speedup under a variety of conditions is found to have a mean value of 55.6 times. Index Terms CUDA, discrete element method, raytracing I. INTRODUCTION VISUALIZING systems containing many spheres using traditional object-order graphics rendering can often result in very high computational requirements, as the usual automated approach is to construct a meshed surface with a specified resolution for each sphere. The memory requirements are thus quite high, as each surface will consist of many vertices. Raytracing [1] is a viable alternative, where spheric entities are saved as data structures with a centre coordinate and a radius. The rendering is performed on the base of these values, which results in a perfectly smooth surfaced sphere. To accelerate the rendering, the algorithm is constructed utilizing the CUDA API [2], where the problem is divided into n m threads, corresponding to the desired output image resolution. Each thread iterates through all particles and applies a simple shading model to determine the final RGB values of the pixel. Previous studies of GPU or CUDA implementations of ray tracing algorithms reported major speedups, compared to corresponding CPU applications (e.g. [3], [4], [5], [6]). None of the software was however found to be open-source and GPL licensed, so a simple raytracer was constructed, customized to render particles, where the data was stored in a specific data format. A. Discrete Element Method The input particle data to the raytracer is the output of a custom CUDA-based Discrete Element Method (DEM) application currently in development. The DEM model is used to numerically simulate the response of a drained, soft, granular sediment bed upon normal stresses and shearing velocities similar to subglacial environments under ice streams [7]. In contrast to laboratory experiments on granular material, the discrete element method [8] approach allows close monitoring of the progressive deformation, where all involved physical Contact: anders.damsgaard@geo.au.dk Webpage: Manuscript, last revision: October 18, parameters of the particles and spatial boundaries are readily available for continuous inspection. The discrete element method (DEM) is a subtype of molecular dynamics (MD), and discretizes time into sufficiently small timesteps, and treats the granular material as discrete grains, interacting through contact forces. Between time steps, the particles are allowed to overlap slightly, and the magnitude of the overlap and the kinematic states of the particles is used to compute normal- and shear components of the contact force. The particles are treated as spherical entities, which simplifies the contact search. The spatial simulation domain is divided using a homogeneous, uniform, cubic grid, which greatly reduces the amount of possible contacts that are checked during each timestep. The grid-particle list is sorted using Thrust 1, and updated each timestep. The new particle positions and kinematic values are updated by inserting the resulting force and torque into Newton s second law, and using a Taylorbased second order integration scheme to calculate new linear and rotational accelerations, velocities and positions. B. Application usage The CUDA DEM application is a command line executable, and writes updated particle information to custom binary files with a specific interval. This raytracing algorithm is constructed to also run from the command line, be noninteractive, and write output images in the PPM image format. This format is chosen to allow rendering to take place on cluster nodes with CUDA compatible devices. Both the CUDA DEM and raytracing applications are opensource 2, although still under heavy development. This document consists of a short introduction to the basic mathematics behind the ray tracing algorithm, an explaination of the implementation using the CUDA API [2] and a presentation of the results. The CUDA device source code and C++ host source code for the ray tracing algorithm can be found in the appendix, along with instructions for compilation and execution of the application. II. RAY TRACING ALGORITHM The goal of the ray tracing algorithm is to compute the shading of each pixel in the image [9]. This is performed by creating a viewing ray from the eye into the scene, finding the closest intersection with a scene object, and computing the resulting color. The general structure of the program is demonstrated in the following pseudo-code:

2 DATA PARALLEL COMPUTING f o r each p i x e l do compute viewing r a y o r i g i n and d i r e c t i o n i t e r a t e t h r o u g h o b j e c t s and f i n d t h e c l o s e s t h i t s e t p i x e l c o l o r t o v a l u e computed from h i t p o i n t, l i g h t, n The implemented code does not utilize recursive rays, since the modeled material grains are matte in appearance. A. Ray generation The rays are in vector form defined as: p(t) = e + t(s e) (1) The perspective can be either orthograpic, where all viewing rays have the same direction, but different starting points, or use perspective projection, where the starting point is the same, but the direction is slightly different [9]. For the purposes of this application, a perspective projection was chosen, as it results in the most natural looking image. The ray data structures were held flexible enough to allow an easy implementation of orthographic perspective, if this is desired at a later point. The ray origin e is the position of the eye, and is constant. The direction is unique for each ray, and is computed using: s e = dw + uu + vv (2) where {u, v, w} are the orthonormal bases of the camera coordinate system, and d is the focal length [9]. The camera coordinates of pixel (i, j) in the image plane, u and v, are calculated by: u = l + (r l)(i + 0.5)/n v = b + (t b)(j + 0.5)/m where l, r, t and b are the positions of the image borders (left, right, top and bottom) in camera space. The values n and m are the number of pixels in each dimension. B. Ray-sphere intersection Given a sphere with a center c, and radius R, a equation can be constrained, where p are all points placed on the sphere surface: (p c) (p c) R 2 = 0 (3) By substituting the points p with ray equation 1, and rearranging the terms, a quadratic equation emerges: (d d)t 2 + 2d (e c)t + (e c) (e c) R 2 = 0 (4) The number of ray steps t is the only unknown, so the number of intersections is found by calculating the determinant: = (2(d (e c))) 2 4(d d)((e c) (e c) R 2 (5) A negative value denotes no intersection between the sphere and the ray, a value of zero means that the ray touches the sphere at a single point (ignored in this implementation), and a positive value denotes that there are two intersections, one when the ray enters the sphere, and one when it exits. In the code, a conditional branch checks wether the determinant is positive. If this is the case, the distance to the intersection in ray steps is calculated using: where t = d (e c) ± η (d d) η = (d (e c)) 2 (d d)((e c) (e c) R 2 ) Only the smallest intersection (t minus ) is calculated, since this marks the point where the sphere enters the particle. If this value is smaller than previous intersection distances, the intersection point p and surface normal n at the intersection point is calculated: (6) p = e + t minus d (7) n = 2(p c) (8) The intersection distance in vector steps (t minus ) is saved in order to allow comparison of the distance with later intersections. C. Pixel shading The pixel is shaded using Lambertian shading [9], where the pixel color is proportional to the angle between the light vector (l) and the surface normal. An ambient shading component is added to simulate global illumination, and prevent that the spheres are completely black: L = k a I a + k d I d max(0, (n l)) (9) where the a and d subscripts denote the ambient and diffusive (Lambertian) components of the ambient/diffusive coefficients (k) and light intensities (I). The pixel color L is calculated once per color channel. D. Computational implementation The above routines were first implemented in CUDA for device execution, and afterwards ported to a CPU C++ equivalent, used for comparing performance. The CPU raytracing algorithm was optimized to shared-memory parallelism using OpenMP [10]. The execution method can be chosen when launching the raytracer from the command line, see the appendix for details. In the CPU implementation, all data was stored in linear arrays of the right size, ensuring 100% memory efficiency. III. CUDA IMPLEMENTATION When constructing the algorithm for execution on the GPGPU device, the data-parallel nature of the problem (SIMD: single instruction, multiple data) is used to deconstruct the rendering task into a single thread per pixel. Each thread iterates through all particles, and ends up writing the resulting color to the image memory. The application starts by reading the discrete element method data from a custom binary file. The particle data, consisting of position vectors in three-dimensional Euclidean space (R 3 ) and particle radii, is stored together in a float4 array, with the particle radius in the w position. This has

DATA PARALLEL COMPUTING 2011 3 values. Afterwards, all particle data is transfered from the hostto the device memory.

3 DATA PARALLEL COMPUTING values. Afterwards, all particle data is transfered from the hostto the device memory. All pixel values are initialized to [R, G, B] = [0, 0, 0], which serves as the image background color. Afterwards, a kernel is executed with a thread for each pixel, testing for intersections between the pixel s viewing ray and all particles, and returning the closest particle. This information is used when computing the shading of the pixel. After all pixel values have been computed, the image data is transfered back to the host memory, and written to the disk. The application ends by liberating dynamically allocated memory on both the device and the host. Fig. 1. Sample output of GPU raytracer rendering of 512 particles. A. Thread and block layout The thread/block layout passed during kernel launches is arranged in the following manner: dim3 t h r e a d s ( 1 6, 16) ; dim3 b l o c k s ( ( width +15) / 1 6, ( h e i g h t +15) / 1 6 ) ; The image pixel position of the thread can be determined from the thread- and block index and dimensions. The layout corresponds to a thread tile size of 256, and a dynamic number of blocks, ensured to fit the image dimensions with only small eventual redundancy [13]. Since this method will initialize extra threads in most situations, all kernels (with return type void) start by checking wether the thread-/block index actually falls inside of the image dimensions: i n t i = t h r e a d I d x. x + b l o c k I d x. x blockdim. x ; i n t j = t h r e a d I d x. y + b l o c k I d x. y blockdim. y ; unsigned i n t mempos = x + y blockdim. x griddim. x ; i f ( mempos > p i x e l s ) return ; The linear memory position (mempos) is used as the index when reading or writing to the linear arrays residing in global device memory. Fig. 2. Sample output of GPU raytracer rendering of particles. B. Image output After completing all pixel shading computations on the device, the image data is transfered back to the host memory, and together with a header written to a PPM 3 image file. This file is converted to the PNG format using ImageMagick. large advantages to storing the data in separate float3 and float arrays; Using float4 (instead of float3) data allows coalesced memory access [11] to the arrays of data in device memory, resulting in efficient memory requests and transfers [12], and the data access pattern is coherent and convenient. Other three-component vectors were also stored as float4 for the same reasons, even though this sometimes caused a slight memory redundancy. The image data is saved in a three-channel linear unsigned char array. The algorithm starts by allocating memory on the device for the particle data, the ray parameters, and the image RGB C. Performance Since this simple raytracing algorithm generates a single non-recursive ray for each pixel, which in turn checks all spheres for intersection, the application is expected to scale in the form of O(n m N), where n and m are the output image dimensions in pixels, and N is the number of particles. The data transfer between the host and device is kept at a bare minimum, as the intercommunication is considered a bottleneck in relation to the potential device performance[11]. 3

4 DATA PARALLEL COMPUTING Time [ms] 1e Device kernels Host<->device transfer and allocation Host kernels Particles Fig. 3. Performance scaling with varying particle numbers at image dimensions 800 by 800 pixels. Time [ms] 1e Device kernels Host<->device transfer and allocation Host kernels e+06 1e+07 Pixels Fig. 4. Performance scaling with varying image dimensions (n m) with 5832 particles. Thread synchronization points are only inserted were necessary, and the code is optimized by the compilers to the target architecture (see appendix). The host execution time was profiled using a clock() based CPU timer from time.h, which was normalized using the constant CLOCKS_PER_SEC. The device execution time was profiled using two cudaevent_t timers, one measuring the time spent in the entire device code section, including device memory allocation, data transfer to- and from the host, execution of the kernels, and memory deallocation. The other timer only measured time spent in the kernels. The threads were synchronized before stopping the timers. A simple CPU timer using clock() will not work, since control is returned to the host thread before the device code has completed all tasks. Figures 3 and 4 show the profiling results, where the number of particles and the image dimensions were varied. With exception of executions with small image dimensions, the kernel execution time results agree with the O(n m N) scaling predicion. The device memory allocation and data transfer was also profiled, and turns out to be only weakly dependant on the particle numbers (fig. 3), but more strongly correlated to image dimensions (fig. 4). As with kernel execution times, the execution time converges against an overhead value at small image dimensions. The CPU time spent in the host kernels proves to be linear with the particle numbers, and linear with the image dimensions. This is due to the non-existant overhead caused by initialization of the device code, and reduced memory transfer. The ratio between CPU computational times and the sum of the device kernel execution time and the host device memory transfer and additional memory allocation was calculated, and had a mean value of 55.6 and a variance of 739 out of the 11 comparative measurements presented in the figures. It should be noted, that the smallest speedups were recorded when using very small image dimensions, probably unrealistic in real use. As the number of particles are not known by compilation, it is not possible to store particle positions and -radii in constant memory. Shared memory was also on purpose avoided, since the memory per thread block (64 kb) would not be sufficient in rendering of simulations containing containing more than particles ( float4 values). The constant memory was however utilized for storing the camera related parameters; the orthonormal base vectors, the observer position, the image dimensions, the focal length, and the light vector. Previous GPU implementations often rely on k-d trees, constructed as an sorting method for static scene objects[3], [5]. A k-d tree implementation would drastically reduce the global memory access induced by each thread, so it is therefore the next logical step with regards to optimizing the ray tracing algorithm presented here. IV. CONCLUSION This document presented the implementation of a basic ray tracing algorithm, utilizing the highly data-parallel nature of the problem when porting the work load to CUDA. Performance tests showed the expected, linear correlation between image dimensions, particle numbers and execution time. Comparisons with an equivalent CPU algorithm showed large speedups, typically up to two orders of magnitude. This speedup did not come at a cost of less correct results. The final product will come into good use during further development and completion of the CUDA DEM particle model, and is ideal since it can be used for offline rendering on dedicated, heterogeneous GPU-CPU computing nodes. The included device code will be the prefered method of execution, whenever the host system allows it. APPENDIX A TEST ENVIRONMENT The raytracing algorithm was developed, tested and profiled on a mid 2010 Mac Pro with a 2.8 Ghz Quad-Core Intel Xeon CPU and a NVIDIA Quadro 4000 for Mac, dedicated to CUDA applications. The CUDA driver was version , the CUDA compilation tools release 4.0, V The GCC tools were version Each CPU core is multithreaded by two threads for a total of 8 threads.

5 DATA PARALLEL COMPUTING The CUDA source code was compiled with nvcc, and linked to g++ compiled C++ source code with g++. For all benchmark tests, the code was compiled with the following commands: g++ c Wall O3 a r c h x86 64 fopenmp... nvcc c u s e f a s t m a t h gencode a r c h =compute 20, code=sm 20 m64... g++ a r c h x86 64 l c u d a l c u d a r t fopenmp. o o r t When profiling device code performance, the application was executed two times, and the time of the second run was noted. This was performed to avoid latency caused by device driver initialization. The host system was measured to have a memory bandwidth of MB/s when transfering data from the host to the device, and MB/s when transfering data from the device to the host. REFERENCES [1] T. Whitted, An improved illumination model for shaded display, Communications of the ACM, vol. 23, no. 6, pp , [2] NVIDIA, CUDA C Programming Guide, 3rd ed., NVIDIA Corporation: Santa Clara, CA, USA, Oct [3] D. Horn, J. Sugerman, M. Houston, and P. Hanrahan, Interactive k-d tree GPU raytracing, Association for Computing Machinery, Inc., pp pp., [4] M. Shih, Y. Chiu, Y. Chen, and C. Chang, Real-time ray tracing with cuda, Algorithms and Architectures for Parallel Processing, pp , [5] S. Popov, J. Günther, H. Seidel, and P. Slusallek, Stackless kd-tree traversal for high performance gpu ray tracing, in Computer Graphics Forum, vol. 26, no. 3. Wiley Online Library, 2007, pp [6] D. Luebke and S. Parker, Interactive ray tracing with cuda, NVIDIA Technical Presentation, SIGGRAPH, [7] D. Evans, E. Phillips, J. Hiemstra, and C. Auton, Subglacial till: formation, sedimentary characteristics and classification, Earth-Science Reviews, vol. 78, no. 1-2, pp , [8] P. Cundall and O. Strack, A discrete numerical model for granular assemblies, Géotechnique, vol. 29, pp , [9] P. Shirley, M. Ashikhmin, M. Gleicher, S. Marschner, E. Reinhard, K. Sung, W. Thompson, and P. Willemsen, Fundamentals of computer graphics, 3rd ed. AK Peters, [10] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: portable shared memory parallel programming. The MIT Press, 2007, vol. 10. [11] NVIDIA, CUDA C Best Practices Guide Version, 3rd ed., NVIDIA Corporation: Santa Clara, CA, USA, Aug 2010, ca Patent 95,050. [12] L. Nyland, M. Harris, and J. Prins, Fast n-body simulation with cuda, GPU gems, vol. 3, pp , [13] J. Sanders and E. Kandrot, CUDA by example. Addison-Wesley, APPENDIX B SOURCE CODE The entire source code, as well as input data files, can be found in the following archive sphere-rt.tar.gz The source code is built and run with the commands: make make run With the make run command, the Makefile uses ImageMagick to convert the PPM file to PNG format, and the OS X command open to display the image. Other input data files are included with other particle number magnitudes. The syntax for the raytracer application is the following:. / r t <CPU G P U > <sphere b i n a r y. bin> <width> <h e i g h t > <o u t p u t image. ppm> This appendix contains the following source code files: LISTINGS 1 rt kernel.h rt kernel.cu rt kernel cpu.h rt kernel cpu.cpp A. CUDA raytracing source code Listing 1. rt kernel.h 1 # i f n d e f RT KERNEL H 2 # d e f i n e RT KERNEL H 3 4 # i n c l u d e <v e c t o r f u n c t i o n s. h> 5 6 / / C o n s t a n t s 7 c o n s t a n t f l o a t 4 c o n s t u ; 8 c o n s t a n t f l o a t 4 c o n s t v ; 9 c o n s t a n t f l o a t 4 const w ; 10 c o n s t a n t f l o a t 4 c o n s t e y e ; 11 c o n s t a n t f l o a t 4 c o n s t i m g p l a n e ; 12 c o n s t a n t f l o a t c o n s t d ; 13 c o n s t a n t f l o a t 4 c o n s t l i g h t ; / / Host p r o t o t y p e f u n c t i o n s e x t e r n C 19 void c a m e r a I n i t ( f l o a t 4 eye, f l o a t 4 l o o k a t, f l o a t imgw, f l o a t h w r a t i o ) ; e x t e r n C 22 void checkforcudaerrors ( c o n s t char c h e c k p o i n t d e s c r i p t i o n ) ; e x t e r n C 25 i n t r t ( f l o a t 4 p, c o n s t unsigned i n t np, 26 rgb img, const unsigned i n t width, const unsigned i n t height, 27 f3 o r i g o, f3 L, f3 eye, f3 l o o k a t, f l o a t imgw ) ; # e n d i f Listing 2. rt kernel.cu 1 # i n c l u d e <i o s t r e a m > 2 # i n c l u d e <c u t i l m a t h. h> 3 # i n c l u d e h e a d e r. h 4 # i n c l u d e r t k e r n e l. h 5 6 i n l i n e h o s t d e v i c e f l o a t 3 f 4 t o f 3 ( f l o a t 4 i n ) 7 { 8 return m a k e f l o a t 3 ( i n. x, i n. y, i n. z ) ; 9 } i n l i n e h o s t d e v i c e f l o a t 4 f 3 t o f 4 ( f l o a t 3 i n ) 12 { 13 return m a k e f l o a t 4 ( i n. x, i n. y, i n. z, 0. 0 f ) ; 14 } / / K e r n e l f o r i n i t i a l i z i n g image d a t a 17 g l o b a l void i m a g e I n i t ( unsigned char img, unsigned i n t p i x e l s ) 18 { 19 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x 20 i n t x = t h r e a d I d x. x + b l o c k I d x. x blockdim. x ; 21 i n t y = t h r e a d I d x. y + b l o c k I d x. y blockdim. y ; 22 unsigned i n t mempos = x + y blockdim. x griddim. x ; 23 i f ( mempos > p i x e l s ) 24 return ; img [ mempos 4] = 0 ; / / Red channel 27 img [ mempos 4 + 1] = 0 ; / / Green channel 28 img [ mempos 4 + 2] = 0 ; / / Blue channel

6 DATA PARALLEL COMPUTING } / / C a l c u l a t e r a y o r i g i n s and d i r e c t i o n s 32 g l o b a l void r a y I n i t P e r s p e c t i v e ( f l o a t 4 r a y o r i g o, 33 f l o a t 4 r a y d i r e c t i o n, 34 f l o a t 4 eye, 35 unsigned i n t width, 36 unsigned i n t h e i g h t ) 37 { 38 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x 39 i n t i = t h r e a d I d x. x + b l o c k I d x. x blockdim. x ; 40 i n t j = t h r e a d I d x. y + b l o c k I d x. y blockdim. y ; 41 unsigned i n t mempos = i + j blockdim. x griddim. x ; 42 i f ( mempos > width h e i g h t ) 43 return ; / / C a l c u l a t e p i x e l c o o r d i n a t e s i n image p l a n e 46 f l o a t p u = c o n s t i m g p l a n e. x + ( c o n s t i m g p l a n e. y const imgplane. x ) 47 ( i f ) / width ; 48 f l o a t p v = const imgplane. z + ( const imgplane.w const imgplane. z ) 49 ( j f ) / h e i g h t ; / / Write r a y o r i g o and d i r e c t i o n t o g l o b a l memory 52 r a y o r i g o [ mempos ] = c o n s t e y e ; 53 r a y d i r e c t i o n [ mempos ] = c o n s t d const w + p u const u + p v const v ; 54 } / / Check w ether t h e p i x e l s viewing r a y i n t e r s e c t s with t h e s p h e r e s, 57 / / and shade t h e p i x e l c o r r e s p o n d i n g l y 58 g l o b a l void r a y I n t e r s e c t S p h e r e s ( f l o a t 4 r a y o r i g o, 59 f l o a t 4 r a y d i r e c t i o n, 60 f l o a t 4 p, 61 unsigned char img, 62 unsigned i n t p i x e l s, 63 unsigned i n t np ) 64 { 65 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x 66 i n t x = t h r e a d I d x. x + b l o c k I d x. x blockdim. x ; 67 i n t y = t h r e a d I d x. y + b l o c k I d x. y blockdim. y ; 68 unsigned i n t mempos = x + y blockdim. x griddim. x ; 69 i f ( mempos > p i x e l s ) 70 return ; / / Read r a y d a t a from g l o b a l memory 73 f l o a t 3 e = f 4 t o f 3 ( r a y o r i g o [ mempos ] ) ; 74 f l o a t 3 d = f 4 t o f 3 ( r a y d i r e c t i o n [ mempos ] ) ; 75 / / f l o a t s t e p = l e n g t h ( d ) ; / / Distance, in ray steps, between o b j e c t and eye i n i t i a l i z e d with a l a r g e v a l u e 78 f l o a t t d i s t = 1 e10f ; / / S u r f a c e normal a t c l o s e s t s p h e r e i n t e r s e c t i o n 81 f l o a t 3 n ; / / I n t e r s e c t i o n p o i n t c o o r d i n a t e s 84 f l o a t 3 p ; / / I t e r a t e t h r o u g h a l l p a r t i c l e s 87 f o r ( unsigned i n t i =0; i<np ; i ++) { / / Read s p h e r e c o o r d i n a t e and r a d i u s 90 f l o a t 3 c = f 4 t o f 3 ( p [ i ] ) ; 91 f l o a t R = p [ i ]. w; / / C a l c u l a t e t h e d i s c r i m i n a n t : d = Bˆ2 4AC 94 f l o a t D e l t a = ( 2. 0 f d o t ( d, ( e c ) ) ) ( 2. 0 f d o t ( d, ( e c ) ) ) / / Bˆ f d o t ( d, d ) / / 4 A 96 ( d o t ( ( e c ), ( e c ) ) R R) ; / / C / / I f t h e d e t e r m i n a n t i s p o s i t i v e, t h e r e a r e two s o l u t i o n s 99 / / One where t h e l i n e e n t e r s t h e sphere, and one where i t e x i t s 100 i f ( D e l t a > 0. 0 f ) { / / C a l c u l a t e r o o t s, S h i r l e y 2009 p f l o a t t minus = ( ( d o t( d, ( e c ) ) s q r t ( d o t ( d, ( e c ) ) d o t ( d, ( e c ) ) d o t ( d, d ) 104 ( d o t ( ( e c ), ( e c ) ) R R) ) ) / d o t ( d, d ) ) ; / / Check w e t h e r i n t e r s e c t i o n i s c l o s e r t h a n p r e v i o u s v a l u e s 107 i f ( f a b s ( t minus ) < t d i s t ) { 108 p = e + t minus d ; 109 t d i s t = f a b s ( t minus ) ; 110 n = n o r m a l i z e ( 2. 0 f ( p c ) ) ; / / S u r f a c e normal 111 } } / / End of s o l u t i o n b r a n c h } / / End of p a r t i c l e loop / / Write p i x e l c o l o r 118 i f ( t d i s t < 1 e10 ) { / / L a m b e r t i a n s h a d i n g p a r a m e t e r s 121 f l o a t d o t p r o d = f a b s ( d o t ( n, f 4 t o f 3 ( c o n s t l i g h t ) ) ) ; 122 f l o a t I d = f ; / / L i g h t i n t e n s i t y 123 f l o a t k d = 5. 0 f ; / / D i f f u s e c o e f f i c i e n t / / Ambient s h a d i n g 126 f l o a t k a = f ; 127 f l o a t I a = 5. 0 f ; / / Write s h a d i n g model v a l u e s t o p i x e l c o l o r c h a n n e l s 130 img [ mempos 4] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.48 f ) ; 132 img [ mempos 4 + 1] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.41 f ) ; 134 img [ mempos 4 + 2] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.27 f ) ; 136 } 137 } e x t e r n C 141 h o s t void c a m e r a I n i t ( f l o a t 4 eye, f l o a t 4 l o o k a t, f l o a t imgw, f l o a t h w r a t i o ) 142 { 143 / / Image dimensions in world space ( l, r, b, t ) 144 f l o a t 4 imgplane = m a k e f l o a t 4 ( 0.5 f imgw, 0. 5 f imgw, 0.5 f imgw hw ratio, 0. 5 f imgw h w r a t i o ) ; / / The view v e c t o r 147 f l o a t 4 view = eye l o o k a t ; / / C o n s t r u c t t h e camera view o r t h o n o r m a l base 150 f l o a t 4 v = m a k e f l o a t 4 ( 0. 0 f, 1. 0 f, 0. 0 f, 0. 0 f ) ; / / v : P o i n t i n g upward 151 f l o a t 4 w = view / l e n g t h ( view ) ; / / w: P o i n t i n g backwards 152 f l o a t 4 u = m a k e f l o a t 4 ( c r o s s ( m a k e f l o a t 3 ( v. x, v. y, v. z ), 153 m a k e f l o a t 3 (w. x, w. y, w. z ) ), f ) ; / / u : P o i n t i n g r i g h t / / F o c a l l e n g t h 20% of eye v e c t o r l e n g t h 157 f l o a t d = l e n g t h ( view ) 0.8 f ; / / L i g h t d i r e c t i o n ( p o i n t s t o w a r d s l i g h t s o u r c e ) 160 f l o a t 4 l i g h t = n o r m a l i z e ( 1.0 f eye m a k e f l o a t 4 ( 1. 0 f, 0. 2 f, 0. 6 f, 0. 0 f ) ) ; s t d : : c o u t << T r a n s f e r i n g camera v a l u e s t o c o n s t a n t memory\n ; cudamemcpytosymbol ( c o n s t u, &u, s i z e o f ( u ) ) ; 165 cudamemcpytosymbol ( c o n s t v, &v, s i z e o f ( v ) ) ; 166 cudamemcpytosymbol ( const w, &w, s i z e o f (w) ) ; 167 cudamemcpytosymbol ( c o n s t e y e, &eye, s i z e o f ( eye ) ) ; 168 cudamemcpytosymbol ( const imgplane, &imgplane, s i z e o f ( imgplane ) ) ; 169 cudamemcpytosymbol ( c o n s t d, &d, s i z e o f ( d ) ) ; 170 cudamemcpytosymbol ( c o n s t l i g h t, &l i g h t, s i z e o f ( l i g h t ) ) ; 171 } 172

7 DATA PARALLEL COMPUTING / / Check f o r CUDA e r r o r s 174 e xtern C 175 host void checkforcudaerrors ( c o n s t char c h e c k p o i n t d e s c r i p t i o n ) 176 { 177 c u d a E r r o r t e r r = c u d a G e t L a s t E r r o r ( ) ; 178 i f ( e r r!= c u d a S u c c e s s ) { 179 s t d : : c o u t << \ncuda e r r o r d e t e c t e d, c h e c k p o i n t : << c h e c k p o i n t d e s c r i p t i o n 180 << \ n E r r o r s t r i n g : << c u d a G e t E r r o r S t r i n g ( e r r ) << \n ; 181 e x i t ( EXIT FAILURE ) ; 182 } 183 } / / Wrapper f o r t h e r t k e r n e l 187 e xtern C 188 h o s t i n t r t ( f l o a t 4 p, unsigned i n t np, 189 rgb img, unsigned i n t width, unsigned i n t height, 190 f3 o r i g o, f3 L, f3 eye, f3 l o o k a t, f l o a t imgw ) { using s t d : : c o u t ; c o u t << I n i t i a l i z i n g CUDA:\ n ; / / I n i t i a l i z e GPU timestamp r e c o r d e r s 197 f l o a t t1, t 2 ; 198 c u d a E v e n t t t1 go, t2 go, t 1 s t o p, t 2 s t o p ; 199 cudaeventcreate (& t1 go ) ; 200 cudaeventcreate (& t2 go ) ; 201 c u d a E v e n t C r e a t e (& t 2 s t o p ) ; 202 c u d a E v e n t C r e a t e (& t 1 s t o p ) ; / / S t a r t t i m e r cudaeventrecord ( t1 go, 0) ; / / A l l o c a t e memory 208 c o u t << A l l o c a t i n g d e v i c e memory\n ; 209 s t a t i c f l o a t 4 p ; / / P a r t i c l e p o s i t i o n s ( x, y, z ) and r a d i u s (w) 210 s t a t i c unsigned char img ; / / RGBw v a l u e s i n image 211 s t a t i c f l o a t 4 r a y o r i g o ; / / Ray o r i g o ( x, y, z ) 212 s t a t i c f l o a t 4 r a y d i r e c t i o n ; / / Ray d i r e c t i o n ( x, y, z ) 213 cudamalloc ( ( void )& p, np s i z e o f ( f l o a t 4 ) ) ; 214 cudamalloc ( ( void )& img, width h e i g h t 4 s i z e o f ( unsigned char ) ) ; 215 cudamalloc ( ( void )& ray origo, width h e i g h t s i z e o f ( f l o a t 4 ) ) ; 216 cudamalloc ( ( void )& r a y d i r e c t i o n, width h e i g h t s i z e o f ( f l o a t 4 ) ) ; / / T r a n s f e r p a r t i c l e d a t a 219 c o u t << T r a n s f e r i n g p a r t i c l e d a t a : h o s t > d e v i c e \n ; 220 cudamemcpy ( p, p, np s i z e o f ( f l o a t 4 ), cudamemcpyhosttodevice ) ; / / Check f o r e r r o r s a f t e r memory a l l o c a t i o n 223 c h e c k F o r C u d a E r r o r s ( CUDA e r r o r a f t e r memory a l l o c a t i o n ) ; / / Arrange t h r e a d / b l o c k s t r u c t u r e 226 unsigned i n t p i x e l s = width h e i g h t ; 227 f l o a t h w r a t i o = ( f l o a t ) h e i g h t / ( f l o a t ) width ; 228 dim3 t h r e a d s ( 1 6, 1 6 ) ; 229 dim3 b l o c k s ( ( width +15) / 1 6, ( h e i g h t +15) / 1 6 ) ; / / S t a r t t i m e r cudaeventrecord ( t2 go, 0) ; / / I n i t i a l i z e image t o background c o l o r 235 i m a g e I n i t <<< blocks, t h r e a d s >>>( img, p i x e l s ) ; / / I n i t i a l i z e camera 238 c a m e r a I n i t ( m a k e f l o a t 4 ( eye. x, eye. y, eye. z, 0. 0 f ), 239 m a k e f l o a t 4 ( l o o k a t. x, l o o k a t. y, l o o k a t. z, 0. 0 f ), 240 imgw, hw ratio ) ; 241 c h e c k F o r C u d a E r r o r s ( CUDA e r r o r a f t e r c a m e r a I n i t ) ; / / C o n s t r u c t r a y s f o r p e r s p e c t i v e p r o j e c t i o n 244 r a y I n i t P e r s p e c t i v e <<< blocks, t h r e a d s >>>( 245 r a y o r i g o, r a y d i r e c t i o n, 246 m a k e f l o a t 4 ( eye. x, eye. y, eye. z, 0. 0 f ), 247 width, h e i g h t ) ; / / Find c l o s e s t i n t e r s e c t i o n between r a y s and s p h e r e s 250 r a y I n t e r s e c t S p h e r e s <<< blocks, t h r e a d s >>>( 251 r a y o r i g o, r a y d i r e c t i o n, 252 p, img, p i x e l s, np ) ; / / Make s u r e a l l t h r e a d s a r e done b e f o r e c o n t i n u i n g CPU c o n t r o l s e q u e n c e 256 cudathreadsynchronize ( ) ; / / Check f o r e r r o r s 259 c h e c k F o r C u d a E r r o r s ( CUDA e r r o r a f t e r k e r n e l e x e c u t i o n ) ; / / Stop t i m e r cudaeventrecord ( t 2 s t o p, 0) ; 263 c u d a E v e n t S y n c h r o n i z e ( t 2 s t o p ) ; / / T r a n s f e r image d a t a from d e v i c e t o h o s t 266 c o u t << T r a n s f e r i n g image d a t a : d e v i c e > h o s t \n ; 267 cudamemcpy ( img, img, width h e i g h t 4 s i z e o f ( unsigned char ), cudamemcpydevicetohost ) ; / / F ree d y n a m i c a l l y a l l o c a t e d d e v i c e memory 270 c u d a F r e e ( p ) ; 271 cudafree ( img ) ; 272 c u d a F r e e ( r a y o r i g o ) ; 273 c u d a F r e e ( r a y d i r e c t i o n ) ; / / Stop t i m e r cudaeventrecord ( t 1 s t o p, 0) ; 277 c u d a E v e n t S y n c h r o n i z e ( t 1 s t o p ) ; / / C a l c u l a t e t ime s p e n t i n t 1 and t cudaeventelapsedtime (& t1, t1 go, t 1 s t o p ) ; 281 cudaeventelapsedtime (& t2, t2 go, t 2 s t o p ) ; / / R e p o r t time s p e n t 284 c o u t << Time s p e n t on e n t i r e GPU r o u t i n e : 285 << t 1 << ms\n ; 286 cout << Kernels : << t2 << ms\n 287 << Memory a l l o c. and t r a n s f e r : << t1 t 2 << ms\n ; / / R e t u r n s u c c e s s f u l l y 290 return 0 ; 291 } B. CPU raytracing source code Listing 3. rt kernel cpu.h 1 # i f n d e f RT KERNEL CPU H 2 # d e f i n e RT KERNEL CPU H 3 4 # i n c l u d e <v e c t o r f u n c t i o n s. h> 5 6 / / Host p r o t o t y p e f u n c t i o n s 7 8 void c a m e r a I n i t ( f l o a t 3 eye, f l o a t 3 l o o k a t, f l o a t imgw, f l o a t h w r a t i o ) ; 9 10 i n t r t c p u ( f l o a t 4 p, c o n s t unsigned i n t np, 11 rgb img, const unsigned i n t width, const unsigned i n t height, 12 f3 o r i g o, f3 L, f3 eye, f3 l o o k a t, f l o a t imgw ) ; # e n d i f Listing 4. rt kernel cpu.cpp 1 # i n c l u d e <i o s t r e a m > 2 # i n c l u d e <math. h> 3 # i n c l u d e <time. h> 4 # i n c l u d e <v e c t o r f u n c t i o n s. h> 5 # i n c l u d e <c u t i l m a t h. h> 6 # i n c l u d e h e a d e r. h 7 # i n c l u d e r t k e r n e l c p u. h 8 9 / / C o n s t a n t s

8 DATA PARALLEL COMPUTING f l o a t 3 c o n s t c u ; 11 f l o a t 3 c o n s t c v ; 12 f l o a t 3 constc w ; 13 f l o a t 3 c o n s t c e y e ; 14 f l o a t 4 c o n s t c i m g p l a n e ; 15 f l o a t c o n s t c d ; 16 f l o a t 3 c o n s t c l i g h t ; i n l i n e f l o a t 3 f 4 t o f 3 ( f l o a t 4 i n ) 19 { 20 return m a k e f l o a t 3 ( i n. x, i n. y, i n. z ) ; 21 } i n l i n e f l o a t 4 f 3 t o f 4 ( f l o a t 3 i n ) 24 { 25 return m a k e f l o a t 4 ( i n. x, i n. y, i n. z, 0. 0 f ) ; 26 } i n l i n e f l o a t l e n g t h f 3 ( f l o a t 3 i n ) 29 { 30 return s q r t ( i n. x i n. x + i n. y i n. y + i n. z i n. z ) ; 31 } / / K ernel f o r i n i t i a l i z i n g image d a t a 34 void i m a g e I n i t c p u ( unsigned char img, unsigned i n t p i x e l s ) 35 { 36 f o r ( unsigned i n t mempos =0; mempos<p i x e l s ; mempos++) { 37 img [ mempos 4] = 0 ; / / Red channel 38 img [ mempos 4 + 1] = 0 ; / / Green channel 39 img [ mempos 4 + 2] = 0 ; / / Blue channel 40 } 41 } / / C a l c u l a t e r a y o r i g i n s and d i r e c t i o n s 44 void r a y I n i t P e r s p e c t i v e c p u ( f l o a t 3 r a y o r i g o, 45 f l o a t 3 r a y d i r e c t i o n, 46 f l o a t 3 eye, 47 unsigned i n t width, 48 unsigned i n t h e i g h t ) 49 { 50 i n t i ; 51 # pragma omp p a r a l l e l f o r 52 f o r ( i =0; i<width ; i ++) { 53 f o r ( unsigned i n t j =0; j<h e i g h t ; j ++) { unsigned i n t mempos = i + j h e i g h t ; / / C a l c u l a t e p i x e l c o o r d i n a t e s i n image p l a n e 58 f l o a t p u = c o n s t c i m g p l a n e. x + ( c o n s t c i m g p l a n e. y c o n s t c i m g p l a n e. x ) 59 ( i f ) / width ; 60 f l o a t p v = c o n s t c i m g p l a n e. z + ( constc imgplane.w constc imgplane. z ) 61 ( j f ) / h e i g h t ; / / Write r a y o r i g o and d i r e c t i o n t o g l o b a l memory 64 r a y o r i g o [ mempos ] = c o n s t c e y e ; 65 r a y d i r e c t i o n [ mempos ] = c o n s t c d constc w + p u constc u + p v constc v ; 66 } 67 } 68 } / / Check w ether t h e p i x e l s viewing r a y i n t e r s e c t s with t h e s p h e r e s, 71 / / and shade t h e p i x e l c o r r e s p o n d i n g l y 72 void r a y I n t e r s e c t S p h e r e s c p u ( f l o a t 3 r a y o r i g o, 73 f l o a t 3 r a y d i r e c t i o n, 74 f l o a t 4 p, 75 unsigned char img, 76 unsigned i n t p i x e l s, 77 unsigned i n t np ) 78 { 79 i n t mempos ; 80 # pragma omp p a r a l l e l f o r 81 f o r ( mempos =0; mempos<p i x e l s ; mempos++) { / / Read r a y d a t a from g l o b a l memory 84 f l o a t 3 e = r a y o r i g o [ mempos ] ; 85 f l o a t 3 d = r a y d i r e c t i o n [ mempos ] ; 86 / / f l o a t s t e p = l e n g t h f 3 ( d ) ; / / Distance, in ray steps, between o b j e c t and eye i n i t i a l i z e d with a l a r g e v a l u e 89 f l o a t t d i s t = 1 e10f ; / / S u r f a c e normal a t c l o s e s t s p h e r e i n t e r s e c t i o n 92 f l o a t 3 n ; / / I n t e r s e c t i o n p o i n t c o o r d i n a t e s 95 f l o a t 3 p ; / / I t e r a t e t h r o u g h a l l p a r t i c l e s 98 f o r ( unsigned i n t i =0; i<np ; i ++) { / / Read s p h e r e c o o r d i n a t e and r a d i u s 101 f l o a t 3 c = f 4 t o f 3 ( p [ i ] ) ; 102 f l o a t R = p [ i ]. w; / / C a l c u l a t e t h e d i s c r i m i n a n t : d = Bˆ2 4AC 105 f l o a t D e l t a = ( 2. 0 f d o t ( d, ( e c ) ) ) ( 2. 0 f d o t ( d, ( e c ) ) ) / / Bˆ f d o t ( d, d ) / / 4 A 107 ( d o t ( ( e c ), ( e c ) ) R R) ; / / C / / I f t h e d e t e r m i n a n t i s p o s i t i v e, t h e r e a r e two s o l u t i o n s 110 / / One where t h e l i n e e n t e r s t h e sphere, and one where i t e x i t s 111 i f ( D e l t a > 0. 0 f ) { / / C a l c u l a t e r o o t s, S h i r l e y 2009 p f l o a t t minus = ( ( d o t( d, ( e c ) ) s q r t ( d o t ( d, ( e c ) ) d o t ( d, ( e c ) ) d o t ( d, d ) 115 ( d o t ( ( e c ), ( e c ) ) R R) ) ) / d o t ( d, d ) ) ; / / Check w e t h e r i n t e r s e c t i o n i s c l o s e r t h a n p r e v i o u s v a l u e s 118 i f ( f a b s ( t minus ) < t d i s t ) { 119 p = e + t minus d ; 120 t d i s t = f a b s ( t minus ) ; 121 n = n o r m a l i z e ( 2. 0 f ( p c ) ) ; / / S u r f a c e normal 122 } } / / End of s o l u t i o n b r a n c h } / / End of p a r t i c l e loop / / Write p i x e l c o l o r 129 i f ( t d i s t < 1 e10 ) { / / L a m b e r t i a n s h a d i n g p a r a m e t e r s 132 f l o a t d o t p r o d = f a b s ( d o t ( n, c o n s t c l i g h t ) ) ; 133 f l o a t I d = f ; / / L i g h t i n t e n s i t y 134 f l o a t k d = 5. 0 f ; / / D i f f u s e c o e f f i c i e n t / / Ambient s h a d i n g 137 f l o a t k a = f ; 138 f l o a t I a = 5. 0 f ; / / Write s h a d i n g model v a l u e s t o p i x e l c o l o r c h a n n e l s 141 img [ mempos 4] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.48 f ) ; 143 img [ mempos 4 + 1] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.41 f ) ; 145 img [ mempos 4 + 2] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.27 f ) ; 147 } 148 } 149 } void c a m e r a I n i t c p u ( f l o a t 3 eye, f l o a t 3 l o o k a t, f l o a t imgw, f l o a t h w r a t i o ) 153 { 154 / / Image dimensions in world space ( l, r, b, t ) 155 f l o a t 4 imgplane = m a k e f l o a t 4 ( 0.5 f imgw, 0. 5 f imgw, 0.5 f imgw hw ratio, 0. 5 f imgw h w r a t i o ) ; / / The view v e c t o r 158 f l o a t 3 view = eye l o o k a t ; / / C o n s t r u c t t h e camera view o r t h o n o r m a l base

9 DATA PARALLEL COMPUTING f l o a t 3 v = m a k e f l o a t 3 ( 0. 0 f, 1. 0 f, 0. 0 f ) ; / / v : P o i n t i n g upward 162 f l o a t 3 w = view / l e n g t h f 3 ( view ) ; / / w: P o i n t i n g backwards 163 f l o a t 3 u = c r o s s ( m a k e f l o a t 3 ( v. x, v. y, v. z ), m a k e f l o a t 3 (w. x, w. y, w. z ) ) ; / / u : P o i n t i n g r i g h t / / F o c a l l e n g t h 20% of eye v e c t o r l e n g t h 166 f l o a t d = l e n g t h f 3 ( view ) 0.8 f ; / / L i g h t d i r e c t i o n ( p o i n t s t o w a r d s l i g h t s o u r c e ) 169 f l o a t 3 l i g h t = n o r m a l i z e ( 1.0 f eye m a k e f l o a t 3 ( 1. 0 f, 0. 2 f, 0. 6 f ) ) ; s t d : : c o u t << T r a n s f e r i n g camera v a l u e s t o c o n s t a n t memory\n ; constc u = u ; 174 constc v = v ; 175 constc w = w; 176 c o n s t c e y e = eye ; 177 constc imgplane = imgplane ; 178 constc d = d ; 179 c o n s t c l i g h t = l i g h t ; 180 } / / Wrapper f o r t h e r t a l g o r i t h m 184 i n t r t c p u ( f l o a t 4 p, unsigned i n t np, 185 rgb img, unsigned i n t width, unsigned i n t height, 186 f3 o r i g o, f3 L, f3 eye, f3 l o o k a t, f l o a t imgw ) { using s t d : : c o u t ; c o u t << I n i t i a l i z i n g CPU r a y t r a c e r :\ n ; / / I n i t i a l i z e GPU timestamp r e c o r d e r s 193 f l o a t t1 go, t2 go, t 1 s t o p, t 2 s t o p ; / / S t a r t t i m e r t1 go = clock ( ) ; / / A l l o c a t e memory 199 c o u t << A l l o c a t i n g d e v i c e memory\n ; 200 s t a t i c unsigned char img ; / / RGBw v a l u e s i n image 201 s t a t i c f l o a t 3 r a y o r i g o ; / / Ray o r i g o ( x, y, z ) 202 s t a t i c f l o a t 3 r a y d i r e c t i o n ; / / Ray d i r e c t i o n ( x, y, z ) 203 img = new unsigned char [ width h e i g h t 4 ] ; 204 r a y o r i g o = new f l o a t 3 [ width h e i g h t ] ; 205 r a y d i r e c t i o n = new f l o a t 3 [ width h e i g h t ] ; / / Arrange t h r e a d / b l o c k s t r u c t u r e 208 unsigned i n t p i x e l s = width h e i g h t ; 209 f l o a t h w r a t i o = ( f l o a t ) h e i g h t / ( f l o a t ) width ; / / S t a r t t i m e r t2 go = clock ( ) ; / / I n i t i a l i z e image t o background c o l o r 215 i m a g e I n i t c p u ( img, p i x e l s ) ; / / I n i t i a l i z e camera 218 c a m e r a I n i t c p u ( m a k e f l o a t 3 ( eye. x, eye. y, eye. z ), 219 m a k e f l o a t 3 ( l o o k a t. x, l o o k a t. y, l o o k a t. z ), 220 imgw, hw ratio ) ; / / C o n s t r u c t r a y s f o r p e r s p e c t i v e p r o j e c t i o n 223 r a y I n i t P e r s p e c t i v e c p u ( 224 r a y o r i g o, r a y d i r e c t i o n, 225 m a k e f l o a t 3 ( eye. x, eye. y, eye. z ), 226 width, h e i g h t ) ; / / Find c l o s e s t i n t e r s e c t i o n between r a y s and s p h e r e s 229 r a y I n t e r s e c t S p h e r e s c p u ( 230 r a y o r i g o, r a y d i r e c t i o n, 231 p, img, p i x e l s, np ) ; / / Stop t i m e r t 2 s t o p = c l o c k ( ) ; memcpy ( img, img, s i z e o f ( unsigned char ) p i x e l s 4) ; / / F ree d y n a m i c a l l y a l l o c a t e d d e v i c e memory 239 d e l e t e [ ] img ; 240 d e l e t e [ ] r a y o r i g o ; 241 d e l e t e [ ] r a y d i r e c t i o n ; / / Stop t i m e r t 1 s t o p = c l o c k ( ) ; / / R e p o r t time s p e n t 247 c o u t << Time s p e n t on e n t i r e CPU r a y t r a c i n g r o u t i n e : 248 << ( t 1 s t o p t1 go ) / CLOCKS PER SEC << ms\n ; 249 c o u t << F u n c t i o n s : << ( t 2 s t o p t2 go ) / CLOCKS PER SEC << ms\n ; / / R e t u r n s u c c e s s f u l l y 252 return 0 ; 253 }

DATA PARALLEL COMPUTING 2011 10 A PPENDIX C B LOOPERS This section contains pictures of

however unusable, results. Fig. 5. An end result encountered way too many times, e.g. after inadvertently dividing the focal length with zero.

Colorful result caused by forgetting to normalize the normal vector calculation.

Another colorful result with wrong order of particles, cause by using abs() (meant for

10 DATA PARALLEL COMPUTING A PPENDIX C B LOOPERS This section contains pictures of the unfinished raytracer in a malfunctioning state, which can provide interesting, however unusable, results. Fig. 5. An end result encountered way too many times, e.g. after inadvertently dividing the focal length with zero. Fig. 6. Fig. 8. Colorful result caused by forgetting to normalize the normal vector calculation. Twisted appearance caused by wrong thread/block layout. Fig. 9. Another colorful result with wrong order of particles, cause by using abs() (meant for integers) instead of fabs() (meant for floats). Fig. 7. A problem with the distance determination caused the particles to be displayed in a wrong order.

High-performance processing and development with Madagascar. July 24, 2010 Madagascar development team

High-performance processing and development with Madagascar July 24, 2010 Madagascar development team Outline 1 HPC terminology and frameworks 2 Utilizing data parallelism 3 HPC development with Madagascar