VISUALIZING systems containing many spheres using
|
|
- Ilene Craig
- 6 years ago
- Views:
Transcription
1 DATA PARALLEL COMPUTING CUDA raytracing algorithm for visualizing discrete element model output Anders Damsgaard Christensen, Abstract A raytracing algorithm is constructed using the CUDA API for visualizing output from a CUDA discrete element model, which outputs spatial information in dynamic particle systems. The raytracing algorithm is optimized with constant memory and compilation flags, and performance is measured as a function of the number of particles and the number of pixels. The execution time is compared to equivalent CPU code, and the speedup under a variety of conditions is found to have a mean value of 55.6 times. Index Terms CUDA, discrete element method, raytracing I. INTRODUCTION VISUALIZING systems containing many spheres using traditional object-order graphics rendering can often result in very high computational requirements, as the usual automated approach is to construct a meshed surface with a specified resolution for each sphere. The memory requirements are thus quite high, as each surface will consist of many vertices. Raytracing [1] is a viable alternative, where spheric entities are saved as data structures with a centre coordinate and a radius. The rendering is performed on the base of these values, which results in a perfectly smooth surfaced sphere. To accelerate the rendering, the algorithm is constructed utilizing the CUDA API [2], where the problem is divided into n m threads, corresponding to the desired output image resolution. Each thread iterates through all particles and applies a simple shading model to determine the final RGB values of the pixel. Previous studies of GPU or CUDA implementations of ray tracing algorithms reported major speedups, compared to corresponding CPU applications (e.g. [3], [4], [5], [6]). None of the software was however found to be open-source and GPL licensed, so a simple raytracer was constructed, customized to render particles, where the data was stored in a specific data format. A. Discrete Element Method The input particle data to the raytracer is the output of a custom CUDA-based Discrete Element Method (DEM) application currently in development. The DEM model is used to numerically simulate the response of a drained, soft, granular sediment bed upon normal stresses and shearing velocities similar to subglacial environments under ice streams [7]. In contrast to laboratory experiments on granular material, the discrete element method [8] approach allows close monitoring of the progressive deformation, where all involved physical Contact: anders.damsgaard@geo.au.dk Webpage: Manuscript, last revision: October 18, parameters of the particles and spatial boundaries are readily available for continuous inspection. The discrete element method (DEM) is a subtype of molecular dynamics (MD), and discretizes time into sufficiently small timesteps, and treats the granular material as discrete grains, interacting through contact forces. Between time steps, the particles are allowed to overlap slightly, and the magnitude of the overlap and the kinematic states of the particles is used to compute normal- and shear components of the contact force. The particles are treated as spherical entities, which simplifies the contact search. The spatial simulation domain is divided using a homogeneous, uniform, cubic grid, which greatly reduces the amount of possible contacts that are checked during each timestep. The grid-particle list is sorted using Thrust 1, and updated each timestep. The new particle positions and kinematic values are updated by inserting the resulting force and torque into Newton s second law, and using a Taylorbased second order integration scheme to calculate new linear and rotational accelerations, velocities and positions. B. Application usage The CUDA DEM application is a command line executable, and writes updated particle information to custom binary files with a specific interval. This raytracing algorithm is constructed to also run from the command line, be noninteractive, and write output images in the PPM image format. This format is chosen to allow rendering to take place on cluster nodes with CUDA compatible devices. Both the CUDA DEM and raytracing applications are opensource 2, although still under heavy development. This document consists of a short introduction to the basic mathematics behind the ray tracing algorithm, an explaination of the implementation using the CUDA API [2] and a presentation of the results. The CUDA device source code and C++ host source code for the ray tracing algorithm can be found in the appendix, along with instructions for compilation and execution of the application. II. RAY TRACING ALGORITHM The goal of the ray tracing algorithm is to compute the shading of each pixel in the image [9]. This is performed by creating a viewing ray from the eye into the scene, finding the closest intersection with a scene object, and computing the resulting color. The general structure of the program is demonstrated in the following pseudo-code:
2 DATA PARALLEL COMPUTING f o r each p i x e l do compute viewing r a y o r i g i n and d i r e c t i o n i t e r a t e t h r o u g h o b j e c t s and f i n d t h e c l o s e s t h i t s e t p i x e l c o l o r t o v a l u e computed from h i t p o i n t, l i g h t, n The implemented code does not utilize recursive rays, since the modeled material grains are matte in appearance. A. Ray generation The rays are in vector form defined as: p(t) = e + t(s e) (1) The perspective can be either orthograpic, where all viewing rays have the same direction, but different starting points, or use perspective projection, where the starting point is the same, but the direction is slightly different [9]. For the purposes of this application, a perspective projection was chosen, as it results in the most natural looking image. The ray data structures were held flexible enough to allow an easy implementation of orthographic perspective, if this is desired at a later point. The ray origin e is the position of the eye, and is constant. The direction is unique for each ray, and is computed using: s e = dw + uu + vv (2) where {u, v, w} are the orthonormal bases of the camera coordinate system, and d is the focal length [9]. The camera coordinates of pixel (i, j) in the image plane, u and v, are calculated by: u = l + (r l)(i + 0.5)/n v = b + (t b)(j + 0.5)/m where l, r, t and b are the positions of the image borders (left, right, top and bottom) in camera space. The values n and m are the number of pixels in each dimension. B. Ray-sphere intersection Given a sphere with a center c, and radius R, a equation can be constrained, where p are all points placed on the sphere surface: (p c) (p c) R 2 = 0 (3) By substituting the points p with ray equation 1, and rearranging the terms, a quadratic equation emerges: (d d)t 2 + 2d (e c)t + (e c) (e c) R 2 = 0 (4) The number of ray steps t is the only unknown, so the number of intersections is found by calculating the determinant: = (2(d (e c))) 2 4(d d)((e c) (e c) R 2 (5) A negative value denotes no intersection between the sphere and the ray, a value of zero means that the ray touches the sphere at a single point (ignored in this implementation), and a positive value denotes that there are two intersections, one when the ray enters the sphere, and one when it exits. In the code, a conditional branch checks wether the determinant is positive. If this is the case, the distance to the intersection in ray steps is calculated using: where t = d (e c) ± η (d d) η = (d (e c)) 2 (d d)((e c) (e c) R 2 ) Only the smallest intersection (t minus ) is calculated, since this marks the point where the sphere enters the particle. If this value is smaller than previous intersection distances, the intersection point p and surface normal n at the intersection point is calculated: (6) p = e + t minus d (7) n = 2(p c) (8) The intersection distance in vector steps (t minus ) is saved in order to allow comparison of the distance with later intersections. C. Pixel shading The pixel is shaded using Lambertian shading [9], where the pixel color is proportional to the angle between the light vector (l) and the surface normal. An ambient shading component is added to simulate global illumination, and prevent that the spheres are completely black: L = k a I a + k d I d max(0, (n l)) (9) where the a and d subscripts denote the ambient and diffusive (Lambertian) components of the ambient/diffusive coefficients (k) and light intensities (I). The pixel color L is calculated once per color channel. D. Computational implementation The above routines were first implemented in CUDA for device execution, and afterwards ported to a CPU C++ equivalent, used for comparing performance. The CPU raytracing algorithm was optimized to shared-memory parallelism using OpenMP [10]. The execution method can be chosen when launching the raytracer from the command line, see the appendix for details. In the CPU implementation, all data was stored in linear arrays of the right size, ensuring 100% memory efficiency. III. CUDA IMPLEMENTATION When constructing the algorithm for execution on the GPGPU device, the data-parallel nature of the problem (SIMD: single instruction, multiple data) is used to deconstruct the rendering task into a single thread per pixel. Each thread iterates through all particles, and ends up writing the resulting color to the image memory. The application starts by reading the discrete element method data from a custom binary file. The particle data, consisting of position vectors in three-dimensional Euclidean space (R 3 ) and particle radii, is stored together in a float4 array, with the particle radius in the w position. This has
3 DATA PARALLEL COMPUTING values. Afterwards, all particle data is transfered from the hostto the device memory. All pixel values are initialized to [R, G, B] = [0, 0, 0], which serves as the image background color. Afterwards, a kernel is executed with a thread for each pixel, testing for intersections between the pixel s viewing ray and all particles, and returning the closest particle. This information is used when computing the shading of the pixel. After all pixel values have been computed, the image data is transfered back to the host memory, and written to the disk. The application ends by liberating dynamically allocated memory on both the device and the host. Fig. 1. Sample output of GPU raytracer rendering of 512 particles. A. Thread and block layout The thread/block layout passed during kernel launches is arranged in the following manner: dim3 t h r e a d s ( 1 6, 16) ; dim3 b l o c k s ( ( width +15) / 1 6, ( h e i g h t +15) / 1 6 ) ; The image pixel position of the thread can be determined from the thread- and block index and dimensions. The layout corresponds to a thread tile size of 256, and a dynamic number of blocks, ensured to fit the image dimensions with only small eventual redundancy [13]. Since this method will initialize extra threads in most situations, all kernels (with return type void) start by checking wether the thread-/block index actually falls inside of the image dimensions: i n t i = t h r e a d I d x. x + b l o c k I d x. x blockdim. x ; i n t j = t h r e a d I d x. y + b l o c k I d x. y blockdim. y ; unsigned i n t mempos = x + y blockdim. x griddim. x ; i f ( mempos > p i x e l s ) return ; The linear memory position (mempos) is used as the index when reading or writing to the linear arrays residing in global device memory. Fig. 2. Sample output of GPU raytracer rendering of particles. B. Image output After completing all pixel shading computations on the device, the image data is transfered back to the host memory, and together with a header written to a PPM 3 image file. This file is converted to the PNG format using ImageMagick. large advantages to storing the data in separate float3 and float arrays; Using float4 (instead of float3) data allows coalesced memory access [11] to the arrays of data in device memory, resulting in efficient memory requests and transfers [12], and the data access pattern is coherent and convenient. Other three-component vectors were also stored as float4 for the same reasons, even though this sometimes caused a slight memory redundancy. The image data is saved in a three-channel linear unsigned char array. The algorithm starts by allocating memory on the device for the particle data, the ray parameters, and the image RGB C. Performance Since this simple raytracing algorithm generates a single non-recursive ray for each pixel, which in turn checks all spheres for intersection, the application is expected to scale in the form of O(n m N), where n and m are the output image dimensions in pixels, and N is the number of particles. The data transfer between the host and device is kept at a bare minimum, as the intercommunication is considered a bottleneck in relation to the potential device performance[11]. 3
4 DATA PARALLEL COMPUTING Time [ms] 1e Device kernels Host<->device transfer and allocation Host kernels Particles Fig. 3. Performance scaling with varying particle numbers at image dimensions 800 by 800 pixels. Time [ms] 1e Device kernels Host<->device transfer and allocation Host kernels e+06 1e+07 Pixels Fig. 4. Performance scaling with varying image dimensions (n m) with 5832 particles. Thread synchronization points are only inserted were necessary, and the code is optimized by the compilers to the target architecture (see appendix). The host execution time was profiled using a clock() based CPU timer from time.h, which was normalized using the constant CLOCKS_PER_SEC. The device execution time was profiled using two cudaevent_t timers, one measuring the time spent in the entire device code section, including device memory allocation, data transfer to- and from the host, execution of the kernels, and memory deallocation. The other timer only measured time spent in the kernels. The threads were synchronized before stopping the timers. A simple CPU timer using clock() will not work, since control is returned to the host thread before the device code has completed all tasks. Figures 3 and 4 show the profiling results, where the number of particles and the image dimensions were varied. With exception of executions with small image dimensions, the kernel execution time results agree with the O(n m N) scaling predicion. The device memory allocation and data transfer was also profiled, and turns out to be only weakly dependant on the particle numbers (fig. 3), but more strongly correlated to image dimensions (fig. 4). As with kernel execution times, the execution time converges against an overhead value at small image dimensions. The CPU time spent in the host kernels proves to be linear with the particle numbers, and linear with the image dimensions. This is due to the non-existant overhead caused by initialization of the device code, and reduced memory transfer. The ratio between CPU computational times and the sum of the device kernel execution time and the host device memory transfer and additional memory allocation was calculated, and had a mean value of 55.6 and a variance of 739 out of the 11 comparative measurements presented in the figures. It should be noted, that the smallest speedups were recorded when using very small image dimensions, probably unrealistic in real use. As the number of particles are not known by compilation, it is not possible to store particle positions and -radii in constant memory. Shared memory was also on purpose avoided, since the memory per thread block (64 kb) would not be sufficient in rendering of simulations containing containing more than particles ( float4 values). The constant memory was however utilized for storing the camera related parameters; the orthonormal base vectors, the observer position, the image dimensions, the focal length, and the light vector. Previous GPU implementations often rely on k-d trees, constructed as an sorting method for static scene objects[3], [5]. A k-d tree implementation would drastically reduce the global memory access induced by each thread, so it is therefore the next logical step with regards to optimizing the ray tracing algorithm presented here. IV. CONCLUSION This document presented the implementation of a basic ray tracing algorithm, utilizing the highly data-parallel nature of the problem when porting the work load to CUDA. Performance tests showed the expected, linear correlation between image dimensions, particle numbers and execution time. Comparisons with an equivalent CPU algorithm showed large speedups, typically up to two orders of magnitude. This speedup did not come at a cost of less correct results. The final product will come into good use during further development and completion of the CUDA DEM particle model, and is ideal since it can be used for offline rendering on dedicated, heterogeneous GPU-CPU computing nodes. The included device code will be the prefered method of execution, whenever the host system allows it. APPENDIX A TEST ENVIRONMENT The raytracing algorithm was developed, tested and profiled on a mid 2010 Mac Pro with a 2.8 Ghz Quad-Core Intel Xeon CPU and a NVIDIA Quadro 4000 for Mac, dedicated to CUDA applications. The CUDA driver was version , the CUDA compilation tools release 4.0, V The GCC tools were version Each CPU core is multithreaded by two threads for a total of 8 threads.
5 DATA PARALLEL COMPUTING The CUDA source code was compiled with nvcc, and linked to g++ compiled C++ source code with g++. For all benchmark tests, the code was compiled with the following commands: g++ c Wall O3 a r c h x86 64 fopenmp... nvcc c u s e f a s t m a t h gencode a r c h =compute 20, code=sm 20 m64... g++ a r c h x86 64 l c u d a l c u d a r t fopenmp. o o r t When profiling device code performance, the application was executed two times, and the time of the second run was noted. This was performed to avoid latency caused by device driver initialization. The host system was measured to have a memory bandwidth of MB/s when transfering data from the host to the device, and MB/s when transfering data from the device to the host. REFERENCES [1] T. Whitted, An improved illumination model for shaded display, Communications of the ACM, vol. 23, no. 6, pp , [2] NVIDIA, CUDA C Programming Guide, 3rd ed., NVIDIA Corporation: Santa Clara, CA, USA, Oct [3] D. Horn, J. Sugerman, M. Houston, and P. Hanrahan, Interactive k-d tree GPU raytracing, Association for Computing Machinery, Inc., pp pp., [4] M. Shih, Y. Chiu, Y. Chen, and C. Chang, Real-time ray tracing with cuda, Algorithms and Architectures for Parallel Processing, pp , [5] S. Popov, J. Günther, H. Seidel, and P. Slusallek, Stackless kd-tree traversal for high performance gpu ray tracing, in Computer Graphics Forum, vol. 26, no. 3. Wiley Online Library, 2007, pp [6] D. Luebke and S. Parker, Interactive ray tracing with cuda, NVIDIA Technical Presentation, SIGGRAPH, [7] D. Evans, E. Phillips, J. Hiemstra, and C. Auton, Subglacial till: formation, sedimentary characteristics and classification, Earth-Science Reviews, vol. 78, no. 1-2, pp , [8] P. Cundall and O. Strack, A discrete numerical model for granular assemblies, Géotechnique, vol. 29, pp , [9] P. Shirley, M. Ashikhmin, M. Gleicher, S. Marschner, E. Reinhard, K. Sung, W. Thompson, and P. Willemsen, Fundamentals of computer graphics, 3rd ed. AK Peters, [10] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: portable shared memory parallel programming. The MIT Press, 2007, vol. 10. [11] NVIDIA, CUDA C Best Practices Guide Version, 3rd ed., NVIDIA Corporation: Santa Clara, CA, USA, Aug 2010, ca Patent 95,050. [12] L. Nyland, M. Harris, and J. Prins, Fast n-body simulation with cuda, GPU gems, vol. 3, pp , [13] J. Sanders and E. Kandrot, CUDA by example. Addison-Wesley, APPENDIX B SOURCE CODE The entire source code, as well as input data files, can be found in the following archive sphere-rt.tar.gz The source code is built and run with the commands: make make run With the make run command, the Makefile uses ImageMagick to convert the PPM file to PNG format, and the OS X command open to display the image. Other input data files are included with other particle number magnitudes. The syntax for the raytracer application is the following:. / r t <CPU G P U > <sphere b i n a r y. bin> <width> <h e i g h t > <o u t p u t image. ppm> This appendix contains the following source code files: LISTINGS 1 rt kernel.h rt kernel.cu rt kernel cpu.h rt kernel cpu.cpp A. CUDA raytracing source code Listing 1. rt kernel.h 1 # i f n d e f RT KERNEL H 2 # d e f i n e RT KERNEL H 3 4 # i n c l u d e <v e c t o r f u n c t i o n s. h> 5 6 / / C o n s t a n t s 7 c o n s t a n t f l o a t 4 c o n s t u ; 8 c o n s t a n t f l o a t 4 c o n s t v ; 9 c o n s t a n t f l o a t 4 const w ; 10 c o n s t a n t f l o a t 4 c o n s t e y e ; 11 c o n s t a n t f l o a t 4 c o n s t i m g p l a n e ; 12 c o n s t a n t f l o a t c o n s t d ; 13 c o n s t a n t f l o a t 4 c o n s t l i g h t ; / / Host p r o t o t y p e f u n c t i o n s e x t e r n C 19 void c a m e r a I n i t ( f l o a t 4 eye, f l o a t 4 l o o k a t, f l o a t imgw, f l o a t h w r a t i o ) ; e x t e r n C 22 void checkforcudaerrors ( c o n s t char c h e c k p o i n t d e s c r i p t i o n ) ; e x t e r n C 25 i n t r t ( f l o a t 4 p, c o n s t unsigned i n t np, 26 rgb img, const unsigned i n t width, const unsigned i n t height, 27 f3 o r i g o, f3 L, f3 eye, f3 l o o k a t, f l o a t imgw ) ; # e n d i f Listing 2. rt kernel.cu 1 # i n c l u d e <i o s t r e a m > 2 # i n c l u d e <c u t i l m a t h. h> 3 # i n c l u d e h e a d e r. h 4 # i n c l u d e r t k e r n e l. h 5 6 i n l i n e h o s t d e v i c e f l o a t 3 f 4 t o f 3 ( f l o a t 4 i n ) 7 { 8 return m a k e f l o a t 3 ( i n. x, i n. y, i n. z ) ; 9 } i n l i n e h o s t d e v i c e f l o a t 4 f 3 t o f 4 ( f l o a t 3 i n ) 12 { 13 return m a k e f l o a t 4 ( i n. x, i n. y, i n. z, 0. 0 f ) ; 14 } / / K e r n e l f o r i n i t i a l i z i n g image d a t a 17 g l o b a l void i m a g e I n i t ( unsigned char img, unsigned i n t p i x e l s ) 18 { 19 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x 20 i n t x = t h r e a d I d x. x + b l o c k I d x. x blockdim. x ; 21 i n t y = t h r e a d I d x. y + b l o c k I d x. y blockdim. y ; 22 unsigned i n t mempos = x + y blockdim. x griddim. x ; 23 i f ( mempos > p i x e l s ) 24 return ; img [ mempos 4] = 0 ; / / Red channel 27 img [ mempos 4 + 1] = 0 ; / / Green channel 28 img [ mempos 4 + 2] = 0 ; / / Blue channel
6 DATA PARALLEL COMPUTING } / / C a l c u l a t e r a y o r i g i n s and d i r e c t i o n s 32 g l o b a l void r a y I n i t P e r s p e c t i v e ( f l o a t 4 r a y o r i g o, 33 f l o a t 4 r a y d i r e c t i o n, 34 f l o a t 4 eye, 35 unsigned i n t width, 36 unsigned i n t h e i g h t ) 37 { 38 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x 39 i n t i = t h r e a d I d x. x + b l o c k I d x. x blockdim. x ; 40 i n t j = t h r e a d I d x. y + b l o c k I d x. y blockdim. y ; 41 unsigned i n t mempos = i + j blockdim. x griddim. x ; 42 i f ( mempos > width h e i g h t ) 43 return ; / / C a l c u l a t e p i x e l c o o r d i n a t e s i n image p l a n e 46 f l o a t p u = c o n s t i m g p l a n e. x + ( c o n s t i m g p l a n e. y const imgplane. x ) 47 ( i f ) / width ; 48 f l o a t p v = const imgplane. z + ( const imgplane.w const imgplane. z ) 49 ( j f ) / h e i g h t ; / / Write r a y o r i g o and d i r e c t i o n t o g l o b a l memory 52 r a y o r i g o [ mempos ] = c o n s t e y e ; 53 r a y d i r e c t i o n [ mempos ] = c o n s t d const w + p u const u + p v const v ; 54 } / / Check w ether t h e p i x e l s viewing r a y i n t e r s e c t s with t h e s p h e r e s, 57 / / and shade t h e p i x e l c o r r e s p o n d i n g l y 58 g l o b a l void r a y I n t e r s e c t S p h e r e s ( f l o a t 4 r a y o r i g o, 59 f l o a t 4 r a y d i r e c t i o n, 60 f l o a t 4 p, 61 unsigned char img, 62 unsigned i n t p i x e l s, 63 unsigned i n t np ) 64 { 65 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x 66 i n t x = t h r e a d I d x. x + b l o c k I d x. x blockdim. x ; 67 i n t y = t h r e a d I d x. y + b l o c k I d x. y blockdim. y ; 68 unsigned i n t mempos = x + y blockdim. x griddim. x ; 69 i f ( mempos > p i x e l s ) 70 return ; / / Read r a y d a t a from g l o b a l memory 73 f l o a t 3 e = f 4 t o f 3 ( r a y o r i g o [ mempos ] ) ; 74 f l o a t 3 d = f 4 t o f 3 ( r a y d i r e c t i o n [ mempos ] ) ; 75 / / f l o a t s t e p = l e n g t h ( d ) ; / / Distance, in ray steps, between o b j e c t and eye i n i t i a l i z e d with a l a r g e v a l u e 78 f l o a t t d i s t = 1 e10f ; / / S u r f a c e normal a t c l o s e s t s p h e r e i n t e r s e c t i o n 81 f l o a t 3 n ; / / I n t e r s e c t i o n p o i n t c o o r d i n a t e s 84 f l o a t 3 p ; / / I t e r a t e t h r o u g h a l l p a r t i c l e s 87 f o r ( unsigned i n t i =0; i<np ; i ++) { / / Read s p h e r e c o o r d i n a t e and r a d i u s 90 f l o a t 3 c = f 4 t o f 3 ( p [ i ] ) ; 91 f l o a t R = p [ i ]. w; / / C a l c u l a t e t h e d i s c r i m i n a n t : d = Bˆ2 4AC 94 f l o a t D e l t a = ( 2. 0 f d o t ( d, ( e c ) ) ) ( 2. 0 f d o t ( d, ( e c ) ) ) / / Bˆ f d o t ( d, d ) / / 4 A 96 ( d o t ( ( e c ), ( e c ) ) R R) ; / / C / / I f t h e d e t e r m i n a n t i s p o s i t i v e, t h e r e a r e two s o l u t i o n s 99 / / One where t h e l i n e e n t e r s t h e sphere, and one where i t e x i t s 100 i f ( D e l t a > 0. 0 f ) { / / C a l c u l a t e r o o t s, S h i r l e y 2009 p f l o a t t minus = ( ( d o t( d, ( e c ) ) s q r t ( d o t ( d, ( e c ) ) d o t ( d, ( e c ) ) d o t ( d, d ) 104 ( d o t ( ( e c ), ( e c ) ) R R) ) ) / d o t ( d, d ) ) ; / / Check w e t h e r i n t e r s e c t i o n i s c l o s e r t h a n p r e v i o u s v a l u e s 107 i f ( f a b s ( t minus ) < t d i s t ) { 108 p = e + t minus d ; 109 t d i s t = f a b s ( t minus ) ; 110 n = n o r m a l i z e ( 2. 0 f ( p c ) ) ; / / S u r f a c e normal 111 } } / / End of s o l u t i o n b r a n c h } / / End of p a r t i c l e loop / / Write p i x e l c o l o r 118 i f ( t d i s t < 1 e10 ) { / / L a m b e r t i a n s h a d i n g p a r a m e t e r s 121 f l o a t d o t p r o d = f a b s ( d o t ( n, f 4 t o f 3 ( c o n s t l i g h t ) ) ) ; 122 f l o a t I d = f ; / / L i g h t i n t e n s i t y 123 f l o a t k d = 5. 0 f ; / / D i f f u s e c o e f f i c i e n t / / Ambient s h a d i n g 126 f l o a t k a = f ; 127 f l o a t I a = 5. 0 f ; / / Write s h a d i n g model v a l u e s t o p i x e l c o l o r c h a n n e l s 130 img [ mempos 4] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.48 f ) ; 132 img [ mempos 4 + 1] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.41 f ) ; 134 img [ mempos 4 + 2] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.27 f ) ; 136 } 137 } e x t e r n C 141 h o s t void c a m e r a I n i t ( f l o a t 4 eye, f l o a t 4 l o o k a t, f l o a t imgw, f l o a t h w r a t i o ) 142 { 143 / / Image dimensions in world space ( l, r, b, t ) 144 f l o a t 4 imgplane = m a k e f l o a t 4 ( 0.5 f imgw, 0. 5 f imgw, 0.5 f imgw hw ratio, 0. 5 f imgw h w r a t i o ) ; / / The view v e c t o r 147 f l o a t 4 view = eye l o o k a t ; / / C o n s t r u c t t h e camera view o r t h o n o r m a l base 150 f l o a t 4 v = m a k e f l o a t 4 ( 0. 0 f, 1. 0 f, 0. 0 f, 0. 0 f ) ; / / v : P o i n t i n g upward 151 f l o a t 4 w = view / l e n g t h ( view ) ; / / w: P o i n t i n g backwards 152 f l o a t 4 u = m a k e f l o a t 4 ( c r o s s ( m a k e f l o a t 3 ( v. x, v. y, v. z ), 153 m a k e f l o a t 3 (w. x, w. y, w. z ) ), f ) ; / / u : P o i n t i n g r i g h t / / F o c a l l e n g t h 20% of eye v e c t o r l e n g t h 157 f l o a t d = l e n g t h ( view ) 0.8 f ; / / L i g h t d i r e c t i o n ( p o i n t s t o w a r d s l i g h t s o u r c e ) 160 f l o a t 4 l i g h t = n o r m a l i z e ( 1.0 f eye m a k e f l o a t 4 ( 1. 0 f, 0. 2 f, 0. 6 f, 0. 0 f ) ) ; s t d : : c o u t << T r a n s f e r i n g camera v a l u e s t o c o n s t a n t memory\n ; cudamemcpytosymbol ( c o n s t u, &u, s i z e o f ( u ) ) ; 165 cudamemcpytosymbol ( c o n s t v, &v, s i z e o f ( v ) ) ; 166 cudamemcpytosymbol ( const w, &w, s i z e o f (w) ) ; 167 cudamemcpytosymbol ( c o n s t e y e, &eye, s i z e o f ( eye ) ) ; 168 cudamemcpytosymbol ( const imgplane, &imgplane, s i z e o f ( imgplane ) ) ; 169 cudamemcpytosymbol ( c o n s t d, &d, s i z e o f ( d ) ) ; 170 cudamemcpytosymbol ( c o n s t l i g h t, &l i g h t, s i z e o f ( l i g h t ) ) ; 171 } 172
7 DATA PARALLEL COMPUTING / / Check f o r CUDA e r r o r s 174 e xtern C 175 host void checkforcudaerrors ( c o n s t char c h e c k p o i n t d e s c r i p t i o n ) 176 { 177 c u d a E r r o r t e r r = c u d a G e t L a s t E r r o r ( ) ; 178 i f ( e r r!= c u d a S u c c e s s ) { 179 s t d : : c o u t << \ncuda e r r o r d e t e c t e d, c h e c k p o i n t : << c h e c k p o i n t d e s c r i p t i o n 180 << \ n E r r o r s t r i n g : << c u d a G e t E r r o r S t r i n g ( e r r ) << \n ; 181 e x i t ( EXIT FAILURE ) ; 182 } 183 } / / Wrapper f o r t h e r t k e r n e l 187 e xtern C 188 h o s t i n t r t ( f l o a t 4 p, unsigned i n t np, 189 rgb img, unsigned i n t width, unsigned i n t height, 190 f3 o r i g o, f3 L, f3 eye, f3 l o o k a t, f l o a t imgw ) { using s t d : : c o u t ; c o u t << I n i t i a l i z i n g CUDA:\ n ; / / I n i t i a l i z e GPU timestamp r e c o r d e r s 197 f l o a t t1, t 2 ; 198 c u d a E v e n t t t1 go, t2 go, t 1 s t o p, t 2 s t o p ; 199 cudaeventcreate (& t1 go ) ; 200 cudaeventcreate (& t2 go ) ; 201 c u d a E v e n t C r e a t e (& t 2 s t o p ) ; 202 c u d a E v e n t C r e a t e (& t 1 s t o p ) ; / / S t a r t t i m e r cudaeventrecord ( t1 go, 0) ; / / A l l o c a t e memory 208 c o u t << A l l o c a t i n g d e v i c e memory\n ; 209 s t a t i c f l o a t 4 p ; / / P a r t i c l e p o s i t i o n s ( x, y, z ) and r a d i u s (w) 210 s t a t i c unsigned char img ; / / RGBw v a l u e s i n image 211 s t a t i c f l o a t 4 r a y o r i g o ; / / Ray o r i g o ( x, y, z ) 212 s t a t i c f l o a t 4 r a y d i r e c t i o n ; / / Ray d i r e c t i o n ( x, y, z ) 213 cudamalloc ( ( void )& p, np s i z e o f ( f l o a t 4 ) ) ; 214 cudamalloc ( ( void )& img, width h e i g h t 4 s i z e o f ( unsigned char ) ) ; 215 cudamalloc ( ( void )& ray origo, width h e i g h t s i z e o f ( f l o a t 4 ) ) ; 216 cudamalloc ( ( void )& r a y d i r e c t i o n, width h e i g h t s i z e o f ( f l o a t 4 ) ) ; / / T r a n s f e r p a r t i c l e d a t a 219 c o u t << T r a n s f e r i n g p a r t i c l e d a t a : h o s t > d e v i c e \n ; 220 cudamemcpy ( p, p, np s i z e o f ( f l o a t 4 ), cudamemcpyhosttodevice ) ; / / Check f o r e r r o r s a f t e r memory a l l o c a t i o n 223 c h e c k F o r C u d a E r r o r s ( CUDA e r r o r a f t e r memory a l l o c a t i o n ) ; / / Arrange t h r e a d / b l o c k s t r u c t u r e 226 unsigned i n t p i x e l s = width h e i g h t ; 227 f l o a t h w r a t i o = ( f l o a t ) h e i g h t / ( f l o a t ) width ; 228 dim3 t h r e a d s ( 1 6, 1 6 ) ; 229 dim3 b l o c k s ( ( width +15) / 1 6, ( h e i g h t +15) / 1 6 ) ; / / S t a r t t i m e r cudaeventrecord ( t2 go, 0) ; / / I n i t i a l i z e image t o background c o l o r 235 i m a g e I n i t <<< blocks, t h r e a d s >>>( img, p i x e l s ) ; / / I n i t i a l i z e camera 238 c a m e r a I n i t ( m a k e f l o a t 4 ( eye. x, eye. y, eye. z, 0. 0 f ), 239 m a k e f l o a t 4 ( l o o k a t. x, l o o k a t. y, l o o k a t. z, 0. 0 f ), 240 imgw, hw ratio ) ; 241 c h e c k F o r C u d a E r r o r s ( CUDA e r r o r a f t e r c a m e r a I n i t ) ; / / C o n s t r u c t r a y s f o r p e r s p e c t i v e p r o j e c t i o n 244 r a y I n i t P e r s p e c t i v e <<< blocks, t h r e a d s >>>( 245 r a y o r i g o, r a y d i r e c t i o n, 246 m a k e f l o a t 4 ( eye. x, eye. y, eye. z, 0. 0 f ), 247 width, h e i g h t ) ; / / Find c l o s e s t i n t e r s e c t i o n between r a y s and s p h e r e s 250 r a y I n t e r s e c t S p h e r e s <<< blocks, t h r e a d s >>>( 251 r a y o r i g o, r a y d i r e c t i o n, 252 p, img, p i x e l s, np ) ; / / Make s u r e a l l t h r e a d s a r e done b e f o r e c o n t i n u i n g CPU c o n t r o l s e q u e n c e 256 cudathreadsynchronize ( ) ; / / Check f o r e r r o r s 259 c h e c k F o r C u d a E r r o r s ( CUDA e r r o r a f t e r k e r n e l e x e c u t i o n ) ; / / Stop t i m e r cudaeventrecord ( t 2 s t o p, 0) ; 263 c u d a E v e n t S y n c h r o n i z e ( t 2 s t o p ) ; / / T r a n s f e r image d a t a from d e v i c e t o h o s t 266 c o u t << T r a n s f e r i n g image d a t a : d e v i c e > h o s t \n ; 267 cudamemcpy ( img, img, width h e i g h t 4 s i z e o f ( unsigned char ), cudamemcpydevicetohost ) ; / / F ree d y n a m i c a l l y a l l o c a t e d d e v i c e memory 270 c u d a F r e e ( p ) ; 271 cudafree ( img ) ; 272 c u d a F r e e ( r a y o r i g o ) ; 273 c u d a F r e e ( r a y d i r e c t i o n ) ; / / Stop t i m e r cudaeventrecord ( t 1 s t o p, 0) ; 277 c u d a E v e n t S y n c h r o n i z e ( t 1 s t o p ) ; / / C a l c u l a t e t ime s p e n t i n t 1 and t cudaeventelapsedtime (& t1, t1 go, t 1 s t o p ) ; 281 cudaeventelapsedtime (& t2, t2 go, t 2 s t o p ) ; / / R e p o r t time s p e n t 284 c o u t << Time s p e n t on e n t i r e GPU r o u t i n e : 285 << t 1 << ms\n ; 286 cout << Kernels : << t2 << ms\n 287 << Memory a l l o c. and t r a n s f e r : << t1 t 2 << ms\n ; / / R e t u r n s u c c e s s f u l l y 290 return 0 ; 291 } B. CPU raytracing source code Listing 3. rt kernel cpu.h 1 # i f n d e f RT KERNEL CPU H 2 # d e f i n e RT KERNEL CPU H 3 4 # i n c l u d e <v e c t o r f u n c t i o n s. h> 5 6 / / Host p r o t o t y p e f u n c t i o n s 7 8 void c a m e r a I n i t ( f l o a t 3 eye, f l o a t 3 l o o k a t, f l o a t imgw, f l o a t h w r a t i o ) ; 9 10 i n t r t c p u ( f l o a t 4 p, c o n s t unsigned i n t np, 11 rgb img, const unsigned i n t width, const unsigned i n t height, 12 f3 o r i g o, f3 L, f3 eye, f3 l o o k a t, f l o a t imgw ) ; # e n d i f Listing 4. rt kernel cpu.cpp 1 # i n c l u d e <i o s t r e a m > 2 # i n c l u d e <math. h> 3 # i n c l u d e <time. h> 4 # i n c l u d e <v e c t o r f u n c t i o n s. h> 5 # i n c l u d e <c u t i l m a t h. h> 6 # i n c l u d e h e a d e r. h 7 # i n c l u d e r t k e r n e l c p u. h 8 9 / / C o n s t a n t s
8 DATA PARALLEL COMPUTING f l o a t 3 c o n s t c u ; 11 f l o a t 3 c o n s t c v ; 12 f l o a t 3 constc w ; 13 f l o a t 3 c o n s t c e y e ; 14 f l o a t 4 c o n s t c i m g p l a n e ; 15 f l o a t c o n s t c d ; 16 f l o a t 3 c o n s t c l i g h t ; i n l i n e f l o a t 3 f 4 t o f 3 ( f l o a t 4 i n ) 19 { 20 return m a k e f l o a t 3 ( i n. x, i n. y, i n. z ) ; 21 } i n l i n e f l o a t 4 f 3 t o f 4 ( f l o a t 3 i n ) 24 { 25 return m a k e f l o a t 4 ( i n. x, i n. y, i n. z, 0. 0 f ) ; 26 } i n l i n e f l o a t l e n g t h f 3 ( f l o a t 3 i n ) 29 { 30 return s q r t ( i n. x i n. x + i n. y i n. y + i n. z i n. z ) ; 31 } / / K ernel f o r i n i t i a l i z i n g image d a t a 34 void i m a g e I n i t c p u ( unsigned char img, unsigned i n t p i x e l s ) 35 { 36 f o r ( unsigned i n t mempos =0; mempos<p i x e l s ; mempos++) { 37 img [ mempos 4] = 0 ; / / Red channel 38 img [ mempos 4 + 1] = 0 ; / / Green channel 39 img [ mempos 4 + 2] = 0 ; / / Blue channel 40 } 41 } / / C a l c u l a t e r a y o r i g i n s and d i r e c t i o n s 44 void r a y I n i t P e r s p e c t i v e c p u ( f l o a t 3 r a y o r i g o, 45 f l o a t 3 r a y d i r e c t i o n, 46 f l o a t 3 eye, 47 unsigned i n t width, 48 unsigned i n t h e i g h t ) 49 { 50 i n t i ; 51 # pragma omp p a r a l l e l f o r 52 f o r ( i =0; i<width ; i ++) { 53 f o r ( unsigned i n t j =0; j<h e i g h t ; j ++) { unsigned i n t mempos = i + j h e i g h t ; / / C a l c u l a t e p i x e l c o o r d i n a t e s i n image p l a n e 58 f l o a t p u = c o n s t c i m g p l a n e. x + ( c o n s t c i m g p l a n e. y c o n s t c i m g p l a n e. x ) 59 ( i f ) / width ; 60 f l o a t p v = c o n s t c i m g p l a n e. z + ( constc imgplane.w constc imgplane. z ) 61 ( j f ) / h e i g h t ; / / Write r a y o r i g o and d i r e c t i o n t o g l o b a l memory 64 r a y o r i g o [ mempos ] = c o n s t c e y e ; 65 r a y d i r e c t i o n [ mempos ] = c o n s t c d constc w + p u constc u + p v constc v ; 66 } 67 } 68 } / / Check w ether t h e p i x e l s viewing r a y i n t e r s e c t s with t h e s p h e r e s, 71 / / and shade t h e p i x e l c o r r e s p o n d i n g l y 72 void r a y I n t e r s e c t S p h e r e s c p u ( f l o a t 3 r a y o r i g o, 73 f l o a t 3 r a y d i r e c t i o n, 74 f l o a t 4 p, 75 unsigned char img, 76 unsigned i n t p i x e l s, 77 unsigned i n t np ) 78 { 79 i n t mempos ; 80 # pragma omp p a r a l l e l f o r 81 f o r ( mempos =0; mempos<p i x e l s ; mempos++) { / / Read r a y d a t a from g l o b a l memory 84 f l o a t 3 e = r a y o r i g o [ mempos ] ; 85 f l o a t 3 d = r a y d i r e c t i o n [ mempos ] ; 86 / / f l o a t s t e p = l e n g t h f 3 ( d ) ; / / Distance, in ray steps, between o b j e c t and eye i n i t i a l i z e d with a l a r g e v a l u e 89 f l o a t t d i s t = 1 e10f ; / / S u r f a c e normal a t c l o s e s t s p h e r e i n t e r s e c t i o n 92 f l o a t 3 n ; / / I n t e r s e c t i o n p o i n t c o o r d i n a t e s 95 f l o a t 3 p ; / / I t e r a t e t h r o u g h a l l p a r t i c l e s 98 f o r ( unsigned i n t i =0; i<np ; i ++) { / / Read s p h e r e c o o r d i n a t e and r a d i u s 101 f l o a t 3 c = f 4 t o f 3 ( p [ i ] ) ; 102 f l o a t R = p [ i ]. w; / / C a l c u l a t e t h e d i s c r i m i n a n t : d = Bˆ2 4AC 105 f l o a t D e l t a = ( 2. 0 f d o t ( d, ( e c ) ) ) ( 2. 0 f d o t ( d, ( e c ) ) ) / / Bˆ f d o t ( d, d ) / / 4 A 107 ( d o t ( ( e c ), ( e c ) ) R R) ; / / C / / I f t h e d e t e r m i n a n t i s p o s i t i v e, t h e r e a r e two s o l u t i o n s 110 / / One where t h e l i n e e n t e r s t h e sphere, and one where i t e x i t s 111 i f ( D e l t a > 0. 0 f ) { / / C a l c u l a t e r o o t s, S h i r l e y 2009 p f l o a t t minus = ( ( d o t( d, ( e c ) ) s q r t ( d o t ( d, ( e c ) ) d o t ( d, ( e c ) ) d o t ( d, d ) 115 ( d o t ( ( e c ), ( e c ) ) R R) ) ) / d o t ( d, d ) ) ; / / Check w e t h e r i n t e r s e c t i o n i s c l o s e r t h a n p r e v i o u s v a l u e s 118 i f ( f a b s ( t minus ) < t d i s t ) { 119 p = e + t minus d ; 120 t d i s t = f a b s ( t minus ) ; 121 n = n o r m a l i z e ( 2. 0 f ( p c ) ) ; / / S u r f a c e normal 122 } } / / End of s o l u t i o n b r a n c h } / / End of p a r t i c l e loop / / Write p i x e l c o l o r 129 i f ( t d i s t < 1 e10 ) { / / L a m b e r t i a n s h a d i n g p a r a m e t e r s 132 f l o a t d o t p r o d = f a b s ( d o t ( n, c o n s t c l i g h t ) ) ; 133 f l o a t I d = f ; / / L i g h t i n t e n s i t y 134 f l o a t k d = 5. 0 f ; / / D i f f u s e c o e f f i c i e n t / / Ambient s h a d i n g 137 f l o a t k a = f ; 138 f l o a t I a = 5. 0 f ; / / Write s h a d i n g model v a l u e s t o p i x e l c o l o r c h a n n e l s 141 img [ mempos 4] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.48 f ) ; 143 img [ mempos 4 + 1] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.41 f ) ; 145 img [ mempos 4 + 2] = ( unsigned char ) ( ( k d I d d o t p r o d k a I a ) 0.27 f ) ; 147 } 148 } 149 } void c a m e r a I n i t c p u ( f l o a t 3 eye, f l o a t 3 l o o k a t, f l o a t imgw, f l o a t h w r a t i o ) 153 { 154 / / Image dimensions in world space ( l, r, b, t ) 155 f l o a t 4 imgplane = m a k e f l o a t 4 ( 0.5 f imgw, 0. 5 f imgw, 0.5 f imgw hw ratio, 0. 5 f imgw h w r a t i o ) ; / / The view v e c t o r 158 f l o a t 3 view = eye l o o k a t ; / / C o n s t r u c t t h e camera view o r t h o n o r m a l base
9 DATA PARALLEL COMPUTING f l o a t 3 v = m a k e f l o a t 3 ( 0. 0 f, 1. 0 f, 0. 0 f ) ; / / v : P o i n t i n g upward 162 f l o a t 3 w = view / l e n g t h f 3 ( view ) ; / / w: P o i n t i n g backwards 163 f l o a t 3 u = c r o s s ( m a k e f l o a t 3 ( v. x, v. y, v. z ), m a k e f l o a t 3 (w. x, w. y, w. z ) ) ; / / u : P o i n t i n g r i g h t / / F o c a l l e n g t h 20% of eye v e c t o r l e n g t h 166 f l o a t d = l e n g t h f 3 ( view ) 0.8 f ; / / L i g h t d i r e c t i o n ( p o i n t s t o w a r d s l i g h t s o u r c e ) 169 f l o a t 3 l i g h t = n o r m a l i z e ( 1.0 f eye m a k e f l o a t 3 ( 1. 0 f, 0. 2 f, 0. 6 f ) ) ; s t d : : c o u t << T r a n s f e r i n g camera v a l u e s t o c o n s t a n t memory\n ; constc u = u ; 174 constc v = v ; 175 constc w = w; 176 c o n s t c e y e = eye ; 177 constc imgplane = imgplane ; 178 constc d = d ; 179 c o n s t c l i g h t = l i g h t ; 180 } / / Wrapper f o r t h e r t a l g o r i t h m 184 i n t r t c p u ( f l o a t 4 p, unsigned i n t np, 185 rgb img, unsigned i n t width, unsigned i n t height, 186 f3 o r i g o, f3 L, f3 eye, f3 l o o k a t, f l o a t imgw ) { using s t d : : c o u t ; c o u t << I n i t i a l i z i n g CPU r a y t r a c e r :\ n ; / / I n i t i a l i z e GPU timestamp r e c o r d e r s 193 f l o a t t1 go, t2 go, t 1 s t o p, t 2 s t o p ; / / S t a r t t i m e r t1 go = clock ( ) ; / / A l l o c a t e memory 199 c o u t << A l l o c a t i n g d e v i c e memory\n ; 200 s t a t i c unsigned char img ; / / RGBw v a l u e s i n image 201 s t a t i c f l o a t 3 r a y o r i g o ; / / Ray o r i g o ( x, y, z ) 202 s t a t i c f l o a t 3 r a y d i r e c t i o n ; / / Ray d i r e c t i o n ( x, y, z ) 203 img = new unsigned char [ width h e i g h t 4 ] ; 204 r a y o r i g o = new f l o a t 3 [ width h e i g h t ] ; 205 r a y d i r e c t i o n = new f l o a t 3 [ width h e i g h t ] ; / / Arrange t h r e a d / b l o c k s t r u c t u r e 208 unsigned i n t p i x e l s = width h e i g h t ; 209 f l o a t h w r a t i o = ( f l o a t ) h e i g h t / ( f l o a t ) width ; / / S t a r t t i m e r t2 go = clock ( ) ; / / I n i t i a l i z e image t o background c o l o r 215 i m a g e I n i t c p u ( img, p i x e l s ) ; / / I n i t i a l i z e camera 218 c a m e r a I n i t c p u ( m a k e f l o a t 3 ( eye. x, eye. y, eye. z ), 219 m a k e f l o a t 3 ( l o o k a t. x, l o o k a t. y, l o o k a t. z ), 220 imgw, hw ratio ) ; / / C o n s t r u c t r a y s f o r p e r s p e c t i v e p r o j e c t i o n 223 r a y I n i t P e r s p e c t i v e c p u ( 224 r a y o r i g o, r a y d i r e c t i o n, 225 m a k e f l o a t 3 ( eye. x, eye. y, eye. z ), 226 width, h e i g h t ) ; / / Find c l o s e s t i n t e r s e c t i o n between r a y s and s p h e r e s 229 r a y I n t e r s e c t S p h e r e s c p u ( 230 r a y o r i g o, r a y d i r e c t i o n, 231 p, img, p i x e l s, np ) ; / / Stop t i m e r t 2 s t o p = c l o c k ( ) ; memcpy ( img, img, s i z e o f ( unsigned char ) p i x e l s 4) ; / / F ree d y n a m i c a l l y a l l o c a t e d d e v i c e memory 239 d e l e t e [ ] img ; 240 d e l e t e [ ] r a y o r i g o ; 241 d e l e t e [ ] r a y d i r e c t i o n ; / / Stop t i m e r t 1 s t o p = c l o c k ( ) ; / / R e p o r t time s p e n t 247 c o u t << Time s p e n t on e n t i r e CPU r a y t r a c i n g r o u t i n e : 248 << ( t 1 s t o p t1 go ) / CLOCKS PER SEC << ms\n ; 249 c o u t << F u n c t i o n s : << ( t 2 s t o p t2 go ) / CLOCKS PER SEC << ms\n ; / / R e t u r n s u c c e s s f u l l y 252 return 0 ; 253 }
10 DATA PARALLEL COMPUTING A PPENDIX C B LOOPERS This section contains pictures of the unfinished raytracer in a malfunctioning state, which can provide interesting, however unusable, results. Fig. 5. An end result encountered way too many times, e.g. after inadvertently dividing the focal length with zero. Fig. 6. Fig. 8. Colorful result caused by forgetting to normalize the normal vector calculation. Twisted appearance caused by wrong thread/block layout. Fig. 9. Another colorful result with wrong order of particles, cause by using abs() (meant for integers) instead of fabs() (meant for floats). Fig. 7. A problem with the distance determination caused the particles to be displayed in a wrong order.
High-performance processing and development with Madagascar. July 24, 2010 Madagascar development team
High-performance processing and development with Madagascar July 24, 2010 Madagascar development team Outline 1 HPC terminology and frameworks 2 Utilizing data parallelism 3 HPC development with Madagascar
More informationarxiv: v1 [hep-lat] 7 Oct 2010
arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA
More informationCS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/
CS-206 Concurrency Lecture 13 Wrap Up Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/ Created by Nooshin Mirzadeh, Georgios Psaropoulos and Babak Falsafi EPFL Copyright 2015 EPFL CS-206 Spring
More informationA CUDA Solver for Helmholtz Equation
Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationGPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications
GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign
More informationParallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)
Bulletin of Networking, Computing, Systems, and Software www.bncss.org, ISSN 2186-5140 Volume 7, Number 1, pages 28 32, January 2018 Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)
More informationS0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA
S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA Date: 16th May 2012 Wed, 3pm to 3.25pm(Adv. Session) Sathyanarayana K., Manish Banga, and Ravi Kumar G. V. V. Engineering Services,
More informationGPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic
GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More informationCOMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD
XVIII International Conference on Water Resources CMWR 2010 J. Carrera (Ed) c CIMNE, Barcelona, 2010 COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD James.E. McClure, Jan F. Prins
More informationAuto-Tuning Complex Array Layouts for GPUs - Supplemental Material
BIN COUNT EGPGV,. This is the author version of the work. It is posted here by permission of Eurographics for your personal use. Not for redistribution. The definitive version is available at http://diglib.eg.org/.
More informationGPU Acceleration of BCP Procedure for SAT Algorithms
GPU Acceleration of BCP Procedure for SAT Algorithms Hironori Fujii 1 and Noriyuki Fujimoto 1 1 Graduate School of Science Osaka Prefecture University 1-1 Gakuencho, Nakaku, Sakai, Osaka 599-8531, Japan
More information11 Parallel programming models
237 // Program Design 10.3 Assessing parallel programs 11 Parallel programming models Many different models for expressing parallelism in programming languages Actor model Erlang Scala Coordination languages
More informationWelcome to MCS 572. content and organization expectations of the course. definition and classification
Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson
More informationSolving PDEs with CUDA Jonathan Cohen
Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear
More informationOn Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code
On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance
More informationLogo. A Massively-Parallel Multicore Acceleration of a Point Contact Solid Mechanics Simulation DRAFT
Paper 1 Logo Civil-Comp Press, 2017 Proceedings of the Fifth International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, P. Iványi, B.H.V Topping and G. Várady (Editors)
More informationPorting RSL to C++ Ryusuke Villemin, Christophe Hery. Pixar Technical Memo 12-08
Porting RSL to C++ Ryusuke Villemin, Christophe Hery Pixar Technical Memo 12-08 1 Introduction In a modern renderer, relying on recursive ray-tracing, the number of shader calls increases by one or two
More informationDynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationResearch on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method
NUCLEAR SCIENCE AND TECHNIQUES 25, 0501 (14) Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method XU Qi ( 徐琪 ), 1, YU Gang-Lin ( 余纲林 ), 1 WANG Kan ( 王侃 ),
More informationMultiphase Flow Simulations in Inclined Tubes with Lattice Boltzmann Method on GPU
Multiphase Flow Simulations in Inclined Tubes with Lattice Boltzmann Method on GPU Khramtsov D.P., Nekrasov D.A., Pokusaev B.G. Department of Thermodynamics, Thermal Engineering and Energy Saving Technologies,
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationAccelerating Model Reduction of Large Linear Systems with Graphics Processors
Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex
More informationCalculation of ground states of few-body nuclei using NVIDIA CUDA technology
Calculation of ground states of few-body nuclei using NVIDIA CUDA technology M. A. Naumenko 1,a, V. V. Samarin 1, 1 Flerov Laboratory of Nuclear Reactions, Joint Institute for Nuclear Research, 6 Joliot-Curie
More informationPanorama des modèles et outils de programmation parallèle
Panorama des modèles et outils de programmation parallèle Sylvain HENRY sylvain.henry@inria.fr University of Bordeaux - LaBRI - Inria - ENSEIRB April 19th, 2013 1/45 Outline Introduction Accelerators &
More informationDr. Andrea Bocci. Using GPUs to Accelerate Online Event Reconstruction. at the Large Hadron Collider. Applied Physicist
Using GPUs to Accelerate Online Event Reconstruction at the Large Hadron Collider Dr. Andrea Bocci Applied Physicist On behalf of the CMS Collaboration Discover CERN Inside the Large Hadron Collider at
More informationInformation Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and
Accelerating the Multifrontal Method Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes {rflucas,genew,ddavis}@isi.edu and grimes@lstc.com 3D Finite Element
More informationFaster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs
Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Christopher P. Stone, Ph.D. Computational Science and Engineering, LLC Kyle Niemeyer, Ph.D. Oregon State University 2 Outline
More informationAntti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum
More informationChe-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University
Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.
More informationA Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures
A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationResearch into GPU accelerated pattern matching for applications in computer security
Research into GPU accelerated pattern matching for applications in computer security November 4, 2009 Alexander Gee age19@student.canterbury.ac.nz Department of Computer Science and Software Engineering
More informationScalable and Power-Efficient Data Mining Kernels
Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the
More informationSparse LU Factorization on GPUs for Accelerating SPICE Simulation
Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationHIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU
April 4-7, 2016 Silicon Valley HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU Minmin Sun, NVIDIA minmins@nvidia.com April 5th Brief Introduction of CTC AGENDA Alpha/Beta Matrix
More informationAdministrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application
Administrivia 1. markem/cs333/ 2. Staff 3. Prerequisites 4. Grading Course Objectives 1. Theory and application 2. Benefits 3. Labs TAs Overview 1. What is a computer system? CPU PC ALU System bus Memory
More informationA Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!
More informationAcceleration of WRF on the GPU
Acceleration of WRF on the GPU Daniel Abdi, Sam Elliott, Iman Gohari Don Berchoff, Gene Pache, John Manobianco TempoQuest 1434 Spruce Street Boulder, CO 80302 720 726 9032 TempoQuest.com THE WORLD S FASTEST
More informationIntroduction to Benchmark Test for Multi-scale Computational Materials Software
Introduction to Benchmark Test for Multi-scale Computational Materials Software Shun Xu*, Jian Zhang, Zhong Jin xushun@sccas.cn Computer Network Information Center Chinese Academy of Sciences (IPCC member)
More informationNEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0
NEC PerforCache Influence on M-Series Disk Array Behavior and Performance. Version 1.0 Preface This document describes L2 (Level 2) Cache Technology which is a feature of NEC M-Series Disk Array implemented
More informationParallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB
Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Supplemental material Guido Klingbeil, Radek Erban, Mike Giles, and Philip K. Maini This document
More informationRandomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory
Randomized Selection on the GPU Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory High Performance Graphics 2011 August 6, 2011 Top k Selection on GPU Output the top k keys
More informationAntonio Falabella. 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, September 2015, Hamburg
INFN - CNAF (Bologna) 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, 14-25 September 2015, Hamburg 1 / 44 Overview 1 2 3 4 5 2 / 44 to Computing The
More informationECEN 449: Microprocessor System Design Department of Electrical and Computer Engineering Texas A&M University
ECEN 449: Microprocessor System Design Department of Electrical and Computer Engineering Texas A&M University Prof. Sunil P Khatri (Lab exercise created and tested by Ramu Endluri, He Zhou and Sunil P
More informationParallel Polynomial Evaluation
Parallel Polynomial Evaluation Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics, and Computer Science http://www.math.uic.edu/ jan jan@math.uic.edu
More informationMAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors
MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:
More informationAccelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationHPMPC - A new software package with efficient solvers for Model Predictive Control
- A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model
More informationParallelism of MRT Lattice Boltzmann Method based on Multi-GPUs
Parallelism of MRT Lattice Boltzmann Method based on Multi-GPUs 1 School of Information Engineering, China University of Geosciences (Beijing) Beijing, 100083, China E-mail: Yaolk1119@icloud.com Ailan
More informationTips Geared Towards R. Adam J. Suarez. Arpil 10, 2015
Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done
More informationInteger Programming. Wolfram Wiesemann. December 6, 2007
Integer Programming Wolfram Wiesemann December 6, 2007 Contents of this Lecture Revision: Mixed Integer Programming Problems Branch & Bound Algorithms: The Big Picture Solving MIP s: Complete Enumeration
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More informationUsing the Timoshenko Beam Bond Model: Example Problem
Using the Timoshenko Beam Bond Model: Example Problem Authors: Nick J. BROWN John P. MORRISSEY Jin Y. OOI School of Engineering, University of Edinburgh Jian-Fei CHEN School of Planning, Architecture and
More informationHeterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry
Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)
More informationIntroduction to numerical computations on the GPU
Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming
More informationGPU-to-GPU and Host-to-Host Multipattern String Matching On A GPU
GPU-to-GPU and Host-to-Host Multipattern Sing Matching On A GPU Xinyan Zha and Sartaj Sahni Computer and Information Science and Engineering University of Florida Gainesville, FL 32611 Email: {xzha, sahni}@cise.ufl.edu
More informationSparse solver 64 bit and out-of-core addition
Sparse solver 64 bit and out-of-core addition Prepared By: Richard Link Brian Yuen Martec Limited 1888 Brunswick Street, Suite 400 Halifax, Nova Scotia B3J 3J8 PWGSC Contract Number: W7707-145679 Contract
More informationA microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS
GTC 20130319 A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS Erik Lindahl erik.lindahl@scilifelab.se Molecular Dynamics Understand biology We re comfortably on
More informationRAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures
RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures Guangyan Zhang, Zican Huang, Xiaosong Ma SonglinYang, Zhufan Wang, Weimin Zheng Tsinghua University Qatar Computing Research
More informationPOLITECNICO DI MILANO DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS
POLITECNICO DI MILANO Facoltà di Ingegneria dell Informazione Corso di Laurea in Ingegneria Informatica DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS Relatore: Prof.
More informationPSEUDORANDOM numbers are very important in practice
Proceedings of the Federated Conference on Computer Science and Information Systems pp 571 578 ISBN 978-83-681-51-4 Parallel GPU-accelerated Recursion-based Generators of Pseudorandom Numbers Przemysław
More informationS XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library
S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,
More informationHybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS
Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures
More informationLattice Boltzmann simulations on heterogeneous CPU-GPU clusters
Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts
More informationAccelerated Neutrino Oscillation Probability Calculations and Reweighting on GPUs. Richard Calland
Accelerated Neutrino Oscillation Probability Calculations and Reweighting on GPUs Richard Calland University of Liverpool GPU Computing in High Energy Physics University of Pisa, 11th September 2014 Introduction
More informationFirst, a look at using OpenACC on WRF subroutine advance_w dynamics routine
First, a look at using OpenACC on WRF subroutine advance_w dynamics routine Second, an estimate of WRF multi-node performance on Cray XK6 with GPU accelerators Based on performance of WRF kernels, what
More informationBlock AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark
Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise
More informationSIMULATION OF ISING SPIN MODEL USING CUDA
SIMULATION OF ISING SPIN MODEL USING CUDA MIRO JURIŠIĆ Supervisor: dr.sc. Dejan Vinković Split, November 2011 Master Thesis in Physics Department of Physics Faculty of Natural Sciences and Mathematics
More informationAccelerating Quantum Chromodynamics Calculations with GPUs
Accelerating Quantum Chromodynamics Calculations with GPUs Guochun Shi, Steven Gottlieb, Aaron Torok, Volodymyr Kindratenko NCSA & Indiana University National Center for Supercomputing Applications University
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationThe conceptual view. by Gerrit Muller University of Southeast Norway-NISE
by Gerrit Muller University of Southeast Norway-NISE e-mail: gaudisite@gmail.com www.gaudisite.nl Abstract The purpose of the conceptual view is described. A number of methods or models is given to use
More informationAn Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor
An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor Christian Lessig Abstract The Algorithm of Multiple Relatively Robust Representations (MRRR) is one of the most efficient and accurate
More informationCRYPTOGRAPHIC COMPUTING
CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,
More informationAN2044 APPLICATION NOTE
AN2044 APPLICATION NOTE OPERATING PRINCIPALS FOR PRACTISPIN STEPPER MOTOR MOTION CONTROL 1 REQUIREMENTS In operating a stepper motor system one of the most common requirements will be to execute a relative
More informationHigh-Performance Computing, Planet Formation & Searching for Extrasolar Planets
High-Performance Computing, Planet Formation & Searching for Extrasolar Planets Eric B. Ford (UF Astronomy) Research Computing Day September 29, 2011 Postdocs: A. Boley, S. Chatterjee, A. Moorhead, M.
More informationCS 700: Quantitative Methods & Experimental Design in Computer Science
CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,
More informationAll-electron density functional theory on Intel MIC: Elk
All-electron density functional theory on Intel MIC: Elk W. Scott Thornton, R.J. Harrison Abstract We present the results of the porting of the full potential linear augmented plane-wave solver, Elk [1],
More informationGPU Accelerated Reweighting Calculations for Neutrino Oscillation Analyses with the T2K Experiment
GPU Accelerated Reweighting Calculations for Neutrino Oscillation Analyses with the T2K Experiment Oxford Many Core Seminar - 19th February Richard Calland rcalland@hep.ph.liv.ac.uk Outline of Talk Neutrino
More informationMultiscale simulations of complex fluid rheology
Multiscale simulations of complex fluid rheology Michael P. Howard, Athanassios Z. Panagiotopoulos Department of Chemical and Biological Engineering, Princeton University Arash Nikoubashman Institute of
More informationTuning And Understanding MILC Performance In Cray XK6 GPU Clusters. Mike Showerman, Guochun Shi Steven Gottlieb
Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters Mike Showerman, Guochun Shi Steven Gottlieb Outline Background Lattice QCD and MILC GPU and Cray XK6 node architecture Implementation
More informationACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS
ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS Alan Gray, Developer Technology Engineer, NVIDIA ECMWF 18th Workshop on high performance computing in meteorology, 28 th September 2018 ESCAPE NVIDIA s
More informationProf. Brant Robertson Department of Astronomy and Astrophysics University of California, Santa
Accelerated Astrophysics: Using NVIDIA GPUs to Simulate and Understand the Universe Prof. Brant Robertson Department of Astronomy and Astrophysics University of California, Santa Cruz brant@ucsc.edu, UC
More informationParallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and
More informationAvailable online at ScienceDirect. Procedia Engineering 61 (2013 ) 94 99
Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 6 (203 ) 94 99 Parallel Computational Fluid Dynamics Conference (ParCFD203) Simulations of three-dimensional cavity flows with
More informationThe following syntax is used to describe a typical irreducible continuum element:
ELEMENT IRREDUCIBLE T7P0 command.. Synopsis The ELEMENT IRREDUCIBLE T7P0 command is used to describe all irreducible 7-node enhanced quadratic triangular continuum elements that are to be used in mechanical
More informationS Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems
S4283 - Subdivide, : Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems Elmar Westphal - Forschungszentrum Jülich GmbH 1 Contents Micromagnetism TetraMag, a FEM/BEM Micromagnetism Simulator
More informationClaude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique
Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)
More informationPractical Combustion Kinetics with CUDA
Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton Practical Combustion Kinetics with CUDA GPU Technology Conference March 20, 2015 Russell Whitesides
More informationAccelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationThe Fast Multipole Method in molecular dynamics
The Fast Multipole Method in molecular dynamics Berk Hess KTH Royal Institute of Technology, Stockholm, Sweden ADAC6 workshop Zurich, 20-06-2018 Slide BioExcel Slide Molecular Dynamics of biomolecules
More informationAn Implementation of SPELT(31, 4, 96, 96, (32, 16, 8))
An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) Tung Chou January 5, 2012 QUAD Stream cipher. Security relies on MQ (Multivariate Quadratics). QUAD The Provably-secure QUAD(q, n, r) Stream Cipher
More informationCMP N 301 Computer Architecture. Appendix C
CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)
More informationPerformance Evaluation of MPI on Weather and Hydrological Models
NCAR/RAL Performance Evaluation of MPI on Weather and Hydrological Models Alessandro Fanfarillo elfanfa@ucar.edu August 8th 2018 Cheyenne - NCAR Supercomputer Cheyenne is a 5.34-petaflops, high-performance
More informationAn Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor
An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor Christian Lessig Abstract The Algorithm of Multiple Relatively Robust Representations (MRRRR) is one of the most efficient and most
More informationMcBits: Fast code-based cryptography
McBits: Fast code-based cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands Joint work with Daniel Bernstein, Tung Chou December 17, 2013 IMA International Conference on Cryptography
More informationPERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.
More information上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose
上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU
More informationUsing a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics
Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Jorge González-Domínguez Parallel and Distributed Architectures Group Johannes Gutenberg University of Mainz, Germany j.gonzalez@uni-mainz.de
More information