Quantile Precision Issues in CUDA
Thomas Luu and William Shaw
UCL, Dec […]
Corrections to […]

Set up and Introduction

This Mathematica notebook uses the high-precision arithmetic in Mathematica and its CUDALink tools to investigate the precision of kernels for the normal quantile.

First we load our high-precision benchmark. Note that this has been verified to >24 significant figures by comparison with the Steinbrecher-Shaw analysis (EJAM 2008).

In[1]:= […] := […] u - 1]

Next we load the Mathematica CUDALink:

In[5]:= Needs["CUDALink`"]
In[6]:= CUDAQ[]
Out[6]= True
In[7]:= CUDAInformation[]
Out[7]= {1 -> {Name -> Quadro 4000, Clock Rate -> […], Compute Capabilities -> 2., GPU Overlap -> 1,
    Maximum Block Dimensions -> {1024, 1024, 64}, Maximum Grid Dimensions -> {[…], […], […]},
    Maximum Threads Per Block -> 1024, Maximum Shared Memory Per Block -> […], Total Constant Memory -> […],
    Warp Size -> 32, Maximum Pitch -> […], Maximum Registers Per Block -> 32768, Texture Alignment -> 512,
    Multiprocessor Count -> 8, Core Count -> 256, Execution Timeout -> 1, Integrated -> False,
    Can Map Host Memory -> True, Compute Mode -> Default, Texture1D Width -> […], Texture2D Width -> […],
    Texture2D Height -> […], Texture3D Width -> 2048, Texture3D Height -> 2048, Texture3D Depth -> 2048,
    Texture2D Array Width -> 16384, Texture2D Array Height -> 16384, Texture2D Array Slices -> 2048,
    Surface Alignment -> 512, Concurrent Kernels -> True, ECC Enabled -> False, TCC Enabled -> False,
    Total Memory -> […]}}

CUDAFunctions for the quantile - float mode

Here is the one built into CUDA 4:

In[20]:= kernelcuda = CUDAFunctionLoad["
    __global__ void cuda_norminvf_kernel(float *in, float *out)
    {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      float u = in[i];
      out[i] = (float)M_SQRT2 * erfinvf(2.0f*u - 1.0f);
    }", "cuda_norminvf_kernel", {{"Float"}, {"Float"}}, 512];

Here is one for the kernel of Appendix A of Shaw-Luu-Brickman 2011:
In[21]:= kernelws = CUDAFunctionLoad["
    __inline__ __device__ float ws_norminvf(float u)
    {
      float half_minus_u = 0.5f - u;
      float v, p, q;
      float one_minus_x = copysignf(2.0f*u, half_minus_u);
      if (half_minus_u < 0.0f) one_minus_x += 2.0f;
      v = -__logf(one_minus_x);
      p = […]e-4f;
      p = p*v […]f; p = p*v […]f; p = p*v […]f; p = p*v […]f; p = p*v […]f;
      q = […]e-6f;
      q = q*v […]f; q = q*v […]f; q = q*v […]f; q = q*v […]f; q = q*v […]f;
      q = q*v + 1.0f;
      return -__fdividef(p, q) * copysignf(v, half_minus_u);
    }
    __global__ void ws_norminvf_kernel(float *in, float *out)
    {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      float u = in[i];
      out[i] = ws_norminvf(u);
    }", "ws_norminvf_kernel", {{"Float"}, {"Float"}}, 512];

Here are the kernels based on the paper by Giles and the web site by Acklam, for float operation:
In[]:= kernelmg = CUDAFunctionLoad["
    __inline__ __device__ float MBG_erfinv(float x)
    {
      float w, p;
      w = -__logf((1.0f-x)*(1.0f+x));
      if (w < […]f) {
        w = w […]f;
        p = […]e-0[…]f;
        p = […]e-07f + p*w; p = […]e-06f + p*w; p = […]e-06f + p*w;
        p = […]f + p*w; p = […]f + p*w; p = […]f + p*w; p = […]f + p*w; p = […]f + p*w;
      } else {
        w = sqrtf(w) […]f;
        p = […]f;
        p = […]f + p*w; p = […]f + p*w; p = […]f + p*w; p = […]f + p*w;
        p = […]f + p*w; p = […]f + p*w; p = […]f + p*w; p = […]f + p*w;
      }
      return p*x;
    }
    __global__ void mg_norminvf(float *in, float *out)
    {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      float u = in[i];
      out[i] = (float)M_SQRT2 * MBG_erfinv(2.0f*u - 1.0f);
    }", "mg_norminvf", {{"Float"}, {"Float"}}, 512];
In[9]:= kernelacklam = CUDAFunctionLoad["
    __global__ void Acklamsingle(float *aa, float *bb)
    {
      const float a[6] = { […]e+01f, […]e+02f, […]e+02f, […]e+02f, […]e+01f, […]e+00f };
      const float b[5] = { […]e+01f, […]e+02f, […]e+02f, […]e+01f, […]e+01f };
      const float c[6] = { […]e-03f, […]e-01f, […]e+00f, […]e+00f, […]e+00f, […]e+00f };
      const float d[4] = { […]e-03f, […]e-01f, […]e+00f, […]e+00f };
      float p, q, t, u;
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      p = aa[idx];
      if (p < 1.0f-p) q = p; else q = 1.0f-p;
      if (q > […]f) {
        /* Rational approximation for central region. */
        u = q-0.5f;
        t = u*u;
        u = u*(((((a[0]*t+a[1])*t+a[2])*t+a[3])*t+a[4])*t+a[5])
             /(((((b[0]*t+b[1])*t+b[2])*t+b[3])*t+b[4])*t+1);
      } else {
        /* Rational approximation for tail region. */
        t = __fsqrt_rn(-2*__logf(q));
        u = (((((c[0]*t+c[1])*t+c[2])*t+c[3])*t+c[4])*t+c[5])
           /((((d[0]*t+d[1])*t+d[2])*t+d[3])*t+1);
      }
      /* The relative error of the approximation has absolute value less than 1.15e-9.
         One iteration of Halley's rational method (third order) gives full machine precision... */
      if (p > 0.5f) bb[idx] = -u; else bb[idx] = u;
    }", "Acklamsingle", {{"Float"}, {"Float"}}, 512];

Relative error plots in left region

setup

In[43]:= uniforms = Table[10^-i, {i, 31/100, 14, 1/100}] // Reverse // N; n = uniforms // Length
Out[44]= 1370
In[45]:= luniforms = Log[10, uniforms];
In[46]:= exact = normalQuantile[uniforms];
In[47]:= gpuUniforms = CUDAMemoryLoad[uniforms, "TargetPrecision" -> "Single"]; gpuNormals = CUDAMemoryAllocate["Float", n];
CUDA 4 built in

In[49]:= kernelcuda[gpuUniforms, gpuNormals];
    ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}], PlotRange -> {[…], 0}, Joined -> True, InterpolationOrder -> 1,
      PlotLabel -> Style["CUDA Quantile Realized Log_10 Error - Left Tail", 16, Bold], LabelStyle -> Directive[Bold, 14]]
Out[51]= [Plot: CUDA Quantile Realized Log_10 Error - Left Tail]

SLB 2011

In[52]:= kernelws[gpuUniforms, gpuNormals];
    back = CUDAMemoryGet[gpuUniforms];
    ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}], PlotRange -> {[…], 0}, Joined -> True, InterpolationOrder -> 1,
      PlotLabel -> Style["(6,6) Quantile Realized Log_10 Error - Left Tail", 16, Bold], LabelStyle -> Directive[Bold, 14]]
Out[55]= [Plot: (6,6) Quantile Realized Log_10 Error - Left Tail]
Giles

In[56]:= kernelmg[gpuUniforms, gpuNormals];
    ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}], PlotRange -> {[…], 0}, Joined -> True, InterpolationOrder -> 1,
      PlotLabel -> Style["Giles Quantile Realized Log_10 Error - Left Tail", 16, Bold], LabelStyle -> Directive[Bold, 14]]
Out[58]= [Plot: Giles Quantile Realized Log_10 Error - Left Tail]

Acklam

In[59]:= kernelacklam[gpuUniforms, gpuNormals];
    ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}], PlotRange -> {[…], 0}, Joined -> True, InterpolationOrder -> 1,
      PlotLabel -> Style["Acklam Quantile Realized Log_10 Error - Left Tail", 16, Bold], LabelStyle -> Directive[Bold, 14]]
Out[61]= [Plot: Acklam Quantile Realized Log_10 Error - Left Tail]

CUDAMemoryUnload[gpuUniforms]
CUDAMemoryUnload[gpuNormals]
Double work

In[63]:= uniforms = Table[SetPrecision[10^-i, 20], {i, 30/100, 30, 11/1000}] // Reverse; n = uniforms // Length
Out[64]= 2701
In[65]:= luniforms = Log[10, uniforms];

DP kernels

In[73]:= kernelcudadp = CUDAFunctionLoad["
    __global__ void cuda_norminv_kernel(double *in, double *out)
    {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      double u = in[i];
      out[i] = M_SQRT2 * erfinv(2.0*u - 1.0);
    }", "cuda_norminv_kernel", {{"Double"}, {"Double"}}, 512];

In[74]:= kernelas241 = CUDAFunctionLoad["
    __device__ double rpoly_value(int n, double a[], double x)
    /*
      Purpose:
        RPOLY_VALUE evaluates a double precision polynomial.
      Discussion:
        For sanity's sake, the value of N indicates the NUMBER of coefficients,
        or more precisely, the ORDER of the polynomial, rather than the DEGREE
        of the polynomial. The two quantities differ by 1, but cause a great
        deal of confusion.
        Given N and A, the form of the polynomial is:
          p(x) = a[0] + a[1] * x + ... + a[n-2] * x^(n-2) + a[n-1] * x^(n-1)
      Licensing:
        This code is distributed under the GNU LGPL license.
      Modified:
        13 August 2004
      Author:
        John Burkardt
      Parameters:
        Input, int N, the order of the polynomial.
        Input, double A[N], the coefficients of the polynomial. A[0] is the constant term.
        Input, double X, the point at which the polynomial is to be evaluated.
        Output, double RPOLY_VALUE, the value of the polynomial at X.
    */
    {
      int i;
      double value;
      value = 0.0;
      for (i = n-1; 0 <= i; i--)
      {
        value = value * x + a[i];
      }
      return value;
    }

    __global__ void AS241gpu(double *aa, double *bb)
    /*
      This GPU code adapted from JB's function (his comments reproduced here):
        double r_normal_01_cdf_inverse ( double p )
      Purpose:
        R_NORMAL_01_CDF_INVERSE inverts the standard normal CDF.
      Discussion:
        The result is accurate to about 1 part in 10**16.
      Modified:
        27 December 2004
      Author:
        Original FORTRAN77 version by Michael Wichura.
        C++ version by John Burkardt.
      Reference:
        Michael Wichura, The Percentage Points of the Normal Distribution,
        Algorithm AS 241, Applied Statistics, Volume 37, Number 3, pages […], 1988.
      Parameters:
        Input, double P, the value of the cumulative probability density function.
        0 < P < 1. If P is outside this range, an \"infinite\" value is returned.
        Output, double R_NORMAL_01_CDF_INVERSE, the normal deviate value with the
        property that the probability of a standard normal deviate being less than
        or equal to this value is P.
    */
    {
      double a[8] = { […], […]e+2, […]e+3, […]e+4, […]e+4, […]e+4, […]e+4, […]e+3 };
      double b[8] = { 1.0, […]e+1, […]e+2, […]e+3, […]e+4, […]e+4, […]e+4, […]e+3 };
      double c[8] = { […], […],
        […], […], […], […]e-1, […]e-2, […]e-4 };
      double const1 = […];
      double const2 = 1.6;
      double d[8] = { 1.0, […], […], […]e-1, […]e-1, […]e-2, […]e-4, […]e-9 };
      double e[8] = { […], […], […], […]e-1, […]e-2, […]e-3, […]e-5, […]e-7 };
      double f[8] = { 1.0, […]e-1, […]e-1, […]e-2, […]e-4, […]e-5, […]e-7, […]e-15 };
      double p, q, absq;
      double r;
      double split1 = 0.425;
      double split2 = 5.0;
      double value;
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      p = aa[idx];
      q = p - 0.5;
      if (q <= 0) absq = -q; else absq = q;
      if (absq <= split1) {
        r = const1 - q * q;
        value = q * rpoly_value(8, a, r) / rpoly_value(8, b, r);
      } else {
        if (q < 0.0) r = p; else r = 1.0 - p;
        r = sqrt(-log(r));
        if (r <= split2) {
          r = r - const2;
          value = rpoly_value(8, c, r) / rpoly_value(8, d, r);
        } else {
          r = r - split2;
          value = rpoly_value(8, e, r) / rpoly_value(8, f, r);
        }
        if (q < 0.0)
          value = -value;
      }
      bb[idx] = value;
    }", "AS241gpu", {{"Double"}, {"Double"}}, 512];

In[69]:= kernelwsexpdp = CUDAFunctionLoad["
    __global__ void ws_norminv_exp_42(double *in, double *out)
    {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      double u = in[i];
      double half_minus_u = 0.5 - u;
      double v, p, q;
      double x = copysign(2.0*u, half_minus_u);
      if (half_minus_u < 0.0) x += 2.0;
      v = -log(x);
      p = […]e-14;
      p = p*v […]e-11; p = p*v […]e-[…]; p = p*v […]e-6; p = p*v […]e-4;
      p = p*v […]e-3; p = p*v […]e-2; p = p*v […]e-1;
      p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […];
      q = […]e-13;
      q = q*v […]e[…]; q = q*v […]e-7; q = q*v […]e-5; q = q*v […]e-4;
      q = q*v […]e-2; q = q*v […]e-1;
      q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […];
      q = q*v + 1.0;
      out[i] = p / q * copysign(v, -half_minus_u);
    }", "ws_norminv_exp_42", {{"Double"}, {"Double"}}, 512];

In[70]:= kernelwsdp = CUDAFunctionLoad["
    __inline__ __device__ double ws_norminv(double u)
    {
      double u_minus_half = u - 0.5;
      double v, p, q;
      v = u_minus_half * rsqrt(__fma_rn(-u, u, u)); // (u-0.5)/sqrt(u-u^2)
      v = copysign(v, 0.0);
      if (__all(v < 15.5)) {
        // just use primary transformation
        p = […]e-[…];
        p = p*v […]e-6;
        p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […];
        p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […];
        p = p*v […]; p = p*v […]; p = p*v […];
        q = […]e-9;
        q = q*v […]e-6;
        q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […];
        q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […];
        q = q*v […]; q = q*v […]; q = q*v […];
        q = q*v + 1.0;
      } else {
        // fallback to exponential transformation
        /*
        double one_minus_x = copysign(2.0*u, -u_minus_half);
        if (u_minus_half > 0.0) one_minus_x += 2.0;
        v = -log(one_minus_x);
        */
        /* */
        double x = copysign(2.0*u, u_minus_half);
        x -= copysign(1.0, u_minus_half);
        v = -log(fma(-1.0, x, 1.0));
        p = […]e-14;
        p = p*v […]e-11; p = p*v […]e-[…]; p = p*v […]e-6; p = p*v […]e-4;
        p = p*v […]e-3; p = p*v […]e-2; p = p*v […]e-1;
        p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […]; p = p*v […];
        q = […]e-13;
        q = q*v […]e[…]; q = q*v […]e-7; q = q*v […]e-5; q = q*v […]e-4;
        q = q*v […]e-2;
        q = q*v […]e-1;
        q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […]; q = q*v […];
        q = q*v + 1.0;
      }
      return p / q * copysign(v, u_minus_half);
      return p * __drcp_rn(q) * copysign(v, u_minus_half);
    }
    __global__ void ws_norminv_kernel(double *in, double *out)
    {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      double u = in[i];
      out[i] = ws_norminv(u);
    }", "ws_norminv_kernel", {{"Double"}, {"Double"}}, 512];

Precision plots

In[87]:= uniforms = Table[SetPrecision[10^-i, 20], {i, 30/100, 20, 11/1000}] // Reverse; n = uniforms // Length
Out[88]= 1791
In[89]:= luniforms = Log[10, uniforms];
In[90]:= exact = SetPrecision[normalQuantile[uniforms], 20];
In[91]:= exact[[2]]
Out[91]= […]
In[92]:= gpuUniforms = CUDAMemoryLoad[uniforms]; gpuNormals = CUDAMemoryAllocate["Double", n];
In[94]:= exact[[2]]
Out[94]= […]
In[95]:= Log[10, 2^(-54)] // N
Out[95]= […]

AS241

In[116]:= gpuUniforms = CUDAMemoryLoad[uniforms]; gpuNormals = CUDAMemoryAllocate["Double", n];
In[118]:= kernelas241[gpuUniforms, gpuNormals];
    normals = SetPrecision[normals, 20];
    ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}], PlotRange -> {-20, 0}, Joined -> True, InterpolationOrder -> 1,
      PlotLabel -> Style["AS241 Quantile Realized Log_10 Error - Left Tail", 16, Bold], LabelStyle -> Directive[Bold, 14],
      Epilog -> Line[{{[…], -20}, {[…], 0}}]]
Out[121]= [Plot: AS241 Quantile Realized Log_10 Error - Left Tail]

CUDA 4

In[96]:= kernelcudadp[gpuUniforms, gpuNormals];
    normals = SetPrecision[normals, 20];
    ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}], PlotRange -> {-20, 0}, Joined -> True, InterpolationOrder -> 1,
      PlotLabel -> Style["CUDA Quantile Realized Log_10 Error - Left Tail", 16, Bold], LabelStyle -> Directive[Bold, 14],
      Epilog -> Line[{{[…], -20}, {[…], 0}}]]
Out[99]= [Plot: CUDA Quantile Realized Log_10 Error - Left Tail]
SLB Appendix B

In[100]:= kernelwsexpdp[gpuUniforms, gpuNormals];
    normals = SetPrecision[normals, 20];
    ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}], PlotRange -> {-20, 0}, Joined -> True, InterpolationOrder -> 1,
      PlotLabel -> Style["Branchless Quantile Realized Log_10 Error - Left Tail", 16, Bold], LabelStyle -> Directive[Bold, 14],
      Epilog -> Line[{{[…], -20}, {[…], 0}}]]
Out[103]= [Plot: Branchless Quantile Realized Log_10 Error - Left Tail]

SLB Appendix C (Student t hybrid)

In[104]:= kernelwsdp[gpuUniforms, gpuNormals];
    normals = SetPrecision[normals, 20];
    ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}], PlotRange -> {-20, 0}, Joined -> True, InterpolationOrder -> 1,
      PlotLabel -> Style["T2 Hybrid Quantile Realized Log_10 Error - Left Tail", 16, Bold], LabelStyle -> Directive[Bold, 14],
      Epilog -> Line[{{[…], -20}, {[…], 0}}]]
Out[107]= [Plot: T2 Hybrid Quantile Realized Log_10 Error - Left Tail]

Timing reminder
In double precision the timings on a Quadro 4000 for a standard batch were:

    AS241            […] ms
    CUDA             […] ms
    SLB branchless   117 ms
    SLB hybrid       933 ms

Timings on a C2050 are usually less than half of those on the Quadro 4000.
ME 406 Bifurcations VII Subcritical Hopf Bifurcation sysid Mathematica 4.1.2, DynPac 10.66, 3ê5ê2002 intreset; plotreset; 1. Introduction In this notebook, the seventh in a series of notebooks on bifurcations,