Quantile Precision Issues in CUDA

Thomas Luu and William Shaw
UCL, Dec [...]
Corrections to [...]

Set up and Introduction

This Mathematica notebook uses the high-precision arithmetic in Mathematica and its CUDALink tools to investigate the precision of kernels for the normal quantile. First we load our high-precision benchmark. Note that this has been verified to >24 significant figures by comparison with the Steinbrecher-Shaw analysis (EJAM 200).

In[1]:= [...] := [...] u - 1]

Next we load the Mathematica CUDALink:

In[5]:= Needs["CUDALink`"]
In[6]:= CUDAQ[]
Out[6]= True
In[7]:= CUDAInformation[]
Out[7]= {1 -> {Name -> Quadro 4000,
    Clock Rate -> ,
    Compute Capabilities -> 2.,
    GPU Overlap -> 1,
    Maximum Block Dimensions -> {1024, 1024, 64},
    Maximum Grid Dimensions -> { , , },
    Maximum Threads Per Block -> 1024,
    Maximum Shared Memory Per Block -> ,
    Total Constant Memory -> ,
    Warp Size -> 32,
    Maximum Pitch -> ,
    Maximum Registers Per Block -> 32 76,
    Texture Alignment -> 512,
    Multiprocessor Count -> ,
    Core Count -> 256,
    Execution Timeout -> 1,
    Integrated -> False,
    Can Map Host Memory -> True,
    Compute Mode -> Default,
    Texture1D Width -> ,
    Texture2D Width -> ,
    Texture2D Height -> ,
    Texture3D Width -> 204,
    Texture3D Height -> 204,
    Texture3D Depth -> 204,
    Texture2D Array Width -> 16 34,
    Texture2D Array Height -> 16 34,
    Texture2D Array Slices -> 204,
    Surface Alignment -> 512,
    Concurrent Kernels -> True,
    ECC Enabled -> False,
    TCC Enabled -> False,
    Total Memory -> }}

CUDAFunctions for the quantile - float mode

Here is the one built into CUDA 4:

In[20]:= kernelcuda = CUDAFunctionLoad["
    __global__ void cuda_norminvf_kernel(float *in, float *out)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        float u = in[i];
        out[i] = (float)M_SQRT2 * erfinvf(2.0f*u - 1.0f);
    }
    ", "cuda_norminvf_kernel", {{"Float"}, {"Float"}}, 512];

Here is one for the kernel of Appendix A of Shaw-Luu-Brickman 2011:

QuantilePrecisionInCUDA.nb

In[21]:= kernelws = CUDAFunctionLoad["
    inline __device__ float ws_norminvf(float u)
    {
        float half_minus_u = 0.5f - u;
        float v, p, q;
        float one_minus_x = copysignf(2.0f*u, half_minus_u);
        if (half_minus_u < 0.0f) one_minus_x += 2.0f;
        v = - logf(one_minus_x);
        p = [...]e-4f;
        p = p*v [...]f;  p = p*v [...]f;  p = p*v [...]f;  p = p*v [...]f;  p = p*v [...]f;
        q = [...]e-6f;
        q = q*v [...]f;  q = q*v [...]f;  q = q*v [...]f;  q = q*v [...]f;  q = q*v [...]f;
        q = q*v + 1.0f;
        return - fdividef(p, q) * copysignf(v, half_minus_u);
    }
    __global__ void ws_norminvf_kernel(float *in, float *out)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        float u = in[i];
        out[i] = ws_norminvf(u);
    }
    ", "ws_norminvf_kernel", {{"Float"}, {"Float"}}, 512];

Here are the kernels based on the paper by Giles and the web site by Acklam, for float operation:

In[]:= kernelmg = CUDAFunctionLoad["
    inline __device__ float MBG_erfinv(float x)
    {
        float w, p;
        w = - logf((1.0f-x)*(1.0f+x));
        if ( w < [...]f ) {
            w = w [...]f;
            p = [...]e-0f;
            p = [...]e-07f + p*w;
            p = [...]e-06f + p*w;
            p = [...]e-06f + p*w;
            p = [...]f + p*w;  p = [...]f + p*w;  p = [...]f + p*w;
            p = [...]f + p*w;  p = [...]f + p*w;
        }
        else {
            w = sqrtf(w) [...]f;
            p = [...]f;
            p = [...]f + p*w;  p = [...]f + p*w;  p = [...]f + p*w;  p = [...]f + p*w;
            p = [...]f + p*w;  p = [...]f + p*w;  p = [...]f + p*w;  p = [...]f + p*w;
        }
        return p*x;
    }
    __global__ void mg_norminvf(float *in, float *out)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        float u = in[i];
        out[i] = (float)M_SQRT2 * MBG_erfinv(2.0f*u - 1.0f);
    }
    ", "mg_norminvf", {{"Float"}, {"Float"}}, 512];

In[9]:= kernelacklam = CUDAFunctionLoad["
    __global__ void Acklamsingle(float * aa, float * bb)
    {
        const float a[6] = { [...]e+01f, [...]e+02f, [...]e+02f, [...]e+02f, [...]e+01f, [...]e+00f };
        const float b[5] = { [...]e+01f, [...]e+02f, [...]e+02f, [...]e+01f, [...]e+01f };
        const float c[6] = { [...]e-03f, [...]e-01f, [...]e+00f, [...]e+00f, [...]e+00f, [...]e+00f };
        const float d[4] = { [...]e-03f, [...]e-01f, [...]e+00f, [...]e+00f };
        float p, q, t, u;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        p = aa[idx];
        if (p < 1.0f-p) q = p; else q = 1.0f-p;
        if (q > [...]f) {
            /* Rational approximation for central region. */
            u = q-0.5f;
            t = u*u;
            u = u*(((((a[0]*t+a[1])*t+a[2])*t+a[3])*t+a[4])*t+a[5])
                 /(((((b[0]*t+b[1])*t+b[2])*t+b[3])*t+b[4])*t+1);
        }
        else {
            /* Rational approximation for tail region. */
            t = __fsqrt_rn(-2* logf(q));
            u = (((((c[0]*t+c[1])*t+c[2])*t+c[3])*t+c[4])*t+c[5])
                /((((d[0]*t+d[1])*t+d[2])*t+d[3])*t+1);
        }
        /* The relative error of the approximation has absolute value less
           than 1.15e-9. One iteration of Halley's rational method
           (third order) gives full machine precision... */
        if (p > 0.5f) bb[idx] = -u; else bb[idx] = u;
    }
    ", "Acklamsingle", {{"Float"}, {"Float"}}, 512];

Relative error plots in left region

setup

In[43]:= uniforms = Table[10^-i, {i, 31/100, 14, 1/100}] // Reverse // N;
         n = uniforms // Length
Out[44]= 1370
In[45]:= luniforms = Log[10, uniforms];
In[46]:= exact = normalQuantile[uniforms];
In[47]:= gpuUniforms = CUDAMemoryLoad[uniforms, "TargetPrecision" -> "Single"];
         gpuNormals = CUDAMemoryAllocate["Float", n];

CUDA 4 built in

In[49]:= kernelcuda[gpuUniforms, gpuNormals];
         ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}],
           PlotRange -> {[...], 0}, Joined -> True, InterpolationOrder -> 1,
           PlotLabel -> Style["CUDA Quantile Realized Log_10 Error - Left Tail", 16, Bold],
           LabelStyle -> Directive[Bold, 14]]
Out[51]= [Plot: CUDA Quantile Realized Log_10 Error - Left Tail]

SLB 2011

In[52]:= kernelws[gpuUniforms, gpuNormals];
         back = CUDAMemoryGet[gpuUniforms];
         ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}],
           PlotRange -> {[...], 0}, Joined -> True, InterpolationOrder -> 1,
           PlotLabel -> Style["(6,6) Quantile Realized Log_10 Error - Left Tail", 16, Bold],
           LabelStyle -> Directive[Bold, 14]]
Out[55]= [Plot: (6,6) Quantile Realized Log_10 Error - Left Tail]

Giles

In[56]:= kernelmg[gpuUniforms, gpuNormals];
         ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}],
           PlotRange -> {[...], 0}, Joined -> True, InterpolationOrder -> 1,
           PlotLabel -> Style["Giles Quantile Realized Log_10 Error - Left Tail", 16, Bold],
           LabelStyle -> Directive[Bold, 14]]
Out[5]= [Plot: Giles Quantile Realized Log_10 Error - Left Tail]

Acklam

In[59]:= kernelacklam[gpuUniforms, gpuNormals];
         ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}],
           PlotRange -> {[...], 0}, Joined -> True, InterpolationOrder -> 1,
           PlotLabel -> Style["Acklam Quantile Realized Log_10 Error - Left Tail", 16, Bold],
           LabelStyle -> Directive[Bold, 14]]
Out[61]= [Plot: Acklam Quantile Realized Log_10 Error - Left Tail]

CUDAMemoryUnload[gpuUniforms]
CUDAMemoryUnload[gpuNormals]

Double work

In[63]:= uniforms = Table[SetPrecision[10^-i, 20], {i, 30/100, 30, 11/1000}] // Reverse;
         n = uniforms // Length
Out[64]= 2701
In[65]:= luniforms = Log[10, uniforms];

DP kernels

In[73]:= kernelcudadp = CUDAFunctionLoad["
    __global__ void cuda_norminv_kernel(double *in, double *out)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        double u = in[i];
        out[i] = M_SQRT2 * erfinv(2.0*u - 1.0);
    }
    ", "cuda_norminv_kernel", {{"Double"}, {"Double"}}, 512];

In[74]:= kernelas241 = CUDAFunctionLoad["
    __device__ double rpoly_value ( int n, double a[], double x )
    /****************************************************************************
      Purpose:
        RPOLY_VALUE evaluates a double precision polynomial.
      Discussion:
        For sanity's sake, the value of N indicates the NUMBER of
        coefficients, or more precisely, the ORDER of the polynomial,
        rather than the DEGREE of the polynomial. The two quantities
        differ by 1, but cause a great deal of confusion.
        Given N and A, the form of the polynomial is:
          p(x) = a[0] + a[1] * x + ... + a[n-2] * x^(n-2) + a[n-1] * x^(n-1)
      Licensing:
        This code is distributed under the GNU LGPL license.
      Modified:
        13 August 2004
      Author:
        John Burkardt
      Parameters:
        Input, int N, the order of the polynomial.
        Input, double A[N], the coefficients of the polynomial.
        A[0] is the constant term.
        Input, double X, the point at which the polynomial is to be evaluated.
        Output, double RPOLY_VALUE, the value of the polynomial at X.
    */
    {
        int i;
        double value;
        value = 0.0;

        for ( i = n-1; 0 <= i; i-- )
        {
            value = value * x + a[i];
        }
        return value;
    }
    __global__ void AS241gpu(double * aa, double * bb)
    /*
      This GPU code adapted from JB's function: (his comments reproduced here)
        double r_normal_01_cdf_inverse ( double p )
      Purpose:
        R_NORMAL_01_CDF_INVERSE inverts the standard normal CDF.
      Discussion:
        The result is accurate to about 1 part in 10**16.
      Modified:
        27 December 2004
      Author:
        Original FORTRAN77 version by Michael Wichura.
        C++ version by John Burkardt.
      Reference:
        Michael Wichura,
        The Percentage Points of the Normal Distribution,
        Algorithm AS 241,
        Applied Statistics,
        Volume 37, Number 3, pages [...], 19[...].
      Parameters:
        Input, double P, the value of the cumulative probability
        density function. 0 < P < 1. If P is outside this range, an
        \"infinite\" value is returned.
        Output, double R_NORMAL_01_CDF_INVERSE, the normal deviate value
        with the property that the probability of a standard normal
        deviate being less than or equal to this value is P.
    */
    {
        double a[] = { [...], [...]e+2, [...]e+3, [...]e+4, [...]e+4, [...]e+4, [...]e+4, [...]e+3 };
        double b[] = { 1.0, [...]e+1, [...]e+2, [...]e+3, [...]e+4, [...]e+4, [...]e+4, [...]e+3 };
        double c[] = { [...], [...],

                       [...], [...], [...], [...]e-1, [...]e-2, [...]e-4 };
        double const1 = [...];
        double const2 = 1.6;
        double d[] = { 1.0, [...], [...], [...]e-1, [...]e-1, [...]e-2, [...]e-4, [...]e-9 };
        double e[] = { [...], [...], [...], [...]e-1, [...]e-2, [...]e-3, [...]e-5, [...]e-7 };
        double f[] = { 1.0, [...]e-1, [...]e-1, [...]e-2, [...]e-4, [...]e-5, [...]e-7, [...]e-15 };
        double p, q, absq;
        double r;
        double split1 = 0.425;
        double split2 = 5.0;
        double value;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        p = aa[idx];
        q = p - 0.5;
        if ( q <= 0 ) absq = -q; else absq = q;
        if ( absq <= split1 )
        {
            /* Central region: rational function in r = const1 - q^2. */
            r = const1 - q * q;
            value = q * rpoly_value ( 8, a, r ) / rpoly_value ( 8, b, r );
        }
        else
        {
            /* Tails: work with the smaller of p and 1 - p. */
            if ( q < 0.0 ) r = p; else r = 1.0 - p;
            r = sqrt ( -log ( r ) );
            if ( r <= split2 )
            {
                r = r - const2;
                value = rpoly_value ( 8, c, r ) / rpoly_value ( 8, d, r );
            }
            else
            {
                r = r - split2;
                value = rpoly_value ( 8, e, r ) / rpoly_value ( 8, f, r );
            }
            if ( q < 0.0 )

                value = -value;
        }
        bb[idx] = value;
    }
    ", "AS241gpu", {{"Double"}, {"Double"}}, 512];

In[69]:= kernelwsexpdp = CUDAFunctionLoad["
    __global__ void ws_norminv_exp_42(double *in, double *out)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        double u = in[i];
        double half_minus_u = 0.5 - u;
        double v, p, q;
        double x = copysign(2.0*u, half_minus_u);
        if (half_minus_u < 0.0) x += 2.0;
        v = -log(x);
        p = [...]e-14;
        p = p*v [...]e-11;  p = p*v [...]e-[...];  p = p*v [...]e-6;  p = p*v [...]e-4;
        p = p*v [...]e-3;   p = p*v [...]e-2;      p = p*v [...]e-1;
        p = p*v [...];  p = p*v [...];  p = p*v [...];
        p = p*v [...];  p = p*v [...];  p = p*v [...];
        q = [...]e-13;
        q = q*v [...]e[...];  q = q*v [...]e-7;  q = q*v [...]e-5;  q = q*v [...]e-4;
        q = q*v [...]e-2;     q = q*v [...]e-1;
        q = q*v [...];  q = q*v [...];  q = q*v [...];
        q = q*v [...];  q = q*v [...];  q = q*v [...];
        q = q*v + 1.0;
        out[i] = p / q * copysign(v, -half_minus_u);
    }
    ", "ws_norminv_exp_42", {{"Double"}, {"Double"}}, 512];

In[70]:= kernelwsdp = CUDAFunctionLoad["
    inline __device__ double ws_norminv(double u)
    {
        double u_minus_half = u - 0.5;
        double v, p, q;
        v = u_minus_half * rsqrt( __fma_rn(-u, u, u));  // (u-0.5)/sqrt(u-u^2)
        v = copysign(v, 0.0);
        if ( __all(v < 15.5))
        {
            // just use primary transformation
            p = [...]e-[...];

            p = p*v [...]e-6;
            p = p*v [...];  p = p*v [...];  p = p*v [...];  p = p*v [...];  p = p*v [...];
            p = p*v [...];  p = p*v [...];  p = p*v [...];  p = p*v [...];  p = p*v [...];
            p = p*v [...];  p = p*v [...];  p = p*v [...];
            q = [...]e-9;
            q = q*v [...]e-6;
            q = q*v [...];  q = q*v [...];  q = q*v [...];  q = q*v [...];  q = q*v [...];
            q = q*v [...];  q = q*v [...];  q = q*v [...];  q = q*v [...];  q = q*v [...];
            q = q*v [...];  q = q*v [...];  q = q*v [...];
            q = q*v + 1.0;
        }
        else
        {
            // fallback to exponential transformation
            /*
            double one_minus_x = copysign(2.0*u, -u_minus_half);
            if (u_minus_half > 0.0) one_minus_x += 2.0;
            v = -log(one_minus_x);
            */
            /* */
            double x = copysign(2.0*u, u_minus_half);
            x -= copysign(1.0, u_minus_half);
            v = -log(fma(-1.0, x, 1.0));
            p = [...]e-14;
            p = p*v [...]e-11;  p = p*v [...]e-[...];  p = p*v [...]e-6;  p = p*v [...]e-4;
            p = p*v [...]e-3;   p = p*v [...]e-2;      p = p*v [...]e-1;
            p = p*v [...];  p = p*v [...];  p = p*v [...];
            p = p*v [...];  p = p*v [...];  p = p*v [...];
            q = [...]e-13;
            q = q*v [...]e[...];  q = q*v [...]e-7;  q = q*v [...]e-5;  q = q*v [...]e-4;
            q = q*v [...]e-2;

            q = q*v [...]e-1;
            q = q*v [...];  q = q*v [...];  q = q*v [...];
            q = q*v [...];  q = q*v [...];  q = q*v [...];
            q = q*v + 1.0;
        }
        // return p / q * copysign(v, u_minus_half);
        return p * __drcp_rn(q) * copysign(v, u_minus_half);
    }
    __global__ void ws_norminv_kernel(double *in, double *out)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        double u = in[i];
        out[i] = ws_norminv(u);
    }
    ", "ws_norminv_kernel", {{"Double"}, {"Double"}}, 512];

Precision plots

In[7]:= uniforms = Table[SetPrecision[10^-i, 20], {i, 30/100, 20, 11/1000}] // Reverse;
        n = uniforms // Length
Out[]= 1791
In[9]:= luniforms = Log[10, uniforms];
In[90]:= exact = SetPrecision[normalQuantile[uniforms], 20];
In[91]:= exact[[2]]
Out[91]= [...]
In[92]:= gpuUniforms = CUDAMemoryLoad[uniforms];
         gpuNormals = CUDAMemoryAllocate["Double", n];
In[94]:= exact[[2]]
Out[94]= [...]
In[95]:= Log[10, 2^(-54)] // N
Out[95]= [...]

AS241

In[116]:= gpuUniforms = CUDAMemoryLoad[uniforms];
          gpuNormals = CUDAMemoryAllocate["Double", n];

In[11]:= kernelas241[gpuUniforms, gpuNormals];
         normals = SetPrecision[normals, 20];
         ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}],
           PlotRange -> {-20, 0}, Joined -> True, InterpolationOrder -> 1,
           PlotLabel -> Style["AS241 Quantile Realized Log_10 Error - Left Tail", 16, Bold],
           LabelStyle -> Directive[Bold, 14],
           Epilog -> Line[{{[...], -20}, {[...], 0}}]]
Out[121]= [Plot: AS241 Quantile Realized Log_10 Error - Left Tail]

CUDA 4

In[96]:= kernelcudadp[gpuUniforms, gpuNormals];
         normals = SetPrecision[normals, 20];
         ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}],
           PlotRange -> {-20, 0}, Joined -> True, InterpolationOrder -> 1,
           PlotLabel -> Style["CUDA Quantile Realized Log_10 Error - Left Tail", 16, Bold],
           LabelStyle -> Directive[Bold, 14],
           Epilog -> Line[{{[...], -20}, {[...], 0}}]]
Out[99]= [Plot: CUDA Quantile Realized Log_10 Error - Left Tail]

SLB Appendix B

In[100]:= kernelwsexpdp[gpuUniforms, gpuNormals];
          normals = SetPrecision[normals, 20];
          ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}],
            PlotRange -> {-20, 0}, Joined -> True, InterpolationOrder -> 1,
            PlotLabel -> Style["Branchless Quantile Realized Log_10 Error - Left Tail", 16, Bold],
            LabelStyle -> Directive[Bold, 14],
            Epilog -> Line[{{[...], -20}, {[...], 0}}]]
Out[103]= [Plot: Branchless Quantile Realized Log_10 Error - Left Tail]

SLB Appendix C (Student t hybrid)

In[104]:= kernelwsdp[gpuUniforms, gpuNormals];
          normals = SetPrecision[normals, 20];
          ListPlot[Transpose[{luniforms, Log[10, Abs[normals/exact - 1]]}],
            PlotRange -> {-20, 0}, Joined -> True, InterpolationOrder -> 1,
            PlotLabel -> Style["T2 Hybrid Quantile Realized Log_10 Error - Left Tail", 16, Bold],
            LabelStyle -> Directive[Bold, 14],
            Epilog -> Line[{{[...], -20}, {[...], 0}}]]
Out[107]= [Plot: T2 Hybrid Quantile Realized Log_10 Error - Left Tail]

Timing reminder

In double precision the timings on a Quadro 4000 for a standard batch were:

AS241            [...] ms
CUDA             [...] ms
SLB branchless   117 ms
SLB hybrid       933 ms

Timings on a C2050 are usually less than half of those on the Quadro 4000.
