
Parallel Sieve Processing on Vector Processor and GPU

Yasunori Ushiro (Earth Simulator Center)
Yoshinari Fukui (Earth Simulator Center)
Hidehiko Hasegawa (Univ. of Tsukuba)
SIAM PP, Feb. 6, 0

Background

(1) RSA cryptography is the key technology for safe Internet use.
(2) Its security rests on the fact that factoring a long-digit number n into p and q has high computational complexity and consumes enormous computation time.
(3) To guarantee security, a decryption time of more than 0 years is necessary, even using the fastest computer.

Other interests

- Different architecture for RSA (GPU and vector processor instead of PC)
- Non-floating-point number operations (almost 0 GFLOPS)
- Other usage of GPU and vector processor

RSA Cryptography

Creation of keys
(1) Choose prime numbers p, q, and e
(2) Set n = p * q and f = (p-1) * (q-1)
(3) Compute d = 1/e (mod f)

Encryption
  Compute C = M^e (mod n)
  Public keys (e, n) are used

Decryption
  Compute M = C^d (mod n); by Euler's theorem, M^f ≡ 1 (mod n)
  Secret keys (d, n) are used

Computation time of RSA-768 (232 digits)

[Table: CPU years and ratio (%) for sieve processing, matrix processing, exploration of polynomial, algebraic square root, and others. Cf. AMD64 (.GHz, Core)]

Sieve method

Factorize N based on the relationship A^2 - B^2 = (A-B)(A+B) ≡ 0 (mod N)
(1) Gather numerous A_i and B_i such that
    A_1^{l_1} A_2^{l_2} ... A_k^{l_k} ≡ B_1^{m_1} B_2^{m_2} ... B_j^{m_j} (mod N)
(2) Look for even l_i and m_i by factorization of 0-1 matrices
    Dimension: 9,796,0 * 9,79,0
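The key setup and the difference-of-squares relation above can be checked on toy numbers. The following is a minimal sketch in C, not the authors' code: the function names (modpow, modinv, gcd_u64, factor_from_squares) are hypothetical, and the arithmetic assumes moduli below 2^32 so that products fit in 64 bits. Real RSA needs multi-precision integers.

```c
#include <assert.h>
#include <stdint.h>

/* Modular exponentiation by repeated squaring: base^exp mod m.
 * Assumption: m < 2^32, so intermediate products fit in 64 bits. */
uint64_t modpow(uint64_t base, uint64_t exp, uint64_t m)
{
    assert(m > 0 && m < (1ULL << 32));
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

/* d = 1/e (mod f) by the extended Euclidean algorithm.
 * Returns 0 when e has no inverse modulo f. */
uint64_t modinv(uint64_t e, uint64_t f)
{
    assert(f > 0);
    int64_t r0 = (int64_t)f, r1 = (int64_t)(e % f);
    int64_t t0 = 0, t1 = 1;
    while (r1 != 0) {
        int64_t q = r0 / r1;
        int64_t r2 = r0 - q * r1; r0 = r1; r1 = r2;
        int64_t t2 = t0 - q * t1; t0 = t1; t1 = t2;
    }
    if (r0 != 1)
        return 0;
    return (uint64_t)(t0 < 0 ? t0 + (int64_t)f : t0);
}

/* Greatest common divisor by Euclid's algorithm. */
uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* If A^2 ≡ B^2 (mod N), then N divides (A-B)(A+B), so gcd(|A-B|, N)
 * is a factor of N (it may be the trivial factor 1 or N). */
uint64_t factor_from_squares(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t diff = (a >= b) ? a - b : b - a;
    return gcd_u64(diff % n, n);
}
```

With p = 61, q = 53, and e = 17, this gives n = 3233, f = 3120, and d = modinv(17, 3120) = 2753; the message M = 65 encrypts to C = 2790 and decrypts back to 65. For the sieve relation, 90^2 - 7^2 = 8051, and factor_from_squares(90, 7, 8051) returns 83, with 8051 = 83 * 97.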

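Steps of Sieve Processing

Columns on the slide: Comp. (data type), Iteration, Size

1. Set number (Long, 0^{00})
2. Compute base (int, 0^8)
3. Choose prime number Q_l and f_l(x) (int, 0^)
4. Process for each l (0^)
   4.1 Repeat the sub-steps below (0^{})/LP times
   4.2 Compute PS[i], V[i] = 0
   4.3 Sieve processing
   Comp. entries for the sub-steps: Double, int, int64
   Size entries for the sub-steps: LP, LP, N*LP/p

p: harmonic mean of the primes in the base
LP: 10^6 for PC, 10^8 for GPU, 10^9 for ES

[Figure: sieve array divided into windows of size LP; each of the N primes p takes LP/p steps per window]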
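The size entry N*LP/p for sieve processing can be checked by counting additions directly. The sketch below assumes that p in that entry is the harmonic mean of the N primes in the base, as the slide's note suggests, and that every prime starts at offset 0 in the window. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Harmonic mean of the n primes in the base: n / sum(1 / p_k). */
double harmonic_mean(const size_t *prime, size_t n)
{
    double inv_sum = 0.0;
    assert(n > 0);
    for (size_t k = 0; k < n; k++) {
        assert(prime[k] > 0);
        inv_sum += 1.0 / (double)prime[k];
    }
    return (double)n / inv_sum;
}

/* Count the additions performed on one sieve window of size lp when
 * every prime starts at offset 0 (a simplifying assumption). */
size_t count_sieve_updates(const size_t *prime, size_t n, size_t lp)
{
    size_t updates = 0;
    for (size_t k = 0; k < n; k++) {
        assert(prime[k] > 0);
        for (size_t i = 0; i < lp; i += prime[k])
            updates++;
    }
    return updates;
}
```

For the toy base {2, 3, 5, 7} and LP = 210, the direct count is 105 + 70 + 42 + 30 = 247 additions, and N*LP/p with the harmonic mean p = 840/247 gives the same value. The work per window therefore grows with the number of primes and shrinks as the primes get larger.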

Kernel of Sieve

    for (k = 0; k < N; k++)            /* N: # of primes in base */
    {
        for (i = Start[k]; i < LP; i += Prime[k])
        {
            V[i] += LogP[k];           /* LP: size of sieve */
        }
    }
    for (i = 0; i < LP; i++)           /* pick up sieved data */
    {
        if (V[i] >= PS[i])
        {
            Sieve[No] = Pointer + i;
            No++;                      /* No: # of sieved data */
        }
    }

Update Start[0]~Start[N-1] for the next sieve.

Tuning for Sieve Processing

The base must be stored in fast memory.
- PC (cache): shorter LP (LP is the size of the sieve), 10^6
- Vector processor ES: LP=10^9, 10^3 times larger than that of the PC. Picking up sieved data is slow (an if exists in a loop).
- GPU (GTX480): LP=10^8, 10^2 times larger than that of the PC. Memory access is discontinuous (the stride varies).

Tuning for Vector Processor ES

- Picking-up ratio is 0^{-6} ~0^{-0}
- Compute the maximum value in a block instead of picking up
- Modification: if the maximum value > value_b, then perform the picking-up process
- Block size is 64K (65,536)
- This results in times faster

Tuning for GPU

- Each thread has a smaller LP (LP of PC * 000 / 0480 threads)
- A larger Prime[k] gives a small hit ratio for: if (ii < LP) V[ii] += LogP[k];
- N is used for the loop length instead of LP
- V[ii] += LogP[k]; can give an incorrect result when threads update the same element concurrently
- Sieve processing can permit a small error (loss of pick-up is OK up to 0^{-})
- Three types of parallelization are used, based on the size of the prime numbers in the base

GPU program (before)

    no = gn * bn;                          /* bn=, gn=40 */
    for (k = 0; k < LP; k += no)           /* LP=0*04 */
    {
        i = (bn * blockIdx.x + threadIdx.x) + k;
        V[i] = 0;
    }
    __syncthreads();
    for (k = 0; k < N; k++)                /* 0,480 threads in LP */
    {
        for (i = Start[k]; i < LP; i += Prime[k] * no)
        {
            ii = (bn * blockIdx.x + threadIdx.x) * Prime[k] + i;
            if (ii < LP) V[ii] += LogP[k];
        }
        __syncthreads();
    }

GPU program (after)

    /* N1 and N2 mark the boundaries between the three prime ranges;
       the subscripts are reconstructed, the slide shows only "n". */
    for (k = 0; k < N1; k++)               /* sieve for ~0(K)-th primes */
    {
        for (i = Start[k]; i < LP; i += Prime[k] * gn * bn)
        {
            ii = (bn * blockIdx.x + threadIdx.x) * Prime[k] + i;
            if (ii < LP) V[ii] += LogP[k];
            __syncthreads();
        }
    }
    for (k = N1; k < N2; k += gn)          /* K+~40K-th primes */
    {
        kk = blockIdx.x + k;               /* one prime per block */
        for (i = Start[kk]; i < LP; i += Prime[kk] * bn)
        {
            ii = threadIdx.x * Prime[kk] + i;
            if (ii < LP) V[ii] += LogP[kk];
            __syncthreads();
        }
    }
    for (k = N2; k < N; k += gn * bn)      /* over (40K+)-th primes */
    {
        kk = (bn * blockIdx.x + threadIdx.x) + k;   /* one prime per thread */
        for (i = Start[kk]; i < LP; i += Prime[kk])
        {
            if (i < LP) V[i] += LogP[kk];
            __syncthreads();
        }
    }
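The vector-processor tuning described above (take the maximum of a block first, and pick up only when that maximum reaches a threshold) can be sketched in plain C. This is an illustration under assumptions, not the authors' FORTRAN code: the function names are hypothetical, value_b is assumed to be a lower bound on every PS[i], the comparison uses >= to match the pick-up test, and positions are stored without the Pointer offset.

```c
#include <assert.h>
#include <stddef.h>

/* Sieve kernel: add logp[k] at every hit of prime[k] in the window. */
void sieve_add(int *V, size_t lp, const size_t *start,
               const size_t *prime, const int *logp, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        assert(prime[k] > 0);
        for (size_t i = start[k]; i < lp; i += prime[k])
            V[i] += logp[k];
    }
}

/* Pick up with a per-block pre-check.
 * Assumption: value_b <= PS[i] for all i, so a block whose maximum is
 * below value_b cannot contain a hit and is skipped as a whole.
 * Returns the number of positions written to out. */
size_t pick_up_blocked(const int *V, const int *PS, size_t lp,
                       int value_b, size_t block, size_t *out)
{
    size_t no = 0;
    assert(block > 0);
    for (size_t b = 0; b < lp; b += block) {
        size_t end = (b + block < lp) ? b + block : lp;
        int maxv = V[b];
        for (size_t i = b + 1; i < end; i++)    /* max reduction, vectorizable */
            maxv = (V[i] > maxv) ? V[i] : maxv;
        if (maxv < value_b)
            continue;                           /* no candidate in this block */
        for (size_t i = b; i < end; i++)        /* rare: scan the block */
            if (V[i] >= PS[i])
                out[no++] = i;
    }
    return no;
}
```

Because hits are rare, most blocks are rejected by the max reduction alone, and the loop containing the if runs only on the few blocks that hold a candidate. With a block size of 65,536, as on the slide, the reduction loop is long enough to vectorize well.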

Change of Result on Modified Program

60 digits, LP=0*04, prime numbers N=0,000*04, 000 iterations
M: number of sieved data
(1) Original: M=4484
(2) After parallelization (times): M=[4488, 44867], mean=4480, largest difference = 87 (0.0%)

Conditions

- PC: Core of Dell Vostro 00, Intel Core, .GHz, GB; Windows Vista, gcc, -O
- Vector processor: ES node (8 CPU), .GHz: 89Gflops, 8GB; SUPER-UX, auto-parallel FORTRAN+MPI
- GPU: NVIDIA GTX80, .44GHz, .GB; Unix, CUDA., -O

[Figure: Time of Sieve Processing, 60 digits. Computation time (hours) versus prime numbers in base (*0^6) for PC, GPU, and ES ( node)]

[Figure: Ratio of Sieve Processing, 60 digits. Ratios PC/ES, PC/GPU, and GPU/ES versus prime numbers in base (0^6)]

[Figure: Dependency of Base size. Ratios PC/ES and PC/GPU versus prime numbers in base]

ES (Vector Processor)
Parallelization is simple and easy; however, a special treatment is needed for storing sieved data.

GPU
Fine-grained parallelization is needed, and that creates a data dependency for sieved data. By tolerating some data inconsistencies, we chose forced parallelization.

Summary of Sieve Processing

- Almost all operations are additions of bits integers with different strides
- 99.9% are vectorizable and easily MPI-parallelizable
- Speed depends on the size of fast memory
- One node of ES is times faster than PC (checking before pick-up makes it times faster)
- GPU (GTX80) is 60 times faster than PC (forced parallelization has 0.0% loss of sieved data; types of parallelization are used based on the magnitude of primes)
- High-speed range of base is ES >> GPU >> PC

Forecast of RSA-768 (232 digits)

[Table: ratio (%) and time (node years) on CPU, GPU, and ES for sieve processing, matrices, polynomials, algebraic square root, and others, with a prediction]

Guess of RSA-1024 (309 digits)

[Table: ratio (%) and time (node years) on CPU, GPU, and ES for sieve processing, matrices, polynomials, algebraic square root, and others. Surviving entries: sieve processing 8, 6*0, 0^; matrices 9, 4*]

Guess is based on Sieve 0, 0- Mat. 0, Others 0
