Shortest Lattice Vector Enumeration on Graphics Cards

1 Shortest Lattice Vector Enumeration on Graphics Cards
Jens Hermans (1), Michael Schneider (2), Fréderik Vercauteren (1), Johannes Buchmann (2), Bart Preneel (1)
(1) K.U.Leuven, (2) TU Darmstadt
SHARCS, 10 September 2009

3 Why GPU? (Source: MSI)

4 CUDA framework
Warning: sales talk. Your own personal supercomputer for < €500.
Nvidia CUDA framework:
Run general programs on the GPU
More complex operations, data types, branching...
Recent GPU required
Theory: 1 TFlop (practice: 200 GFlop)

5 Crypto on GPU
Current applications:
Ciphers: RSA [1], ECC [2], AES [3]
Cryptanalysis: factoring [4], brute force
Focus: high throughput, not latency
[1] Moss, Page, Smart / Szerwinski, Güneysu / Fleissner
[2] Szerwinski, Güneysu
[3] Manavski / Harrison, Waldron
[4] Bernstein, Chen, Cheng, Lange, Yang

7 Lattices
(figure: lattice with basis vectors $b_1$, $b_2$)
Basis matrix $B = \{b_1, \dots, b_n\}$ with $b_i \in \mathbb{R}^d$
Lattice: $L(B) = \{\sum_{i=1}^{n} x_i b_i,\ x_i \in \mathbb{Z}\}$
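As a small illustration of the definition above, the following sketch lists the lattice points obtained from integer coefficients in a small range. The 2-dimensional basis and the helper name `lattice_points` are made up for this example and do not come from the talk.

```python
from itertools import product

def lattice_points(basis, coeffs):
    """All points sum_i x_i * b_i with every x_i taken from `coeffs`."""
    n, d = len(basis), len(basis[0])
    return [
        tuple(sum(x[i] * basis[i][k] for i in range(n)) for k in range(d))
        for x in product(coeffs, repeat=n)
    ]

# Hypothetical example basis; the rows are b_1 and b_2.
basis = [(2, 0), (1, 3)]
points = lattice_points(basis, range(-1, 2))  # x_i in {-1, 0, 1}
```

The full lattice is infinite; restricting the coefficients to a finite range only gives a window onto it.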

8 Shortest Vector Problem (SVP)
(figure: the same lattice drawn with two different bases)
Basis not unique
Idea: good basis $B$ and bad basis $B'$
Finding $\lambda_1(L)$ is hard with $B'$
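The non-uniqueness of the basis can be checked on a toy case. In this sketch a short, orthogonal basis is multiplied by an integer matrix of determinant 1, which yields a second basis of the same lattice with much longer vectors. All matrices are hypothetical examples chosen for this illustration.

```python
def mat_mul(u, b):
    """Product of two integer matrices given as lists of rows."""
    return [[sum(u[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(u))]

def norm_sq(v):
    return sum(c * c for c in v)

good = [[2, 0], [0, 3]]        # short, orthogonal basis (rows are basis vectors)
u = [[3, 2], [4, 3]]           # integer matrix with determinant 1 (unimodular)
det_u = u[0][0] * u[1][1] - u[0][1] * u[1][0]
bad = mat_mul(u, good)         # another basis of the same lattice

# The short vector (2, 0) is hidden in the bad basis as 3*b_1 - 2*b_2.
short = [3 * bad[0][k] - 2 * bad[1][k] for k in range(2)]
```

Both bases describe the same lattice, but reading off a shortest vector is immediate only for the first one.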

9 Algorithms for SVP
Shortest vector problem: compute $\min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|_2$
SVP algorithms:
LLL (+ variants): approximate solution, polynomial
BKZ...
Enum: exact solution, exponential
⇒ This talk: focus on enum.

10 Enumeration
(figure: enumeration tree with one level per coefficient, $x_n, x_{n-1}, \dots, x_2, x_1$)
Optimum $A = \|Bx\|_2^2$ and $x = [1, 0, \dots, 0]$

11 Enumeration
Intermediate norms $l_i$ s.t. $l_i \ge l_{i+1}$ (with $l_1 = \|Bx\|_2^2$)
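The slides do not spell out the intermediate norm. In the usual Gram-Schmidt formulation of enumeration it can be written as below; the symbols $b_j^{*}$ (Gram-Schmidt vectors) and $\mu_{k,j}$ (Gram-Schmidt coefficients) are assumptions introduced here and are not named on the slides.

```latex
\[
  l_i \;=\; \sum_{j=i}^{n} \Bigl( x_j + \sum_{k=j+1}^{n} \mu_{k,j}\, x_k \Bigr)^{2}
            \lVert b_j^{*} \rVert_2^{2},
  \qquad
  \mu_{k,j} = \frac{\langle b_k, b_j^{*} \rangle}{\langle b_j^{*}, b_j^{*} \rangle}
\]
\[
  l_i \;=\; l_{i+1} + \Bigl( x_i + \sum_{k=i+1}^{n} \mu_{k,i}\, x_k \Bigr)^{2}
            \lVert b_i^{*} \rVert_2^{2} \;\ge\; l_{i+1},
  \qquad l_{n+1} = 0, \qquad l_1 = \lVert Bx \rVert_2^{2}
\]
```

Since every step down the tree adds a non-negative term, $l_i$ only depends on $x_i, \dots, x_n$ and can never decrease towards the leaves.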

12 Enumeration
New optimum $A = \|Bx\|_2^2$

13 Enumeration
Cut off a branch if $l_i > A$.
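The tree search of the preceding slides can be sketched sequentially as follows. This is a minimal floating-point illustration under stated assumptions, not the implementation from the talk: the helper names `gram_schmidt` and `enumerate_svp` and the 2-dimensional example basis are hypothetical, and the coefficients are visited in increasing order rather than in any optimized order.

```python
import math

def gram_schmidt(basis):
    """Return (mu, bstar_sq): Gram-Schmidt coefficients and squared norms ||b_i*||^2."""
    n = len(basis)
    mu = [[0.0] * n for _ in range(n)]
    bstar, bstar_sq = [], []
    for i in range(n):
        v = [float(c) for c in basis[i]]
        for j in range(i):
            mu[i][j] = sum(a * b for a, b in zip(basis[i], bstar[j])) / bstar_sq[j]
            v = [vc - mu[i][j] * bc for vc, bc in zip(v, bstar[j])]
        bstar.append(v)
        bstar_sq.append(sum(c * c for c in v))
    return mu, bstar_sq

def enumerate_svp(basis):
    """Depth-first enumeration; returns integer coefficients x of a shortest nonzero vector."""
    n = len(basis)
    mu, bstar_sq = gram_schmidt(basis)
    # Initial optimum: A = ||b_1||^2 with x = [1, 0, ..., 0].
    best = {"A": float(sum(c * c for c in basis[0])), "x": [1] + [0] * (n - 1)}
    x = [0] * n

    def descend(i, l_above):
        # l_above is the intermediate norm of the levels already fixed above level i.
        if i < 0:
            if 0.0 < l_above < best["A"]:      # new optimum (zero vector excluded)
                best["A"], best["x"] = l_above, x[:]
            return
        center = -sum(mu[k][i] * x[k] for k in range(i + 1, n))
        radius = math.sqrt(max(best["A"] - l_above, 0.0) / bstar_sq[i])
        for xi in range(math.ceil(center - radius), math.floor(center + radius) + 1):
            l_i = l_above + (xi - center) ** 2 * bstar_sq[i]
            if l_i > best["A"]:                # cut off this branch
                continue
            x[i] = xi
            descend(i - 1, l_i)
        x[i] = 0

    descend(n - 1, 0.0)
    return best["x"]

# Hypothetical example: a "bad" basis of the lattice generated by (2, 0) and (0, 3).
basis = [[6, 6], [8, 9]]
x = enumerate_svp(basis)
shortest = [sum(x[i] * basis[i][k] for i in range(len(basis))) for k in range(len(basis[0]))]
```

Levels are 0-based in the code, so the root of the tree is index `n - 1` and the leaves sit below index `0`.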

18 Processor
Nvidia GTX280:
240 cores, scalar processors
30 multiprocessors (8 cores each)
1.3 GHz
1 GB global memory
32 & 64-bit integers, FP

19 Programming model (figure; source: CUDA programming guide)

20 Memory types (figure; source: CUDA programming guide)

22 Algorithm Flow (figure: enumeration tree with levels $x_n, \dots, x_\alpha, \dots, x_1$, split at level $\alpha$)

26 Basic idea
Input: $B$, $A$, $\alpha$, $n$
Compute the Gram-Schmidt decomposition of the $b_i$
CPU: generate $x_i = [0, \dots, 0, x_\alpha, \dots, x_n]$
GPU thread: run a sub-enum on $x_i$; if a new optimum is found, store it in $x$
Output: $(x_1, \dots, x_n)$ with $\|\sum_{i=1}^{n} x_i b_i\| = \lambda_1(L)$
⇒ horrible performance
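One way to see why this split behaves badly is to measure how much work each start vector carries. The sketch below generates the start vectors on the "CPU" and counts the tree nodes below each one with a fixed bound $A$. It is an illustration under assumptions: all function names and the 4-dimensional basis are hypothetical, indices are 0-based (so `alpha = 2` means the top enumeration fixes the last two coefficients), and the bound is never updated.

```python
import math

def gram_schmidt(basis):
    """Return (mu, bstar_sq): Gram-Schmidt coefficients and squared norms ||b_i*||^2."""
    n = len(basis)
    mu = [[0.0] * n for _ in range(n)]
    bstar, bstar_sq = [], []
    for i in range(n):
        v = [float(c) for c in basis[i]]
        for j in range(i):
            mu[i][j] = sum(a * b for a, b in zip(basis[i], bstar[j])) / bstar_sq[j]
            v = [vc - mu[i][j] * bc for vc, bc in zip(v, bstar[j])]
        bstar.append(v)
        bstar_sq.append(sum(c * c for c in v))
    return mu, bstar_sq

def children(i, x, l_above, bound, mu, bstar_sq):
    """Yield (x_i, l_i) for every value of x_i at level i that keeps l_i <= bound."""
    center = -sum(mu[k][i] * x[k] for k in range(i + 1, len(x)))
    radius = math.sqrt(max(bound - l_above, 0.0) / bstar_sq[i])
    for xi in range(math.ceil(center - radius), math.floor(center + radius) + 1):
        l_i = l_above + (xi - center) ** 2 * bstar_sq[i]
        if l_i <= bound:
            yield xi, l_i

def top_enum(alpha, bound, mu, bstar_sq):
    """CPU part: all start vectors whose free coordinates are alpha..n-1 (0-based)."""
    n = len(bstar_sq)
    x, starts = [0] * n, []
    def rec(i, l_above):
        if i < alpha:
            starts.append((x[:], l_above))
            return
        for xi, l_i in children(i, x, l_above, bound, mu, bstar_sq):
            x[i] = xi
            rec(i - 1, l_i)
        x[i] = 0
    rec(n - 1, 0.0)
    return starts

def sub_enum_size(alpha, start, l_start, bound, mu, bstar_sq):
    """One 'thread': count (tree nodes, complete vectors) below a start vector."""
    x = start[:]
    def rec(i, l_above):
        if i < 0:
            return 1, 1
        nodes, leaves = 1, 0
        for xi, l_i in children(i, x, l_above, bound, mu, bstar_sq):
            x[i] = xi
            sub_nodes, sub_leaves = rec(i - 1, l_i)
            nodes, leaves = nodes + sub_nodes, leaves + sub_leaves
        x[i] = 0
        return nodes, leaves
    return rec(alpha - 1, l_start)

# Hypothetical example basis of the lattice 2Z x 3Z x 3Z x 5Z.
basis = [[2, 3, 3, 5], [0, 3, 3, 5], [0, 0, 3, 5], [0, 0, 0, 5]]
mu, bstar_sq = gram_schmidt(basis)
bound = float(sum(c * c for c in basis[0]))          # A = ||b_1||^2
starts = top_enum(2, bound, mu, bstar_sq)
sizes = [sub_enum_size(2, x, l, bound, mu, bstar_sq)[0] for x, l in starts]
```

The entries of `sizes` typically differ from one start vector to the next. On hardware where many threads run in lockstep, the threads with small subtrees then sit idle until the largest one finishes.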

30 Early termination...
Input: $B$, $A$, $\alpha$, $n$
Compute the Gram-Schmidt decomposition of the $b_i$
CPU: generate $x_i = [0, \dots, 0, x_\alpha, \dots, x_n]$
GPU thread: while there are $x_i$ left do
    Start enum for a certain $x_i = [0, \dots, 0, x_\alpha, \dots, x_n]$
    Stop enum after $S$ steps, store the state $\{l_i, x_i, s_i = S\}$
end
CPU: get enum state $x_i = [x_1, \dots, x_{\alpha-1}, x_\alpha, \dots, x_n]$
CPU: continue enum if $S$ was reached
Output: $(x_1, \dots, x_n)$ with $\|\sum_{i=1}^{n} x_i b_i\| = \lambda_1(L)$
⇒ solves the length-difference problem, but still not so good

32 Iterating
Input: $B$, $A$, $\alpha$, $n$
Compute the Gram-Schmidt decomposition of the $b_i$
while true do
    CPU: generate some $x_i = [0, \dots, 0, x_\alpha, \dots, x_n]$
    GPU thread: while there are $x_i$ left do
        Start enum for a certain $x_i$ or continue enum for $x_i$
        Stop enum after $S$ steps, store the state $\{l_i, x_i, s_i = S\}$
    end
    CPU: get enum state $x_i = [x_1, \dots, x_{\alpha-1}, x_\alpha, \dots, x_n]$
end
Output: $(x_1, \dots, x_n)$ with $\|\sum_{i=1}^{n} x_i b_i\| = \lambda_1(L)$
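The iteration scheme can be simulated on a CPU. In the sketch below each "GPU thread" is a Python generator that is advanced by at most `max_steps` steps per round, unfinished threads are kept for the next round, and freed slots are refilled with new start vectors. This is an assumption-laden simplification: on a real GPU the state $\{l_i, x_i, s_i\}$ has to be stored explicitly, whereas here the suspended generator holds it implicitly. All names (`walk`, `sub_enum`, `iterate_enum`), the example basis and the parameter values are hypothetical.

```python
import math
from itertools import islice

def gram_schmidt(basis):
    """Return (mu, bstar_sq): Gram-Schmidt coefficients and squared norms ||b_i*||^2."""
    n = len(basis)
    mu = [[0.0] * n for _ in range(n)]
    bstar, bstar_sq = [], []
    for i in range(n):
        v = [float(c) for c in basis[i]]
        for j in range(i):
            mu[i][j] = sum(a * b for a, b in zip(basis[i], bstar[j])) / bstar_sq[j]
            v = [vc - mu[i][j] * bc for vc, bc in zip(v, bstar[j])]
        bstar.append(v)
        bstar_sq.append(sum(c * c for c in v))
    return mu, bstar_sq

def walk(x, i, floor, l_above, shared, mu, bstar_sq):
    """Depth-first walk over levels i..floor (0-based). Yields None per inner node and
    (copy of x, partial norm) once a branch has passed level `floor`."""
    if i < floor:
        yield x[:], l_above
        return
    yield None
    center = -sum(mu[k][i] * x[k] for k in range(i + 1, len(x)))
    radius = math.sqrt(max(shared["A"] - l_above, 0.0) / bstar_sq[i])
    for xi in range(math.ceil(center - radius), math.floor(center + radius) + 1):
        l_i = l_above + (xi - center) ** 2 * bstar_sq[i]
        if l_i <= shared["A"]:                 # otherwise: cut off this branch
            x[i] = xi
            yield from walk(x, i - 1, floor, l_i, shared, mu, bstar_sq)
    x[i] = 0

def sub_enum(x, l_start, alpha, shared, mu, bstar_sq):
    """One simulated GPU thread; every `yield` marks one enumeration step."""
    for leaf in walk(x, alpha - 1, 0, l_start, shared, mu, bstar_sq):
        if leaf is not None:
            vec, l = leaf
            if 0.0 < l < shared["A"]:          # share the new optimum with all threads
                shared["A"], shared["x"] = l, vec
        yield

def iterate_enum(basis, alpha, num_threads, max_steps):
    n = len(basis)
    mu, bstar_sq = gram_schmidt(basis)
    shared = {"A": float(sum(c * c for c in basis[0])), "x": [1] + [0] * (n - 1)}
    top = walk([0] * n, n - 1, alpha, 0.0, shared, mu, bstar_sq)
    starts = (leaf for leaf in top if leaf is not None)   # lazy top enumeration
    running = []
    while True:
        # CPU: fill free slots with new start vectors.
        for x, l in islice(starts, num_threads - len(running)):
            running.append(sub_enum(x, l, alpha, shared, mu, bstar_sq))
        if not running:
            break
        # "GPU": advance every thread by at most max_steps steps.
        unfinished = []
        for thread in running:
            steps = sum(1 for _ in islice(thread, max_steps))
            if steps == max_steps:             # step limit reached: continue next round
                unfinished.append(thread)
        running = unfinished
    return shared["x"]

# Hypothetical example basis of the lattice 2Z x 3Z x 3Z x 5Z.
basis = [[2, 3, 3, 5], [0, 3, 3, 5], [0, 0, 3, 5], [0, 0, 0, 5]]
x = iterate_enum(basis, alpha=2, num_threads=4, max_steps=5)
shortest = [sum(x[i] * basis[i][k] for i in range(4)) for k in range(4)]
```

Because every thread does the same bounded amount of work per round, short sub-enumerations no longer wait for long ones; the cost is the bookkeeping needed to suspend and resume them.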

33 GPU Enumeration (figure: enumeration tree with levels $x_n, \dots, x_\alpha, \dots, x_1$)

38 Implementation details
Some facts & figures:
Dimension 50, starting vectors
Upload & download 20 MB of data to GPU
CPU top enum: very fast (low dimension)
GPU runs for > 10 seconds per iteration; iteration overhead is limited
Share new optimal values among GPU threads

40 Throughput
Throughput:
CPU: around steps/s
GPU: up to steps/s
Throughput on GPU depends on:
Lattice dimension n
Length of sub-enumerations
Number of parallel threads, uploaded points...

42
n
fplll          18.3s   139s   277s   2483s   6960s
CUDA           20.2s   92s    133s   959s    2599s
CUDA / fplll   110%    66%    48%    39%     37%
Table: Average time needed for enumeration of lattices in each dimension n.
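The percentage row is consistent with the CUDA time divided by the fplll time; this reading is inferred from the numbers and not stated on the slide. A short check, with the timings copied from the table and the variable names chosen for this example:

```python
# Timings in seconds, copied from the table.
fplll_s = [18.3, 139, 277, 2483, 6960]
cuda_s = [20.2, 92, 133, 959, 2599]

# CUDA time as a percentage of fplll time, and the corresponding speedup factor.
ratio_percent = [round(100 * c / f) for c, f in zip(cuda_s, fplll_s)]
speedup = [f / c for c, f in zip(cuda_s, fplll_s)]
```

Under this reading, the GPU version is slightly slower in the first column and between roughly 1.5 and 2.7 times faster in the others.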

43 Ideas for the Future
Generalize ideas (not specific to GPUs... clusters?)
Use full power of CPU (now: idle during GPU time)
Gaussian heuristic

44 The end... Questions?

45 Algorithm
Algorithm 1: High-level GPU ENUM Algorithm
Input: $b_i$, $A$, $\alpha$, $n$
Compute the Gram-Schmidt decomposition of $b_i$
while true do
    $S = \{(x_i, \Delta x_i, \Delta^2 x_i, l_i = \alpha, s_i = 0)\}_i$ ← Top enum: generate at most numstartpoints $-$ #T vectors
    $R = \{(x_i, \Delta x_i, \Delta^2 x_i, l_i, s_i)\}_i$ ← GPU enumeration, starting from $S \cup T$
    $T$ ← $\{R_i : s_i \ge S\}$
    if #T < cputhreshold then
        Enumerate the starting points in T on the CPU.
        Stop
    end
end
Output: $(x_1, \dots, x_n)$ with $\|\sum_{i=1}^{n} x_i b_i\| = \lambda_1(L)$
