Exact Linear Algebra Algorithmic: Theory and Practice

1 Exact Linear Algebra Algorithmic: Theory and Practice
ISSAC'15 Tutorial
Clément Pernet
Université Grenoble Alpes, Inria, LIP-AriC
July 6, 2015
C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

2 Exact linear algebra

Matrices can be:
- Dense: store all coefficients
- Sparse: store the non-zero coefficients only
- Black-box: no access to the storage, only apply to a vector

Coefficient domains:
- Word size: integers with a priori bounds, Z/pZ for p of 32 bits
- Multi-precision: Z/pZ for p of 100, 200, 1000, 2000, ... bits
- Arbitrary precision: Z, Q
- Polynomials: K[X] for K any of the above

Several implementations for the same domain: one may better fit FFT, another linear algebra, etc.

Need to structure the design.

6 Exact linear algebra: motivations

- Computational Number Theory: CharPoly, LinSys, Echelon, over Z, Q, Z/pZ; Dense
- Graph Theory: MatMul, CharPoly, Det, over Z; Sparse
- Discrete logarithm: LinSys, over Z/pZ, p ~ 120 bits; Sparse
- Integer Factorization: NullSpace, over Z/2Z; Sparse
- Algebraic Attacks: Echelon, LinSys, over Z/pZ, p ~ 20 bits; Sparse & Dense
- List decoding of RS codes: Lattice reduction, over GF(q)[X]; Structured

Need for high performance.

8 Content

The scope of this presentation:
- not an exhaustive overview of linear algebra algorithmics and complexity improvements
- a few guidelines for the use and design of exact linear algebra in practice
- bridging theoretical algorithmic developments and practical efficiency concerns

9 Outline

1 Choosing the underlying arithmetic
  Using boolean arithmetic
  Using machine word arithmetic
  Larger field sizes
2 Reductions and building blocks
  In dense linear algebra
  In blackbox linear algebra
3 Size dimension trade-offs
  Hermite normal form
  Frobenius normal form
4 Parallel exact linear algebra
  Ingredients for the parallelization
  Parallel dense linear algebra mod p

10 Choosing the underlying arithmetic

Outline: 1 Choosing the underlying arithmetic — Using boolean arithmetic; Using machine word arithmetic; Larger field sizes

11 Achieving high practical efficiency

Most linear algebra operations boil down to (a lot of) y <- y ± a * b:
- dot-product
- matrix-matrix multiplication
- rank-1 update in Gaussian elimination
- Schur complements, ...

Efficiency relies on:
- fast arithmetic
- fast memory accesses

Here: focus on dense linear algebra.

12 Which computer arithmetic?

Many base fields/rings to support:

  Z2                      1 bit       bit-packing
  Z3, Z5, Z7              2-3 bits    bit-slicing, bit-packing
  Zp                      4-26 bits   CPU arithmetic
  Z, Q                    > 32 bits   multiprec. ints, big ints, CRT, lifting
  Zp                      > 32 bits   multiprec. ints, big ints, CRT
  GF(p^k) = Zp[X]/(Q)                 Polynomial, Kronecker, Zech log, ...

Available CPU arithmetic: boolean, integer (fixed size), floating point ... and their vectorization.

16 Using boolean arithmetic: dense linear algebra over Z2 (bit-packing)

A uint64_t packs 64 elements of Z2. Dot product (a, b):
- ^ : bit-wise XOR (+ mod 2)
- & : bit-wise AND (* mod 2)

  uint64_t t = 0;
  for (int k = 0; k < N/64; ++k)
      t ^= a[k] & b[k];
  c = parity(t);

where parity(x) returns x when x is a single bit, and parity(High(x) ^ Low(x)) otherwise.

Can be parallelized on 64 instances.

17 Tabulation

- avoid computing parities
- balance computation vs communication
- (slight) complexity improvement possible

The Four Russians method [Arlazarov, Dinic, Kronrod, Faradzev 70]
1 Compute all 2^k linear combinations of k rows of B.
  Gray code: each new line costs 1 vector add (vs k/2 on average).
2 Multiply chunks of length k of A by table look-up.

  [Figure: table of the 2^3 combinations of rows B1, B2, B3: 0, B3, B2+B3, B2, B1+B2, B1+B2+B3, B1+B3, B1]

For k = log n: O(n^3 / log n). In practice: choose k such that the table fits in L2 cache.

20 Dense linear algebra over Z2: the M4RI library [Albrecht, Bard, Hart 10]

- bit-packing
- Method of the Four Russians
- SIMD vectorization of boolean instructions (128-bit registers)
- cache optimization
- Strassen's O(n^2.81) algorithm

  n
  SIMD bool arithmetic            2.109s
  SIMD + 4 Russians               0.256s   2.829s   29.28s
  SIMD + 4 Russians + Strassen    0.257s   2.001s

Table: matrix product n x n by n x n, on an i5 SandyBridge 2.6GHz.

21 Dense linear algebra over Z3, Z5 [Boothby & Bradshaw 09]

Z3 = {0, 1, -1}
- encoded as {00, 01, 10}: add/sub in 7 bool ops
- encoded as {00, 10, 11}: add/sub in 6 bool ops

Bit-slicing: (-1, 0, 1, 0, 1, -1, -1, 0) in Z3^8
  -> (11, 00, 10, 00, 10, 11, 11, 00)
  stored as 2 words: high bits (1,0,1,0,1,1,1,0), low bits (1,0,0,0,0,1,1,0)

y <- y + x*b for x in Z3 and y, b in Z3^64: 6 boolean word ops.

Recipe for Z5:
- use redundant representations on 3 bits + bit-slicing
- integer add + bool operations
- pseudo-reduction mod 5 (4 -> 3 bits) in 8 bool ops, found by computer-assisted search.
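The 6-operation bit-sliced addition can be written down explicitly for the encoding 0 -> (h,l) = (0,0), 1 -> (1,0), -1 -> (1,1). The sequence below is one valid 6-op solution derived for this encoding (an assumption: it is not necessarily the exact sequence published by Boothby and Bradshaw, which may differ while having the same cost).

```cpp
#include <cassert>
#include <cstdint>

// Bit-sliced vectors over Z/3Z: 64 elements per pair of words, encoded as
// 0 -> (h,l)=(0,0), 1 -> (1,0), -1 -> (1,1).
struct Sliced3 { uint64_t h, l; };

// Element-wise addition in Z/3Z in 6 boolean word operations.
// (One valid sequence for this encoding, derived by hand; assumption:
// not necessarily Boothby & Bradshaw's exact sequence.)
Sliced3 add3(Sliced3 x, Sliced3 y) {
    uint64_t t1 = x.h ^ y.l;
    uint64_t t2 = x.l ^ y.h;
    uint64_t l  = t1 & t2;                   // low bit of the sum
    uint64_t h  = (x.h ^ y.h) | (t1 ^ x.l);  // high bit: 0 iff y = -x
    return {h, l};
}
```

Each call processes 64 field elements at once, which is exactly the y <- y + x*b throughput claimed on the slide.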

27 Dense linear algebra over Zp for word-size p

Delayed modular reductions:
1 Compute using integer arithmetic.
2 Reduce modulo p only when necessary.

When to reduce? Bound the values of all intermediate computations.

A priori, depending on the representation of Zp:

                                        {0, ..., p-1}                    {-(p-1)/2, ..., (p-1)/2}
  Scalar product, classic MatMul        n(p-1)^2                         n((p-1)/2)^2
  Strassen-Winograd (l rec. levels)     ((1+3^l)/2)^2 (n/2^l) (p-1)^2    9^l (n/2^l) ((p-1)/2)^2

Or: maintain locally a bounding interval on all matrices computed.
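The delayed-reduction strategy can be sketched for a dot product in double arithmetic: doubles represent integers exactly up to 2^53, so with entries in [0, p-1] up to k = 2^53 / (p-1)^2 products can be accumulated before a reduction is needed. The function name and interface are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Delayed-reduction dot product over Z/pZ in double arithmetic (sketch;
// entries assumed already reduced in [0, p-1]). All intermediate values
// stay below 2^53, so every double operation is exact.
int64_t dot_modp(const std::vector<double>& a, const std::vector<double>& b,
                 int64_t p) {
    const double two53 = 9007199254740992.0;               // 2^53
    const size_t k = size_t(two53 / double((p - 1) * (p - 1)));
    double acc = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        acc += a[i] * b[i];
        if ((i + 1) % k == 0)
            acc = std::fmod(acc, double(p));               // delayed reduction
    }
    return int64_t(std::fmod(acc, double(p)));
}
```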

31 Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 Use the CPU's integer arithmetic units (+ vectorization).
  y += a * b: correct if ab + y < 2^63, e.g. a, b < 2^31.

  Scalar:                           Vectorized:
    movq (%rax,%rdx,8), %rax          vpmuludq %xmm3, %xmm0, %xmm0
    imulq -56(%rbp), %rax             vpaddq %xmm2, %xmm0, %xmm0
    addq %rax, %rcx                   vpsllq $32, %xmm0, %xmm0
    movq -80(%rbp), %rax

2 Use the CPU's floating point units (+ vectorization).
  y += a * b: correct if ab + y < 2^53, e.g. a, b < 2^26.

  Scalar:                           Vectorized:
    movsd (%rax,%rdx,8), %xmm0        vinsertf128 $0x1, 16(%rcx,%rax), %ymm0,
    mulsd -56(%rbp), %xmm0            vmulpd %ymm1, %ymm0, %ymm0
    addsd %xmm0, %xmm1                vaddpd (%rsi,%rax), %ymm0, %ymm0
    movq %xmm1, %rax                  vmovapd %ymm0, (%rsi,%rax)

37 Exploiting in-core parallelism

Ingredients:
- SIMD (Single Instruction Multiple Data): 1 arithmetic unit acting on a vector of data.
  MMX: 64 bits; SSE: 128 bits; AVX: 256 bits; AVX-512: 512 bits.
  [Figure: 4 x 64 = 256 bits: x[0..3] + y[0..3] -> x[0]+y[0], x[1]+y[1], x[2]+y[2], x[3]+y[3]]
- Pipeline: amortize the latency of an operation when used repeatedly.
  Throughput of 1 op/cycle for all arithmetic ops considered here.
- Execution unit parallelism: multiple arithmetic units acting simultaneously on distinct registers.

40 SIMD and vectorization: Intel Sandybridge micro-architecture

[Diagram: scheduler dispatching to Port 0 (FMUL, Int MUL), Port 1 (FADD, Int ADD), Port 5 (Int ADD); FP units 4 x 64 = 256 bits, integer units 2 x 64 = 128 bits]

Performs at every clock cycle:
- 1 floating pt. mul (x4) and 1 floating pt. add (x4), or
- 1 integer mul (x2) and 2 integer adds (x2)

41 SIMD and vectorization: Intel Haswell micro-architecture

[Diagram: scheduler dispatching to Port 0 (FMA, Int MUL), Port 1 (FMA, Int ADD), Port 5 (Int ADD); all units 4 x 64 = 256 bits]

Performs at every clock cycle:
- 2 floating pt. mul & add (x4), or
- 1 integer mul (x4) and 2 integer adds (x4)

FMA: fused multiply & accumulate, c += a * b.

42 SIMD and vectorization: AMD Bulldozer micro-architecture

[Diagram: scheduler dispatching to Ports 0-3 (FMA, Int MUL; FMA, Int ADD; Int ADD); units 2 x 64 = 128 bits]

Performs at every clock cycle:
- 2 floating pt. mul & add (x2), or
- 1 integer mul (x2) and 2 integer adds (x2)

FMA: fused multiply & accumulate, c += a * b.

43 SIMD and vectorization: Intel Nehalem micro-architecture

[Diagram: scheduler dispatching to Port 0 (FMUL, Int MUL), Port 1 (FADD, Int ADD), Port 5 (Int ADD); units 2 x 64 = 128 bits]

Performs at every clock cycle:
- 1 floating pt. mul (x2) and 1 floating pt. add (x2), or
- 1 integer mul (x2) and 2 integer adds (x2)

44 Summary: 64 bits AXPY throughput

[Table: register size, # adders, # multipliers, # FMA, # daxpy/cycle, CPU freq. (GHz), speed of light (Gfops) and speed in practice (Gfops), for the INT and FP units of Intel Haswell (AVX2), Intel Sandybridge (AVX1), AMD Bulldozer (FMA4), Intel Nehalem (SSE4) and AMD K10 (SSE4a)]

Speed of light: CPU freq. x (# daxpy / cycle) x 2

54 Computing over fixed size integers: resources

Micro-architecture bible: Agner Fog's software optimization resources.

Experiments:
- dgemm (double): OpenBLAS
- igemm (int64_t): Eigen & FFLAS-FFPACK [linalg.org/projects/fflas-ffpack]

55 Integer packing

32 bits: half the precision, twice the speed.

[Bar chart: Gfops of matrix product for double, float, int64_t and int32_t, on Intel SandyBridge, Intel Haswell and AMD Bulldozer]

56 Computing over fixed size integers

[Plots: speed (Gfops) and bit speed (Gbops) of fgemm C = A x B, n = 2000, for float mod p, double mod p and int64 mod p, as a function of the bitsize of p; SandyBridge i5-3320M @ 3.3GHz]

Take home message:
- Floating pt. arithmetic delivers the highest speed (except in corner cases).
- 32 bits is twice as fast as 64 bits.
- Best bit computation throughput with double precision floating points.

58 Larger finite fields: log2 p > 32

As before:
1 Use adequate integer arithmetic.
2 Reduce modulo p only when necessary.

Which integer arithmetic?
1 big integers (GMP)
2 fixed size multiprecision (Givaro-RecInt)
3 Residue Number Systems (Chinese Remainder Theorem), using moduli delivering optimum bitspeed

  log2 p
  GMP       58.1s    94.6s   140s
  RecInt    5.7s     28.6s   837s
  RNS       0.785s   1.42s   1.88s

n = 1000, on an Intel SandyBridge.

60 In practice

[Plots: speed (Gfops) and bit speed (Gbops) of fgemm C = A x B, n = 2000, for float mod p, double mod p, int64 mod p, RNS mod p and RecInt mod p, as a function of the bitsize of p]

63 Reductions and building blocks

Outline: 2 Reductions and building blocks — In dense linear algebra; In blackbox linear algebra

64 Reductions to building blocks

Huge number of algorithmic variants for a given computation in O(n^3).

Need to structure the design of a set of routines:
- focus the tuning effort on a single routine
- some operations deliver better efficiency:
  - in practice: memory access pattern
  - in theory: better algorithms

65 Memory access pattern

The memory wall: communication speed improves more slowly than arithmetic.

[Figure: CPU arithmetic throughput vs memory speed over time]

Deep memory hierarchy: RAM -> L3 -> L2 -> L1 -> CPU.
Need to overlap communications with computation.

Design of BLAS 3 [Dongarra et al. 87]:
- group all operations into matrix products (gemm): work O(n^3) >> data O(n^2)
- MatMul has become a building block in practice.
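The blocking idea behind BLAS 3 can be sketched in a few lines: work on square tiles so each tile is reused many times while resident in cache, which is what turns the O(n^3)-work-on-O(n^2)-data ratio into an actual speedup. The function name and the tile size BS are assumptions for illustration; production BLAS tune BS per cache level.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Cache-blocked matrix product C += A * B (row-major n x n matrices).
// Each BS x BS tile of A, B and C is touched repeatedly while it fits in
// cache, overlapping memory traffic with arithmetic.
void gemm_blocked(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n, int BS = 64) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < std::min(ii + BS, n); ++i)
                    for (int k = kk; k < std::min(kk + BS, n); ++k) {
                        double aik = A[i * n + k];   // reused across the j loop
                        for (int j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```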

68 Sub-cubic linear algebra

< 1969: O(n^3) for everyone (Gauss, Householder, Danilevskĭi, etc.)

Matrix multiplication: O(n^omega)
- [Strassen 69]: O(n^2.807)
- [Schönhage 81]: O(n^2.52)
- [Coppersmith, Winograd 90]: O(n^2.376)
- [Le Gall 14]: O(n^2.3729)

Other operations:
- [Strassen 69]: Inverse in O(n^omega)
- [Schönhage 72]: QR in O(n^omega)
- [Bunch, Hopcroft 74]: LU in O(n^omega)
- [Ibarra et al. 82]: Rank in O(n^omega)
- [Keller-Gehrig 85]: CharPoly in O(n^omega log n)

MatMul has become a building block in theoretical reductions.
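Strassen's trade of 8 multiplications for 7 can be shown at the base case: the classical seven products M1..M7 on a 2x2 matrix. Applied recursively on blocks instead of scalars, this yields the O(n^2.807) bound. The scalar 2x2 version below is a sketch of one recursion level only.

```cpp
#include <array>
#include <cassert>

// One level of Strassen's recursion on 2x2 matrices: 7 multiplications
// instead of 8. Entries are row-major: [a b; c d] = {a, b, c, d}.
using M2 = std::array<long long, 4>;

M2 strassen2(const M2& A, const M2& B) {
    long long m1 = (A[0] + A[3]) * (B[0] + B[3]);
    long long m2 = (A[2] + A[3]) * B[0];
    long long m3 = A[0] * (B[1] - B[3]);
    long long m4 = A[3] * (B[2] - B[0]);
    long long m5 = (A[0] + A[1]) * B[3];
    long long m6 = (A[2] - A[0]) * (B[0] + B[1]);
    long long m7 = (A[1] - A[3]) * (B[2] + B[3]);
    // Recombine: C11, C12, C21, C22
    return {{m1 + m4 - m5 + m7, m3 + m5, m2 + m4, m1 - m2 + m3 + m6}};
}
```

In an exact setting the additions and subtractions are exact, so unlike the numerical case there is no stability penalty, only the growth of intermediate values bounded on the earlier slides.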

72 Reductions: theory and practice

[Reduction tree: HNF(Z), SNF(Z), Det(Z) and LinSys(Z) reduce to MM(Z); LinSys(Zp), Det(Zp), MinPoly(Zp) and CharPoly(Zp) reduce via LU(Zp) and TRSM(Zp) to MM(Zp)]

Common mistrust:
- fast linear algebra is never faster
- numerically unstable

Lucky coincidence:
- same building block in theory and in practice
- reduction trees are still relevant

Road map towards efficiency in practice:
1 Tune the MatMul building block.
2 Tune the reductions.
3 New reductions.

76 Reductions and building blocks In dense linear algebra
Putting it together: MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: k (p−1)^2 < 2^mantissa

77 Reductions and building blocks In dense linear algebra
Putting it together: MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: k (p−1)^2 < 2^53
Fastest integer arithmetic: double
Cache optimizations: numerical BLAS
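The delayed-reduction ingredient can be illustrated with a small Python sketch (illustrative only, not FFLAS-FFPACK code; `max_delayed_terms` and `dot_mod_p` are hypothetical names): as long as k·(p−1)² fits in the floating-point mantissa, a dot product of k residues can be accumulated exactly with a single reduction at the end.

```python
def max_delayed_terms(p, mantissa_bits=53):
    """Largest k such that k*(p-1)^2 < 2^mantissa_bits (exact accumulation)."""
    return (2**mantissa_bits - 1) // (p - 1)**2

def dot_mod_p(u, v, p):
    """Dot product over Z/pZ with one modular reduction per chunk of k terms,
    instead of one reduction per multiplication."""
    k = max_delayed_terms(p)
    acc = 0
    for i in range(0, len(u), k):
        # each chunk stays below 2^53, so it would also be exact in a double
        chunk = sum(a * b for a, b in zip(u[i:i+k], v[i:i+k]))
        acc = (acc + chunk) % p
    return acc
```

For p = 83 the chunk length is in the billions, which is why a single reduction per dot product suffices in practice for moderate dimensions.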

78 Reductions and building blocks In dense linear algebra
Putting it together: MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: 9^l ⌈k/2^l⌉ ((p−1)/2)^2 < 2^53
Fastest integer arithmetic: double
Cache optimizations: numerical BLAS
Strassen-Winograd: 6n^2.807

79 Reductions and building blocks In dense linear algebra
Putting it together: MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: 9^l ⌈k/2^l⌉ ((p−1)/2)^2 < 2^53
Fastest integer arithmetic: double
Cache optimizations: numerical BLAS
Strassen-Winograd: 6n^2.807, with memory efficient schedules [Boyer, Dumas, P. and Zhou 09]
Tradeoffs: extra memory / overwriting input / leading constant; fully in-place in 7.2n^2.807

80 Reductions and building blocks In dense linear algebra
Sequential Matrix Multiplication (i5-3320M at 2.6 GHz with AVX)
[Plot: n^3/time/10^9 (Gfops equiv.) vs matrix dimension; curve: OpenBLAS sgemm]

81 Reductions and building blocks In dense linear algebra
Sequential Matrix Multiplication (i5-3320M at 2.6 GHz with AVX)
[Plot: n^3/time/10^9 (Gfops equiv.) vs matrix dimension; curves: FFLAS fgemm over Z/83Z, OpenBLAS sgemm]
p = 83: 1 mod / … mul.

82 Reductions and building blocks In dense linear algebra
Sequential Matrix Multiplication (i5-3320M at 2.6 GHz with AVX)
[Plot: n^3/time/10^9 (Gfops equiv.) vs matrix dimension; curves: FFLAS fgemm over Z/83Z, FFLAS fgemm over Z/821Z, OpenBLAS sgemm]
p = 83: 1 mod / … mul.
p = 821: 1 mod / 100 mul.

83 Reductions and building blocks In dense linear algebra
Sequential Matrix Multiplication (i5-3320M at 2.6 GHz with AVX)
[Plot: n^3/time/10^9 (Gfops equiv.) vs matrix dimension; curves: FFLAS fgemm over Z/83Z, FFLAS fgemm over Z/821Z, OpenBLAS sgemm, FFLAS fgemm over Z/…Z, FFLAS fgemm over Z/…Z, OpenBLAS dgemm]
p = 83: 1 mod / … mul.
p = 821: 1 mod / 100 mul.
p = …: 1 mod / … mul.
p = …: 1 mod / 100 mul.

84 Reductions and building blocks In dense linear algebra
Reductions in dense linear algebra
LU decomposition: block recursive algorithm reduces to MatMul, O(n^ω)

n             | …      | …     | …      | …
LAPACK-dgetrf | 0.024s | 2.01s | 14.88s | 48.78s
fflas-ffpack  | 0.058s | 2.46s | 16.08s | 47.47s

Intel Haswell E5-… at … GHz using OpenBLAS

85 Reductions and building blocks In dense linear algebra
Reductions in dense linear algebra
LU decomposition: block recursive algorithm reduces to MatMul, O(n^ω)

n             | …      | …     | …      | …
LAPACK-dgetrf | 0.024s | 2.01s | 14.88s | 48.78s
fflas-ffpack  | 0.058s | 2.46s | 16.08s | 47.47s

Intel Haswell E5-… at … GHz using OpenBLAS

Characteristic polynomial: a new reduction to matrix multiplication in O(n^ω).

n            | …      | …      | …      | …
magma-v…     | …s     | 24.28s | 332.7s | 2497s
fflas-ffpack | 0.532s | 2.936s | 32.71s | 219.2s

Intel Ivy-Bridge i… at … GHz using OpenBLAS


87 Reductions and building blocks In dense linear algebra
The case of Gaussian elimination. Which reduction to MatMul?
Slab iterative (LAPACK)
Slab recursive (FFLAS-FFPACK)
Tile iterative (PLASMA)
Tile recursive (FFLAS-FFPACK)

88 Reductions and building blocks In dense linear algebra
The case of Gaussian elimination. Which reduction to MatMul?
Slab recursive (FFLAS-FFPACK)
Tile recursive (FFLAS-FFPACK)
Sub-cubic complexity: recursive algorithms

89 Reductions and building blocks In dense linear algebra
The case of Gaussian elimination. Which reduction to MatMul?
Tile recursive (FFLAS-FFPACK)
Sub-cubic complexity: recursive algorithms
Data locality

90–106 Reductions and building blocks In dense linear algebra
Block algorithms: Tiled Iterative, Slab Recursive, Tiled Recursive
[Animation frames stepping each variant through its three kernels:]
getrf: A → L, U
trsm: B ← B U^{-1}, B ← L^{-1} B
gemm: C ← C − A·B
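The getrf/trsm/gemm pattern of a right-looking tiled elimination can be sketched over Z/pZ. This is an illustrative sketch, not FFLAS-FFPACK code: it assumes every pivot encountered is nonzero (no pivoting), which the actual algorithms do not.

```python
def block_lu(A, p, b=2):
    """Right-looking tiled LU in place over Z/pZ (p prime, pivots assumed
    nonzero): getrf on the diagonal block, trsm on the panels, gemm on the
    trailing matrix. After the call, A holds L (unit lower) and U."""
    n = len(A)
    for k in range(0, n, b):
        e = min(k + b, n)
        # getrf: factor the diagonal block A[k:e, k:e]
        for c in range(k, e):
            inv = pow(A[c][c], p - 2, p)
            for i in range(c + 1, e):
                A[i][c] = A[i][c] * inv % p
                for j in range(c + 1, e):
                    A[i][j] = (A[i][j] - A[i][c] * A[c][j]) % p
        # trsm: A[k:e, e:n] <- L_kk^{-1} * A[k:e, e:n] (unit lower, forward subst.)
        for j in range(e, n):
            for i in range(k, e):
                s = A[i][j]
                for c in range(k, i):
                    s -= A[i][c] * A[c][j]
                A[i][j] = s % p
        # trsm: A[e:n, k:e] <- A[e:n, k:e] * U_kk^{-1}
        for i in range(e, n):
            for jj in range(k, e):
                s = A[i][jj]
                for c in range(k, jj):
                    s -= A[i][c] * A[c][jj]
                A[i][jj] = s * pow(A[jj][jj], p - 2, p) % p
        # gemm: A[e:n, e:n] <- A[e:n, e:n] - A[e:n, k:e] * A[k:e, e:n]
        for i in range(e, n):
            for j in range(e, n):
                s = A[i][j]
                for c in range(k, e):
                    s -= A[i][c] * A[c][j]
                A[i][j] = s % p
    return A
```

Almost all the work lands in the gemm update, which is exactly why the MatMul building block dominates in practice.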

107 Reductions and building blocks In dense linear algebra
Counting modular reductions (block size k ≥ 1):
Tiled Iter. Right looking: n^3/(3k) + O(n^2)
Tiled Iter. Left looking:  (2 − 1/(2k)) n^2 + O(kn)
Tiled Iter. Crout:         (5/2 − 1/(2k)) n^2 + O(kn)

108 Reductions and building blocks In dense linear algebra
Counting modular reductions (block size k ≥ 1, and the unblocked case k = 1):
Tiled Iter. Right looking: n^3/(3k) + O(n^2)
Tiled Iter. Left looking:  (2 − 1/(2k)) n^2 + O(kn)
Tiled Iter. Crout:         (5/2 − 1/(2k)) n^2 + O(kn)
k = 1:
Iter. Right looking: n^3/3 + O(n^2)
Iter. Left looking:  (3/2) n^2 + O(n)
Iter. Crout:         2 n^2 + O(n)

109 Reductions and building blocks In dense linear algebra
Counting modular reductions (block size k ≥ 1, unblocked case k = 1, and recursive variants):
Tiled Iter. Right looking: n^3/(3k) + O(n^2)
Tiled Iter. Left looking:  (2 − 1/(2k)) n^2 + O(kn)
Tiled Iter. Crout:         (5/2 − 1/(2k)) n^2 + O(kn)
k = 1:
Iter. Right looking: n^3/3 + O(n^2)
Iter. Left looking:  (3/2) n^2 + O(n)
Iter. Crout:         2 n^2 + O(n)
Tiled Recursive: 2 n^2 − n log_2 n + O(n)
Slab Recursive:  O(n^2 log n)

110 Reductions and building blocks In dense linear algebra
Impact in practice: sequential LU decomposition variants on one core
[Plot: (2/3)n^3/time/10^9 vs matrix dimension; curves: Right(k=212), Left(k=212), Crout(k=212), Tiled-Rec, Slab-Rec]
As anticipated: Right-looking < Crout < Left-looking

111 Reductions and building blocks In dense linear algebra
Impact in practice: sequential LU decomposition variants on one core
[Plot: (2/3)n^3/time/10^9 vs matrix dimension; curves: Right(k=n/3), Left(k=n/3), Crout(k=n/3), Tiled-Rec, Slab-Rec]
As anticipated: Right-looking < Crout < Left-looking

112 Reductions and building blocks In dense linear algebra
Dealing with rank deficiencies and computing rank profiles
Rank profiles: indices of the first linearly independent columns (resp. rows)
major invariant of a matrix (echelon form)
Gröbner basis computations (Macaulay matrix)
Krylov methods
Gaussian elimination revealing echelon forms:
[Ibarra, Moran and Hui 82]: A = LSP
[Keller-Gehrig 85]: XA = R
[Jeannerod, P. and Storjohann 13]: A = PLE

113 Reductions and building blocks In dense linear algebra
Computing rank profiles
Lessons learned (or what we thought was necessary):
treat rows in order
exhaust all columns before considering the next row
slab block splitting required (recursive or iterative)
similar to partial pivoting

114 Reductions and building blocks In dense linear algebra
Computing rank profiles
Lessons learned (or what we thought was necessary):
treat rows in order
exhaust all columns before considering the next row
slab block splitting required (recursive or iterative)
similar to partial pivoting
Tiled recursive PLUQ [Dumas, P. and Sultan 13,15]:
1 Generalized to handle rank deficiency: 4 recursive calls, in-place computation
2 Pivoting strategies exist to recover rank profiles and echelon forms

115–129 Reductions and building blocks In dense linear algebra
A tiled recursive algorithm [Dumas, P. and Sultan 13]
[Animation frames:]
1 2×2 block splitting
2 Recursive call
3 TRSM: B ← B U^{-1}; TRSM: B ← L^{-1} B
4 MatMul: C ← C − A·B (three updates)
5 2 independent recursive calls
6 TRSM: B ← B U^{-1}; TRSM: B ← L^{-1} B; MatMul: C ← C − A·B (three updates)
7 Recursive call
8 Puzzle game (block cyclic rotations)

130 Reductions and building blocks In dense linear algebra
A tiled recursive algorithm [Dumas, P. and Sultan 13]
O(mn r^{ω−2}) (degenerating to (2/3)n^3 in the full-rank case)
computes col. and row rank profiles of all leading sub-matrices
fewer modular reductions than slab algorithms
rank deficiency introduces parallelism

131 Reductions and building blocks In dense linear algebra
Computing all rank profiles at once [Dumas, P. and Sultan, ISSAC 15 (Thursday 3PM)]
Definition (Rank Profile matrix): the unique R_A ∈ {0,1}^{m×n} such that every (i,j)-leading sub-matrix of R_A has the same rank as the corresponding (i,j)-leading sub-matrix of A.

132–134 Reductions and building blocks In dense linear algebra
Computing all rank profiles at once [Dumas, P. and Sultan, ISSAC 15 (Thursday 3PM)]
Definition (Rank Profile matrix): the unique R_A ∈ {0,1}^{m×n} such that every (i,j)-leading sub-matrix of R_A has the same rank as the corresponding (i,j)-leading sub-matrix of A.
Theorem: RowRP and ColRP can be read directly on R(A); the same holds for any (i,j)-leading submatrix.
[Examples on successive leading submatrices of the same A: RowRP = {1}, ColRP = {1}; RowRP = {1,2}, ColRP = {1,3}; RowRP = {1,4}, ColRP = {1,2}]

135–138 Reductions and building blocks In dense linear algebra
Computing all rank profiles at once [Dumas, P. and Sultan, ISSAC 15 (Thursday 3PM)]
Definition (Rank Profile matrix): the unique R_A ∈ {0,1}^{m×n} such that every (i,j)-leading sub-matrix of R_A has the same rank as the corresponding (i,j)-leading sub-matrix of A.
Theorem: RowRP and ColRP can be read directly on R(A); the same holds for any (i,j)-leading submatrix.
A = PLUQ = P [L 0; M I_{m−r}] [I_r 0; 0 0] [U V; 0 I_{n−r}] Q
  = (P [L 0; M I_{m−r}] P^T) (P [I_r 0; 0 0] Q) (Q^T [U V; 0 I_{n−r}] Q)
  = L · Π_{P,Q} · U
With appropriate pivoting: Π_{P,Q} = R(A).
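The defining property of R(A) can be checked by brute force: compute the rank of every leading submatrix over Q and take rank differences by inclusion-exclusion. This is an illustrative sketch (far from the O(mn r^{ω−2}) algorithm of the paper); the function names are ours.

```python
from fractions import Fraction

def rank(M):
    """Rank over Q by exact Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for j in range(len(M[0]) if M else 0):
        piv = next((i for i in range(r, len(M)) if M[i][j]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][j] / M[r][j]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def rank_profile_matrix(A):
    """Brute-force R(A): entry (i,j) is the rank jump of the leading
    submatrices, r(i+1,j+1) - r(i,j+1) - r(i+1,j) + r(i,j) in {0,1}."""
    m, n = len(A), len(A[0])
    r = [[rank([row[:j] for row in A[:i]]) for j in range(n + 1)]
         for i in range(m + 1)]
    return [[r[i+1][j+1] - r[i][j+1] - r[i+1][j] + r[i][j]
             for j in range(n)] for i in range(m)]
```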

139–142 Reductions and building blocks In blackbox linear algebra
Reductions in black box linear algebra: A ∈ K^{n×n}, v ∈ K^n ↦ Av ∈ K^n
Matrix-vector product: building block, costs E(n)
Minimal polynomial: [Wiedemann 86], iterative Krylov/Lanczos methods, O(nE(n) + n^2)
Rank, Det, Solve: [Chen et al. 02], reduces to MinPoly + preconditioners, O~(nE(n) + n^2)
Characteristic polynomial: [Dumas, P. and Saunders 09], reduces to MinPoly, Rank, ...
[Reduction diagram: CharPoly(Z), MinPoly(Z) [Dumas, P. and Saunders 09]; CharPoly(Z_p) [Dumas, P. and Saunders 09], Det(Z_p), LinSys(Z_p), Rank(Z_p) [Chen et al. 02; Kaltofen, Saunders 91]; MinPoly(Z_p) [Wiedemann 86]; MatVecProd(Z_p)]
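Wiedemann's reduction can be sketched: project the Krylov sequence u^T A^i v and recover its minimal linear recurrence with Berlekamp-Massey; with random u, v this is the minimal polynomial of A with high probability. Illustrative Python sketch (our function names; deterministic u, v may be supplied for testing).

```python
import random

def berlekamp_massey(s, p):
    """Shortest linear recurrence of s mod p: returns C = [1, c1, ..., cL]
    with sum_j C[j]*s[i-j] == 0 (mod p) for all i >= L."""
    C, B, L, m, b = [1], [1], 0, 1, 1
    for i in range(len(s)):
        d = sum(C[j] * s[i - j] for j in range(L + 1)) % p
        if d == 0:
            m += 1
            continue
        T = C[:]
        coef = d * pow(b, p - 2, p) % p
        if len(B) + m > len(C):
            C = C + [0] * (len(B) + m - len(C))
        for j in range(len(B)):
            C[j + m] = (C[j + m] - coef * B[j]) % p
        if 2 * L <= i:
            L, B, b, m = i + 1 - L, T, d, 1
        else:
            m += 1
    return C[:L + 1]

def wiedemann_minpoly(apply_A, n, p, u=None, v=None):
    """Minimal generating polynomial of the projected Krylov sequence
    u^T A^i v (w.h.p. the minimal polynomial of the blackbox apply_A).
    Returns monic coefficients, low-to-high."""
    rng = random.Random(42)
    u = u or [rng.randrange(1, p) for _ in range(n)]
    v = v or [rng.randrange(1, p) for _ in range(n)]
    s, w = [], v
    for _ in range(2 * n):          # 2n terms suffice for degree <= n
        s.append(sum(a * b for a, b in zip(u, w)) % p)
        w = apply_A(w)              # the only access to the matrix
    return list(reversed(berlekamp_massey(s, p)))
```

The matrix is only ever touched through `apply_A`, which is the whole point of the blackbox model: the cost is 2n products plus O(n^2).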

143 Size dimension trade-offs Outline 1 Choosing the underlying arithmetic Using boolean arithmetic Using machine word arithmetic Larger field sizes 2 Reductions and building blocks In dense linear algebra In blackbox linear algebra 3 Size dimension trade-offs Hermite normal form Frobenius normal form 4 Parallel exact linear algebra Ingredients for the parallelization Parallel dense linear algebra mod p C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

144 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Multimodular methods:
over K[X]: evaluation–interpolation
over Z, Q: Chinese Remainder Theorem
Cost = algebraic cost × size(output)
avoids coefficient blow-up
uniform (worst case) cost for all arithmetic ops

145 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Multimodular methods:
over K[X]: evaluation–interpolation
over Z, Q: Chinese Remainder Theorem
Cost = algebraic cost × size(output)
avoids coefficient blow-up
uniform (worst case) cost for all arithmetic ops
Example. Hadamard's bound: |det(A)| ≤ (||A|| √n)^n.
LinSys_Z(n) = O(n^ω · n(log n + log ||A||))

146 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Multimodular methods:
over K[X]: evaluation–interpolation
over Z, Q: Chinese Remainder Theorem
Cost = algebraic cost × size(output)
avoids coefficient blow-up
uniform (worst case) cost for all arithmetic ops
Example. Hadamard's bound: |det(A)| ≤ (||A|| √n)^n.
LinSys_Z(n) = O(n^ω · n(log n + log ||A||)) = O~(n^{ω+1} log ||A||)
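The CRT approach can be sketched for the determinant: compute det A modulo several word-size primes, then recombine once the product of the primes exceeds twice the Hadamard bound. Illustrative Python sketch (our function names; prime selection is left to the caller).

```python
from math import isqrt, prod

def det_mod_p(A, p):
    """Determinant mod p by Gaussian elimination over Z/pZ (p prime)."""
    A = [[x % p for x in row] for row in A]
    n, det = len(A), 1
    for j in range(n):
        piv = next((i for i in range(j, n) if A[i][j]), None)
        if piv is None:
            return 0
        if piv != j:
            A[j], A[piv] = A[piv], A[j]
            det = -det                     # a row swap flips the sign
        det = det * A[j][j] % p
        inv = pow(A[j][j], p - 2, p)
        for i in range(j + 1, n):
            f = A[i][j] * inv % p
            for k in range(j, n):
                A[i][k] = (A[i][k] - f * A[j][k]) % p
    return det % p

def det_crt(A, primes):
    """det(A) over Z by CRT; the primes must multiply past 2x Hadamard."""
    H = prod(isqrt(sum(x * x for x in row)) + 1 for row in A)  # Hadamard bound
    M = prod(primes)
    assert M > 2 * H, "not enough primes for the Hadamard bound"
    d, m = 0, 1
    for p in primes:                       # incremental CRT recombination
        t = (det_mod_p(A, p) - d) * pow(m, -1, p) % p
        d, m = d + m * t, m * p
    return d if d <= M // 2 else d - M     # lift to the symmetric range
```

This is the "Cost = algebraic cost × size(output)" pattern: one O(n^3) elimination per prime, with the number of primes dictated by the output size bound.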

147 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Lifting techniques. p-adic lifting [Moenck & Carter 79, Dixon 82]:
one computation over Z_p
iterative lifting of the solution to Z, Q
Example: LinSys_Z(n) = O(n^3 log^{1+ε} ||A||)

148 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Lifting techniques. p-adic lifting [Moenck & Carter 79, Dixon 82]:
one computation over Z_p
iterative lifting of the solution to Z, Q
High order lifting [Storjohann 02,03]:
fewer iteration steps, larger dimension in the lifting
Example: LinSys_Z(n) = O~(n^ω log ||A||)
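The p-adic lifting loop can be sketched: one matrix inversion mod p up front, then k cheap correction steps, each solving mod p and dividing the residual by p exactly. Illustrative Python sketch (our function names); the real Dixon solver would finish with rational reconstruction of x ∈ Q^n from its p-adic expansion, which is omitted here.

```python
def mat_inv_mod(A, p):
    """Inverse of A mod p by Gauss-Jordan (A assumed invertible mod p)."""
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for j in range(n):
        piv = next(i for i in range(j, n) if M[i][j])
        M[j], M[piv] = M[piv], M[j]
        inv = pow(M[j][j], p - 2, p)
        M[j] = [x * inv % p for x in M[j]]
        for i in range(n):
            if i != j and M[i][j]:
                f = M[i][j]
                M[i] = [(a - f * b) % p for a, b in zip(M[i], M[j])]
    return [row[n:] for row in M]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dixon_lift(A, b, p, k):
    """p-adic lifting: returns x with A x = b (mod p^k), using one inversion
    mod p and k correction steps of cost O(n^2) arithmetic each."""
    C = mat_inv_mod(A, p)
    x, r, pk = [0] * len(b), b[:], 1
    for _ in range(k):
        xi = [v % p for v in matvec(C, r)]            # next p-adic digit
        x = [xj + pk * d for xj, d in zip(x, xi)]
        r = [(rj - aj) // p for rj, aj in zip(r, matvec(A, xi))]  # exact division
        pk *= p
    return x
```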

149 Size dimension trade-offs
Improving time complexities
Matrix multiplication: door to fast linear algebra over Z:
O(n^ω M(log ||A||)) = O~(n^ω log ||A||)

150 Size dimension trade-offs
Improving time complexities
Matrix multiplication: door to fast linear algebra over Z:
O(n^ω M(log ||A||)) = O~(n^ω log ||A||)
Equivalence over Z or K[X]: Hermite normal form
[Kannan & Bachem 79]: polynomial time
[Chou & Collins 82]: O~(n^6 log ||A||)
[Domich et al. 87], [Iliopoulos 89]: O~(n^4 log ||A||)
[Micciancio & Warinschi 01]: O~(n^5 log^2 ||A||), O~(n^3 log ||A||) heuristic
[Storjohann & Labahn 96]: O~(n^{ω+1} log ||A||)
[Gupta & Storjohann 11]: O~(n^3 log ||A||)

151 Size dimension trade-offs
Improving time complexities
Matrix multiplication: door to fast linear algebra over Z:
O(n^ω M(log ||A||)) = O~(n^ω log ||A||)
Equivalence over Z or K[X]: Hermite normal form
[Kannan & Bachem 79]: polynomial time
[Chou & Collins 82]: O~(n^6 log ||A||)
[Domich et al. 87], [Iliopoulos 89]: O~(n^4 log ||A||)
[Micciancio & Warinschi 01]: O~(n^5 log^2 ||A||), O~(n^3 log ||A||) heuristic
[Storjohann & Labahn 96]: O~(n^{ω+1} log ||A||)
[Gupta & Storjohann 11]: O~(n^3 log ||A||)
Similarity over a field: Frobenius normal form
[Giesbrecht 93]: O~(n^ω) probabilistic
[Storjohann 00]: O~(n^ω) deterministic
[P. & Storjohann 07]: O(n^ω) probabilistic


153 Size dimension trade-offs
Building blocks and reductions, in brief
Reductions to a building block:
Matrix Mult, block recursive: Σ_{i=1}^{log n} n (n/2^i)^{ω−1} = O(n^ω) (Gauss, REF)
Matrix Mult, iterative: Σ_{k=1}^{n} k (n/k)^ω = O(n^ω) (Frobenius)
Linear Sys over Z (Hermite normal form)
Size/dimension compromises:
Hermite normal form: rank 1 updates reducing the determinant
Frobenius normal form: degree k, dimension n/k, for k = 1 ... n

154 Size dimension trade-offs Hermite normal form
Hermite normal form: naive algorithm
Reduced echelon form over a ring:
H = [p_1 x_{1,2} x_{1,3}; 0 p_2 x_{2,3}; 0 0 p_3], with 0 ≤ x_{i,j} < p_j.
for i = 1 ... n do
  (g, t_i, ..., t_n) = xgcd(A_{i,i}, A_{i+1,i}, ..., A_{n,i})
  L_i ← Σ_{j=i}^{n} t_j L_j
  for j = i+1 ... n do L_j ← L_j − (A_{j,i}/g) L_i   (eliminate)
  for j = 1 ... i−1 do L_j ← L_j − ⌊A_{j,i}/g⌋ L_i   (reduce)
end for
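The naive algorithm can be sketched with pairwise extended gcds instead of the slide's n-ary xgcd (equivalent up to unimodular row operations). Illustrative Python sketch, assuming A nonsingular over Z; the function names are ours.

```python
def xgcd(a, b):
    """Extended gcd: (g, s, t) with s*a + t*b = g."""
    s0, s1, t0, t1 = 1, 0, 0, 1
    while b:
        q, a, b = a // b, b, a % b
        s0, s1 = s1, s0 - q * s1
        t0, t1 = t1, t0 - q * t1
    return a, s0, t0

def hnf(A):
    """Row Hermite form of a nonsingular integer matrix, naive algorithm:
    xgcd-combine rows to create each pivot, zero below, reduce above."""
    A = [row[:] for row in A]
    n = len(A)
    for i in range(n):
        for j in range(i + 1, n):           # pairwise xgcd with rows below
            if A[j][i]:
                g, s, t = xgcd(A[i][i], A[j][i])
                u, v = A[i][i] // g, A[j][i] // g
                # [[s, t], [-v, u]] has determinant 1, so rows stay a basis
                A[i], A[j] = ([s*a + t*b for a, b in zip(A[i], A[j])],
                              [-v*a + u*b for a, b in zip(A[i], A[j])])
        if A[i][i] < 0:                     # normalize the pivot sign
            A[i] = [-x for x in A[i]]
        for j in range(i):                  # reduce entries above the pivot
            q = A[j][i] // A[i][i]
            A[j] = [a - q * b for a, b in zip(A[j], A[i])]
    return A
```

The intermediate entries can blow up badly, which is exactly what the modulo-determinant techniques on the following slides are designed to control.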

155 Size dimension trade-offs Hermite normal form
Computing modulo the determinant [Domich et al. 87]
Property: for A non-singular, max_{i,j} H_{i,j} ≤ det H = |det A|.
Example: A = …, H = …, det A = 1944.

156 Size dimension trade-offs Hermite normal form
Computing modulo the determinant [Domich et al. 87]
Property: for A non-singular, max_{i,j} H_{i,j} ≤ det H = |det A|.
Example: A = …, H = …, det A = 1944.
Every computation can be done modulo d = det A: H is recovered from the Hermite form of [A; d I_n].
O(n^3) operations of cost M(n(log n + log ||A||)) = O~(n^5 log^2 ||A||)

157 Size dimension trade-offs Hermite normal form
Computing modulo the determinant
Pessimistic estimate on the arithmetic size: d is large, but most coefficients of H are small
On average: only the last few columns are large
Compute H′ close to H but with small determinant

158 Size dimension trade-offs Hermite normal form
Computing modulo the determinant
Pessimistic estimate on the arithmetic size: d is large, but most coefficients of H are small
On average: only the last few columns are large
Compute H′ close to H but with small determinant
[Micciancio & Warinschi 01]
A = [B b; c^T a_{n−1,n}; d^T a_{n,n}]
d_1 = det([B; c^T]), d_2 = det([B; d^T]), g = gcd(d_1, d_2) = s d_1 + t d_2
Then det([B; s c^T + t d^T]) = g.

159 Size dimension trade-offs Hermite normal form
Micciancio & Warinschi algorithm
1 Compute d_1 = det([B; c^T]), d_2 = det([B; d^T]), (g, s, t) = xgcd(d_1, d_2)   (DoubleDet)
2 Compute H_1, the HNF of [B; s c^T + t d^T], mod g   (Modular HNF)
3 Recover H_2, the HNF of [B b; s c^T + t d^T, s a_{n−1,n} + t a_{n,n}]   (AddCol)
4 Recover H_3, the HNF of [B b; c^T a_{n−1,n}; d^T a_{n,n}]   (AddRow)


161 Size dimension trade-offs Hermite normal form
Double Determinant
First approach: LU mod p_1, ..., p_k + CRT
only one elimination for the n−2 first rows
2 updates for the last rows (triangular back substitution)
k large, such that Π_{i=1}^{k} p_i > n^{n/2} ||A||^n

162 Size dimension trade-offs Hermite normal form
Double Determinant
First approach: LU mod p_1, ..., p_k + CRT
only one elimination for the n−2 first rows
2 updates for the last rows (triangular back substitution)
k large, such that Π_{i=1}^{k} p_i > n^{n/2} ||A||^n
Second approach [Abbott, Bronstein, Mulders 99]:
Solve Ax = b; let δ = lcm(q_1, ..., q_n) where x_i = p_i/q_i
Then δ is a large divisor of D = det A
Compute D/δ by LU mod p_1, ..., p_k + CRT
k small, such that Π_{i=1}^{k} p_i > n^{n/2} ||A||^n / δ

163 Size dimension trade-offs Hermite normal form
Double Determinant: improved
Property: let x = [x_1, ..., x_n] be the solution of [A c] x = d. Then y = [−x_1/x_n, ..., −x_{n−1}/x_n, 1/x_n] is the solution of [A d] y = c.
1 system solve, 1 LU for each p_i
d_1, d_2 computed at about the cost of 1 determinant
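The property can be verified with exact rational arithmetic: divide [A c]x = d by x_n and rearrange to get the stated solution of [A d]y = c (note the minus signs on the first n−1 entries). Illustrative Python sketch; `solve` and `second_solution` are our names.

```python
from fractions import Fraction

def solve(M, rhs):
    """Dense linear solve over Q (Gauss-Jordan, M assumed nonsingular)."""
    n = len(M)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(M, rhs)]
    for j in range(n):
        piv = next(i for i in range(j, n) if M[i][j])
        M[j], M[piv] = M[piv], M[j]
        M[j] = [x / M[j][j] for x in M[j]]
        for i in range(n):
            if i != j and M[i][j]:
                f = M[i][j]
                M[i] = [a - f * b for a, b in zip(M[i], M[j])]
    return [row[n] for row in M]

def second_solution(x):
    """From x solving [A|c] x = d, build y solving [A|d] y = c (x_n != 0)."""
    xn = x[-1]
    return [-xi / xn for xi in x[:-1]] + [1 / xn]
```

One solve therefore yields both determinants' systems, which is how d_1 and d_2 come out at about the cost of a single determinant.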

164 Size dimension trade-offs Hermite normal form
AddCol
Problem: find a vector e such that
[H_1 e] = U [B b; s c^T + t d^T, s a_{n−1,n} + t a_{n,n}]
e = U [b; s a_{n−1,n} + t a_{n,n}], where U = H_1 [B; s c^T + t d^T]^{−1}
Solve a system:
the n−1 first rows are small
the last row is large

165 Size dimension trade-offs Hermite normal form
AddCol
Idea: replace the last (large) row by a random small one w^T:
[B; w^T] y = [b; a_{n−1,n}]
Let {k} be a basis of the kernel of B. Then x = y + α k, where
α = (a_{n−1,n} − (s c^T + t d^T) y) / ((s c^T + t d^T) k)
limits the expensive arithmetic to a few dot products

166 Size dimension trade-offs Frobenius normal form
Computing the Frobenius normal form
Definition: the unique F = U^{−1} A U = Diag(C_{f_0}, ..., C_{f_k}) with f_k | f_{k−1} | ... | f_0.
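The companion blocks C_{f_i} can be built explicitly. A minimal Python sketch, using one common convention (ones on the subdiagonal, negated coefficients in the last column); the function names are ours.

```python
def companion(f):
    """Companion matrix of the monic polynomial
    x^d + f[d-1] x^{d-1} + ... + f[0], with f listing the low-order
    coefficients; last column holds the negated coefficients."""
    d = len(f)
    C = [[0] * d for _ in range(d)]
    for i in range(1, d):
        C[i][i - 1] = 1           # subdiagonal of ones
    for i in range(d):
        C[i][d - 1] = -f[i]       # last column: -f[0], ..., -f[d-1]
    return C

def frobenius_from_invariants(factors):
    """Block-diagonal Diag(C_f0, ..., C_fk): the Frobenius normal form for
    invariant factors f_k | f_{k-1} | ... | f_0 (divisibility not checked)."""
    blocks = [companion(f) for f in factors]
    n = sum(len(b) for b in blocks)
    F = [[0] * n for _ in range(n)]
    off = 0
    for b in blocks:
        for i, row in enumerate(b):
            F[off + i][off:off + len(b)] = row
        off += len(b)
    return F
```

Every matrix is similar to exactly one such block-diagonal form, which is what makes it a canonical form for similarity.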

167–170 Size dimension trade-offs Frobenius normal form
Computing the Frobenius normal form [P. & Storjohann 07]
[Figures: k-shifted form → (k+1)-shifted form → Hessenberg polycyclic form]
From k-shifted to (k+1)-shifted in O(k (n/k)^ω)
Compute iteratively from a 1-shifted form
Invariant factors appear by increasing degree
Until the Hessenberg polycyclic form

171 Size dimension trade-offs Frobenius normal form
Computing the Frobenius normal form [P. & Storjohann 07]
From k-shifted to (k+1)-shifted in O(k (n/k)^ω)
Compute iteratively from a 1-shifted form; invariant factors appear by increasing degree
Until the Hessenberg polycyclic form
Total: Σ_{k=1}^{n} (1/k)^{ω−1} n^ω ≤ ζ(ω−1) n^ω = O(n^ω)

172 Size dimension trade-offs Frobenius normal form Computing the Frobenius normal form [P. & Storjohann 07] Hessenberg polycyclic form (diagram, diagonal blocks of degrees d_1, ..., d_l). From k-shifted to (k+1)-shifted in O(k (n/k)^ω). Compute iteratively from a 1-shifted form. Invariant factors appear by increasing degree. Until the Hessenberg polycyclic form. Total cost: Σ_{k=1}^{n} k (n/k)^ω = n^ω Σ_{k=1}^{n} (1/k)^{ω−1} ≤ ζ(ω−1) n^ω = O(n^ω). Generalized to the Frobenius form as well. Transformation matrix in O(n^ω log log n). C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

173 Size dimension trade-offs Frobenius normal form A new type of size dimension trade-off: xI_n − A has dimension n and degree 1; det(xI_n − A) has dimension 1 and degree n. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

174 Size dimension trade-offs Frobenius normal form A new type of size dimension trade-off: xI_n − A: dimension n, degree 1. Keller-Gehrig: dimension n/2^i, degree 2^i. det(xI_n − A): dimension 1, degree n. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

175 Size dimension trade-offs Frobenius normal form A new type of size dimension trade-off: xI_n − A: dimension n, degree 1. Keller-Gehrig: dimension n/2^i, degree 2^i. New algorithm: dimension n/k, degree k. det(xI_n − A): dimension 1, degree n. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

176 Parallel exact linear algebra Outline 1 Choosing the underlying arithmetic Using boolean arithmetic Using machine word arithmetic Larger field sizes 2 Reductions and building blocks In dense linear algebra In blackbox linear algebra 3 Size dimension trade-offs Hermite normal form Frobenius normal form 4 Parallel exact linear algebra Ingredients for the parallelization Parallel dense linear algebra mod p C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

177 Parallel exact linear algebra Ingredients for the parallelization Parallelization Parallel numerical linear algebra: cost invariant w.r.t. the splitting, O(n^3); fine grain; block iterative algorithms; regular task load; numerical stability constraints. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

178 Parallel exact linear algebra Ingredients for the parallelization Parallelization Parallel numerical linear algebra: cost invariant w.r.t. the splitting, O(n^3); fine grain; block iterative algorithms; regular task load; numerical stability constraints. Exact linear algebra specificities: cost affected by the splitting, O(n^ω) for ω < 3, modular reductions; coarse grain; recursive algorithms; rank deficiencies; unbalanced task loads. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

179 Parallel exact linear algebra Ingredients for the parallelization Ingredients for the parallelization Criteria: good performance; portability across architectures; abstraction for simplicity. Challenging key point: scheduling as a plugin. Program: only describes where the parallelism lies. Runtime: scheduling & mapping, depending on the context of execution. 3 main models: 1 Parallel loop [data parallelism] 2 Fork-Join (independent tasks) [task parallelism] 3 Dependent tasks with data flow dependencies [task parallelism] C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

180 Parallel exact linear algebra Ingredients for the parallelization Data Parallelism OMP
for (int step = 0; step < 2; ++step) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        A[i] = (B[i+1] + B[i-1] + 2.0*B[i]) * 0.25;
}
Limitation: very inefficient with recursive parallel regions. Limited to iterative algorithms. No composition of routines. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

181 Parallel exact linear algebra Ingredients for the parallelization Task parallelism with fork-join Task based program: spawn + sync. Especially suited for recursive programs. OMP (since v3)
void fibonacci(long *result, long n) {
    if (n < 2)
        *result = n;
    else {
        long x, y;
        #pragma omp task shared(x) /* x must be shared: task-local variables are firstprivate by default */
        fibonacci(&x, n-1);
        fibonacci(&y, n-2);
        #pragma omp taskwait
        *result = x + y;
    }
}
C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

182 Parallel exact linear algebra Ingredients for the parallelization Tasks with dataflow dependencies Task based model avoiding synchronizations: infer the synchronizations from the read/write specifications. A task is ready for execution when all its input variables are ready. A variable is ready when it has been written. Recently supported: Athapascan [96], Kaapi [06], StarSs [07], StarPU [08], Quark [10], OMP since v4 [14]... C. Pernet Exact Linear Algebra Algorithmic July 6, / 73
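A minimal sketch of this read/write specification style using OpenMP 4 depend clauses: two producer tasks write x and y, and the consumer task is inferred to be ready only once both have been written. The function name is illustrative, and when compiled without OpenMP the pragmas are ignored and the result is unchanged:

```c
/* Three tasks: two producers, one consumer. With data-flow dependencies
 * the runtime infers that the consumer must wait for both producers. */
int dataflow_sum(void) {
    int x = 0, y = 0, z = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(x) depend(out: x)
        x = 1;                                   /* writes x */
        #pragma omp task shared(y) depend(out: y)
        y = 2;                                   /* writes y */
        #pragma omp task shared(x, y, z) depend(in: x, y) depend(out: z)
        z = x + y;                               /* ready once x and y are written */
        #pragma omp taskwait
    }
    return z;
}
```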


Fast reversion of formal power series

Fast reversion of formal power series Fast reversion of formal power series Fredrik Johansson LFANT, INRIA Bordeaux RAIM, 2016-06-29, Banyuls-sur-mer 1 / 30 Reversion of power series F = exp(x) 1 = x + x 2 2! + x 3 3! + x 4 G = log(1 + x)

More information

Parallel Integer Polynomial Multiplication Changbo Chen, Svyatoslav Parallel Integer Covanov, Polynomial FarnamMultiplication

Parallel Integer Polynomial Multiplication Changbo Chen, Svyatoslav Parallel Integer Covanov, Polynomial FarnamMultiplication Parallel Integer Polynomial Multiplication Parallel Integer Polynomial Multiplication Changbo Chen 1 Svyatoslav Covanov 2,3 Farnam Mansouri 2 Marc Moreno Maza 2 Ning Xie 2 Yuzhen Xie 2 1 Chinese Academy

More information

HPMPC - A new software package with efficient solvers for Model Predictive Control

HPMPC - A new software package with efficient solvers for Model Predictive Control - A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model

More information

GF(2 m ) arithmetic: summary

GF(2 m ) arithmetic: summary GF(2 m ) arithmetic: summary EE 387, Notes 18, Handout #32 Addition/subtraction: bitwise XOR (m gates/ops) Multiplication: bit serial (shift and add) bit parallel (combinational) subfield representation

More information

Mapping Sparse Matrix-Vector Multiplication on FPGAs

Mapping Sparse Matrix-Vector Multiplication on FPGAs Mapping Sparse Matrix-Vector Multiplication on FPGAs Junqing Sun 1, Gregory Peterson 1, Olaf Storaasli 2 1 University of Tennessee, Knoxville 2 Oak Ridge National Laboratory July 20, 2007 Outline Introduction

More information

BlockMatrixComputations and the Singular Value Decomposition. ATaleofTwoIdeas

BlockMatrixComputations and the Singular Value Decomposition. ATaleofTwoIdeas BlockMatrixComputations and the Singular Value Decomposition ATaleofTwoIdeas Charles F. Van Loan Department of Computer Science Cornell University Supported in part by the NSF contract CCR-9901988. Block

More information

Chapter 2. Divide-and-conquer. 2.1 Strassen s algorithm

Chapter 2. Divide-and-conquer. 2.1 Strassen s algorithm Chapter 2 Divide-and-conquer This chapter revisits the divide-and-conquer paradigms and explains how to solve recurrences, in particular, with the use of the master theorem. We first illustrate the concept

More information

This ensures that we walk downhill. For fixed λ not even this may be the case.

This ensures that we walk downhill. For fixed λ not even this may be the case. Gradient Descent Objective Function Some differentiable function f : R n R. Gradient Descent Start with some x 0, i = 0 and learning rate λ repeat x i+1 = x i λ f(x i ) until f(x i+1 ) ɛ Line Search Variant

More information

Polynomial multiplication and division using heap.

Polynomial multiplication and division using heap. Polynomial multiplication and division using heap. Michael Monagan and Roman Pearce Department of Mathematics, Simon Fraser University. Abstract We report on new code for sparse multivariate polynomial

More information

CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication

CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication March, 2006 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform

More information

Homomorphic Evaluation of the AES Circuit

Homomorphic Evaluation of the AES Circuit Homomorphic Evaluation of the AES Circuit IBM Research and University Of Bristol. August 22, 2012 Homomorphic Evaluation of the AES Circuit Slide 1 Executive Summary We present a working implementation

More information

Modeling and Tuning Parallel Performance in Dense Linear Algebra

Modeling and Tuning Parallel Performance in Dense Linear Algebra Modeling and Tuning Parallel Performance in Dense Linear Algebra Initial Experiences with the Tile QR Factorization on a Multi Core System CScADS Workshop on Automatic Tuning for Petascale Systems Snowbird,

More information

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic. Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEMORY INPUT-OUTPUT CONTROL DATAPATH

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 02, 03 May 2016 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 53 Most Essential Assumptions for Real-Time Systems Upper

More information

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT

More information