Exact Linear Algebra Algorithmic: Theory and Practice

1 Exact Linear Algebra Algorithmic: Theory and Practice
ISSAC'15 Tutorial
Clément Pernet
Université Grenoble Alpes, Inria, LIP-AriC
July 6, 2015
C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

2 Exact linear algebra

Matrices can be:
- Dense: store all coefficients
- Sparse: store the non-zero coefficients only
- Black-box: no access to the storage, only apply to a vector

Coefficient domains:
- Word size: integers with a priori bounds, Z/pZ for p of 32 bits
- Multi-precision: Z/pZ for p of 100, 200, 1000, 2000, ... bits
- Arbitrary precision: Z, Q
- Polynomials: K[X] for K any of the above

Several implementations for the same domain: one may better fit FFT, another linear algebra, etc.

Need to structure the design.

6 Exact linear algebra: motivations

- Computational Number Theory: CharPoly, LinSys, Echelon, over Z, Q, Z/pZ; Dense
- Graph Theory: MatMul, CharPoly, Det, over Z; Sparse
- Discrete logarithm: LinSys, over Z/pZ, p ~ 120 bits; Sparse
- Integer Factorization: NullSpace, over Z/2Z; Sparse
- Algebraic Attacks: Echelon, LinSys, over Z/pZ, p ~ 20 bits; Sparse & Dense
- List decoding of RS codes: Lattice reduction, over GF(q)[X]; Structured

Need for high performance.

8 Content

The scope of this presentation:
- not an exhaustive overview of linear algebra algorithmics and complexity improvements
- a few guidelines for the use and design of exact linear algebra in practice
- bridging theoretical algorithmic developments and practical efficiency concerns

9 Outline

1 Choosing the underlying arithmetic
  Using boolean arithmetic
  Using machine word arithmetic
  Larger field sizes
2 Reductions and building blocks
  In dense linear algebra
  In blackbox linear algebra
3 Size dimension trade-offs
  Hermite normal form
  Frobenius normal form
4 Parallel exact linear algebra
  Ingredients for the parallelization
  Parallel dense linear algebra mod p

10 Choosing the underlying arithmetic

Outline: 1 Choosing the underlying arithmetic — Using boolean arithmetic; Using machine word arithmetic; Larger field sizes

11 Achieving high practical efficiency

Most linear algebra operations boil down to (a lot of) y <- y ± a * b:
- dot-product
- matrix-matrix multiplication
- rank-1 update in Gaussian elimination
- Schur complements, ...

Efficiency relies on:
- fast arithmetic
- fast memory accesses

Here: focus on dense linear algebra.

12 Which computer arithmetic?

Many base fields/rings to support:

  Z2                      1 bit       bit-packing
  Z3, Z5, Z7              2-3 bits    bit-slicing, bit-packing
  Zp                      4-26 bits   CPU arithmetic
  Z, Q                    > 32 bits   multiprec. ints, big ints, CRT, lifting
  Zp                      > 32 bits   multiprec. ints, big ints, CRT
  GF(p^k) = Zp[X]/(Q)                 Polynomial, Kronecker, Zech log, ...

Available CPU arithmetic: boolean, integer (fixed size), floating point ... and their vectorization.

16 Using boolean arithmetic: dense linear algebra over Z2 (bit-packing)

A uint64_t packs 64 elements of Z2. Dot product (a, b):
- ^ : bit-wise XOR (+ mod 2)
- & : bit-wise AND (* mod 2)

  uint64_t t = 0;
  for (int k = 0; k < N/64; ++k)
      t ^= a[k] & b[k];
  c = parity(t);

where parity(x) returns x when x is a single bit, and parity(High(x) ^ Low(x)) otherwise.

Can be parallelized on 64 instances.

17 Tabulation

- avoid computing parities
- balance computation vs communication
- (slight) complexity improvement possible

The Four Russians method [Arlazarov, Dinic, Kronrod, Faradzev 70]
1 Compute all 2^k linear combinations of k rows of B.
  Gray code: each new line costs 1 vector add (vs k/2 on average).
2 Multiply chunks of length k of A by table look-up.

  [Figure: table of the 2^3 combinations of rows B1, B2, B3: 0, B3, B2+B3, B2, B1+B2, B1+B2+B3, B1+B3, B1]

For k = log n: O(n^3 / log n). In practice: choose k such that the table fits in L2 cache.

20 Dense linear algebra over Z2: the M4RI library [Albrecht, Bard, Hart 10]

- bit-packing
- Method of the Four Russians
- SIMD vectorization of boolean instructions (128-bit registers)
- cache optimization
- Strassen's O(n^2.81) algorithm

  n
  SIMD bool arithmetic            2.109s
  SIMD + 4 Russians               0.256s   2.829s   29.28s
  SIMD + 4 Russians + Strassen    0.257s   2.001s

Table: matrix product n x n by n x n, on an i5 SandyBridge 2.6GHz.

21 Dense linear algebra over Z3, Z5 [Boothby & Bradshaw 09]

Z3 = {0, 1, -1}
- encoded as {00, 01, 10}: add/sub in 7 bool ops
- encoded as {00, 10, 11}: add/sub in 6 bool ops

Bit-slicing: (-1, 0, 1, 0, 1, -1, -1, 0) in Z3^8
  -> (11, 00, 10, 00, 10, 11, 11, 00)
  stored as 2 words: high bits (1,0,1,0,1,1,1,0), low bits (1,0,0,0,0,1,1,0)

y <- y + x*b for x in Z3 and y, b in Z3^64: 6 boolean word ops.

Recipe for Z5:
- use redundant representations on 3 bits + bit-slicing
- integer add + bool operations
- pseudo-reduction mod 5 (4 -> 3 bits) in 8 bool ops, found by computer-assisted search.
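The 6-operation bit-sliced addition can be written down explicitly for the encoding 0 -> (h,l) = (0,0), 1 -> (1,0), -1 -> (1,1). The sequence below is one valid 6-op solution derived for this encoding (an assumption: it is not necessarily the exact sequence published by Boothby and Bradshaw, which may differ while having the same cost).

```cpp
#include <cassert>
#include <cstdint>

// Bit-sliced vectors over Z/3Z: 64 elements per pair of words, encoded as
// 0 -> (h,l)=(0,0), 1 -> (1,0), -1 -> (1,1).
struct Sliced3 { uint64_t h, l; };

// Element-wise addition in Z/3Z in 6 boolean word operations.
// (One valid sequence for this encoding, derived by hand; assumption:
// not necessarily Boothby & Bradshaw's exact sequence.)
Sliced3 add3(Sliced3 x, Sliced3 y) {
    uint64_t t1 = x.h ^ y.l;
    uint64_t t2 = x.l ^ y.h;
    uint64_t l  = t1 & t2;                   // low bit of the sum
    uint64_t h  = (x.h ^ y.h) | (t1 ^ x.l);  // high bit: 0 iff y = -x
    return {h, l};
}
```

Each call processes 64 field elements at once, which is exactly the y <- y + x*b throughput claimed on the slide.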

27 Dense linear algebra over Zp for word-size p

Delayed modular reductions:
1 Compute using integer arithmetic.
2 Reduce modulo p only when necessary.

When to reduce? Bound the values of all intermediate computations.

A priori, depending on the representation of Zp:

                                        {0, ..., p-1}                    {-(p-1)/2, ..., (p-1)/2}
  Scalar product, classic MatMul        n(p-1)^2                         n((p-1)/2)^2
  Strassen-Winograd (l rec. levels)     ((1+3^l)/2)^2 (n/2^l) (p-1)^2    9^l (n/2^l) ((p-1)/2)^2

Or: maintain locally a bounding interval on all matrices computed.
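The delayed-reduction strategy can be sketched for a dot product in double arithmetic: doubles represent integers exactly up to 2^53, so with entries in [0, p-1] up to k = 2^53 / (p-1)^2 products can be accumulated before a reduction is needed. The function name and interface are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Delayed-reduction dot product over Z/pZ in double arithmetic (sketch;
// entries assumed already reduced in [0, p-1]). All intermediate values
// stay below 2^53, so every double operation is exact.
int64_t dot_modp(const std::vector<double>& a, const std::vector<double>& b,
                 int64_t p) {
    const double two53 = 9007199254740992.0;               // 2^53
    const size_t k = size_t(two53 / double((p - 1) * (p - 1)));
    double acc = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        acc += a[i] * b[i];
        if ((i + 1) % k == 0)
            acc = std::fmod(acc, double(p));               // delayed reduction
    }
    return int64_t(std::fmod(acc, double(p)));
}
```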

31 Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 Use the CPU's integer arithmetic units (+ vectorization).
  y += a * b: correct if ab + y < 2^63, e.g. a, b < 2^31.

  Scalar:                           Vectorized:
    movq (%rax,%rdx,8), %rax          vpmuludq %xmm3, %xmm0, %xmm0
    imulq -56(%rbp), %rax             vpaddq %xmm2, %xmm0, %xmm0
    addq %rax, %rcx                   vpsllq $32, %xmm0, %xmm0
    movq -80(%rbp), %rax

2 Use the CPU's floating point units (+ vectorization).
  y += a * b: correct if ab + y < 2^53, e.g. a, b < 2^26.

  Scalar:                           Vectorized:
    movsd (%rax,%rdx,8), %xmm0        vinsertf128 $0x1, 16(%rcx,%rax), %ymm0,
    mulsd -56(%rbp), %xmm0            vmulpd %ymm1, %ymm0, %ymm0
    addsd %xmm0, %xmm1                vaddpd (%rsi,%rax), %ymm0, %ymm0
    movq %xmm1, %rax                  vmovapd %ymm0, (%rsi,%rax)

37 Exploiting in-core parallelism

Ingredients:
- SIMD (Single Instruction Multiple Data): 1 arithmetic unit acting on a vector of data.
  MMX: 64 bits; SSE: 128 bits; AVX: 256 bits; AVX-512: 512 bits.
  [Figure: 4 x 64 = 256 bits: x[0..3] + y[0..3] -> x[0]+y[0], x[1]+y[1], x[2]+y[2], x[3]+y[3]]
- Pipeline: amortize the latency of an operation when used repeatedly.
  Throughput of 1 op/cycle for all arithmetic ops considered here.
- Execution unit parallelism: multiple arithmetic units acting simultaneously on distinct registers.

40 SIMD and vectorization: Intel Sandybridge micro-architecture

[Diagram: scheduler dispatching to Port 0 (FMUL, Int MUL), Port 1 (FADD, Int ADD), Port 5 (Int ADD); FP units 4 x 64 = 256 bits, integer units 2 x 64 = 128 bits]

Performs at every clock cycle:
- 1 floating pt. mul (x4) and 1 floating pt. add (x4), or
- 1 integer mul (x2) and 2 integer adds (x2)

41 SIMD and vectorization: Intel Haswell micro-architecture

[Diagram: scheduler dispatching to Port 0 (FMA, Int MUL), Port 1 (FMA, Int ADD), Port 5 (Int ADD); all units 4 x 64 = 256 bits]

Performs at every clock cycle:
- 2 floating pt. mul & add (x4), or
- 1 integer mul (x4) and 2 integer adds (x4)

FMA: fused multiply & accumulate, c += a * b.

42 SIMD and vectorization: AMD Bulldozer micro-architecture

[Diagram: scheduler dispatching to Ports 0-3 (FMA, Int MUL; FMA, Int ADD; Int ADD); units 2 x 64 = 128 bits]

Performs at every clock cycle:
- 2 floating pt. mul & add (x2), or
- 1 integer mul (x2) and 2 integer adds (x2)

FMA: fused multiply & accumulate, c += a * b.

43 SIMD and vectorization: Intel Nehalem micro-architecture

[Diagram: scheduler dispatching to Port 0 (FMUL, Int MUL), Port 1 (FADD, Int ADD), Port 5 (Int ADD); units 2 x 64 = 128 bits]

Performs at every clock cycle:
- 1 floating pt. mul (x2) and 1 floating pt. add (x2), or
- 1 integer mul (x2) and 2 integer adds (x2)

44 Summary: 64 bits AXPY throughput

[Table: register size, # adders, # multipliers, # FMA, # daxpy/cycle, CPU freq. (GHz), speed of light (Gfops) and speed in practice (Gfops), for the INT and FP units of Intel Haswell (AVX2), Intel Sandybridge (AVX1), AMD Bulldozer (FMA4), Intel Nehalem (SSE4) and AMD K10 (SSE4a)]

Speed of light: CPU freq. x (# daxpy / cycle) x 2

54 Computing over fixed size integers: resources

Micro-architecture bible: Agner Fog's software optimization resources.

Experiments:
- dgemm (double): OpenBLAS
- igemm (int64_t): Eigen & FFLAS-FFPACK [linalg.org/projects/fflas-ffpack]

55 Integer packing

32 bits: half the precision, twice the speed.

[Bar chart: Gfops of matrix product for double, float, int64_t and int32_t, on Intel SandyBridge, Intel Haswell and AMD Bulldozer]

56 Computing over fixed size integers

[Plots: speed (Gfops) and bit speed (Gbops) of fgemm C = A x B, n = 2000, for float mod p, double mod p and int64 mod p, as a function of the bitsize of p; SandyBridge i5-3320M @ 3.3GHz]

Take home message:
- Floating pt. arithmetic delivers the highest speed (except in corner cases).
- 32 bits is twice as fast as 64 bits.
- Best bit computation throughput with double precision floating points.

58 Larger finite fields: log2 p > 32

As before:
1 Use adequate integer arithmetic.
2 Reduce modulo p only when necessary.

Which integer arithmetic?
1 big integers (GMP)
2 fixed size multiprecision (Givaro-RecInt)
3 Residue Number Systems (Chinese Remainder Theorem), using moduli delivering optimum bitspeed

  log2 p
  GMP       58.1s    94.6s   140s
  RecInt    5.7s     28.6s   837s
  RNS       0.785s   1.42s   1.88s

n = 1000, on an Intel SandyBridge.

60 In practice

[Plots: speed (Gfops) and bit speed (Gbops) of fgemm C = A x B, n = 2000, for float mod p, double mod p, int64 mod p, RNS mod p and RecInt mod p, as a function of the bitsize of p]

63 Reductions and building blocks

Outline: 2 Reductions and building blocks — In dense linear algebra; In blackbox linear algebra

64 Reductions to building blocks

Huge number of algorithmic variants for a given computation in O(n^3).

Need to structure the design of a set of routines:
- focus the tuning effort on a single routine
- some operations deliver better efficiency:
  - in practice: memory access pattern
  - in theory: better algorithms

65 Memory access pattern

The memory wall: communication speed improves more slowly than arithmetic.

[Figure: CPU arithmetic throughput vs memory speed over time]

Deep memory hierarchy: RAM -> L3 -> L2 -> L1 -> CPU.
Need to overlap communications with computation.

Design of BLAS 3 [Dongarra et al. 87]:
- group all operations into matrix products (gemm): work O(n^3) >> data O(n^2)
- MatMul has become a building block in practice.
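The blocking idea behind BLAS 3 can be sketched in a few lines: work on square tiles so each tile is reused many times while resident in cache, which is what turns the O(n^3)-work-on-O(n^2)-data ratio into an actual speedup. The function name and the tile size BS are assumptions for illustration; production BLAS tune BS per cache level.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Cache-blocked matrix product C += A * B (row-major n x n matrices).
// Each BS x BS tile of A, B and C is touched repeatedly while it fits in
// cache, overlapping memory traffic with arithmetic.
void gemm_blocked(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n, int BS = 64) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < std::min(ii + BS, n); ++i)
                    for (int k = kk; k < std::min(kk + BS, n); ++k) {
                        double aik = A[i * n + k];   // reused across the j loop
                        for (int j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```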

68 Sub-cubic linear algebra

< 1969: O(n^3) for everyone (Gauss, Householder, Danilevskĭi, etc.)

Matrix multiplication: O(n^omega)
- [Strassen 69]: O(n^2.807)
- [Schönhage 81]: O(n^2.52)
- [Coppersmith, Winograd 90]: O(n^2.376)
- [Le Gall 14]: O(n^2.3729)

Other operations:
- [Strassen 69]: Inverse in O(n^omega)
- [Schönhage 72]: QR in O(n^omega)
- [Bunch, Hopcroft 74]: LU in O(n^omega)
- [Ibarra et al. 82]: Rank in O(n^omega)
- [Keller-Gehrig 85]: CharPoly in O(n^omega log n)

MatMul has become a building block in theoretical reductions.
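Strassen's trade of 8 multiplications for 7 can be shown at the base case: the classical seven products M1..M7 on a 2x2 matrix. Applied recursively on blocks instead of scalars, this yields the O(n^2.807) bound. The scalar 2x2 version below is a sketch of one recursion level only.

```cpp
#include <array>
#include <cassert>

// One level of Strassen's recursion on 2x2 matrices: 7 multiplications
// instead of 8. Entries are row-major: [a b; c d] = {a, b, c, d}.
using M2 = std::array<long long, 4>;

M2 strassen2(const M2& A, const M2& B) {
    long long m1 = (A[0] + A[3]) * (B[0] + B[3]);
    long long m2 = (A[2] + A[3]) * B[0];
    long long m3 = A[0] * (B[1] - B[3]);
    long long m4 = A[3] * (B[2] - B[0]);
    long long m5 = (A[0] + A[1]) * B[3];
    long long m6 = (A[2] - A[0]) * (B[0] + B[1]);
    long long m7 = (A[1] - A[3]) * (B[2] + B[3]);
    // Recombine: C11, C12, C21, C22
    return {{m1 + m4 - m5 + m7, m3 + m5, m2 + m4, m1 - m2 + m3 + m6}};
}
```

In an exact setting the additions and subtractions are exact, so unlike the numerical case there is no stability penalty, only the growth of intermediate values bounded on the earlier slides.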

72 Reductions: theory and practice

[Reduction tree: HNF(Z), SNF(Z), Det(Z) and LinSys(Z) reduce to MM(Z); LinSys(Zp), Det(Zp), MinPoly(Zp) and CharPoly(Zp) reduce via LU(Zp) and TRSM(Zp) to MM(Zp)]

Common mistrust:
- fast linear algebra is never faster
- numerically unstable

Lucky coincidence:
- same building block in theory and in practice
- reduction trees are still relevant

Road map towards efficiency in practice:
1 Tune the MatMul building block.
2 Tune the reductions.
3 New reductions.

76 Reductions and building blocks In dense linear algebra
Putting it together: MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: k (p−1)^2 < 2^mantissa

77 Reductions and building blocks In dense linear algebra
Putting it together: MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: k (p−1)^2 < 2^53
Fastest integer arithmetic: double
Cache optimizations: numerical BLAS
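The delayed-reduction ingredient can be illustrated with a small Python sketch (illustrative only, not FFLAS-FFPACK code; `max_delayed_terms` and `dot_mod_p` are hypothetical names): as long as k·(p−1)² fits in the floating-point mantissa, a dot product of k residues can be accumulated exactly with a single reduction at the end.

```python
def max_delayed_terms(p, mantissa_bits=53):
    """Largest k such that k*(p-1)^2 < 2^mantissa_bits (exact accumulation)."""
    return (2**mantissa_bits - 1) // (p - 1)**2

def dot_mod_p(u, v, p):
    """Dot product over Z/pZ with one modular reduction per chunk of k terms,
    instead of one reduction per multiplication."""
    k = max_delayed_terms(p)
    acc = 0
    for i in range(0, len(u), k):
        # each chunk stays below 2^53, so it would also be exact in a double
        chunk = sum(a * b for a, b in zip(u[i:i+k], v[i:i+k]))
        acc = (acc + chunk) % p
    return acc
```

For p = 83 the chunk length is in the billions, which is why a single reduction per dot product suffices in practice for moderate dimensions.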

78 Reductions and building blocks In dense linear algebra
Putting it together: MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: 9^l ⌈k/2^l⌉ ((p−1)/2)^2 < 2^53
Fastest integer arithmetic: double
Cache optimizations: numerical BLAS
Strassen-Winograd: 6n^2.807

79 Reductions and building blocks In dense linear algebra
Putting it together: MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: 9^l ⌈k/2^l⌉ ((p−1)/2)^2 < 2^53
Fastest integer arithmetic: double
Cache optimizations: numerical BLAS
Strassen-Winograd: 6n^2.807, with memory efficient schedules [Boyer, Dumas, P. and Zhou 09]
Tradeoffs: extra memory / overwriting input / leading constant; fully in-place in 7.2n^2.807

80 Reductions and building blocks In dense linear algebra
Sequential Matrix Multiplication (i5-3320M at 2.6 GHz with AVX)
[Plot: n^3/time/10^9 (Gfops equiv.) vs matrix dimension; curve: OpenBLAS sgemm]

81 Reductions and building blocks In dense linear algebra
Sequential Matrix Multiplication (i5-3320M at 2.6 GHz with AVX)
[Plot: n^3/time/10^9 (Gfops equiv.) vs matrix dimension; curves: FFLAS fgemm over Z/83Z, OpenBLAS sgemm]
p = 83: 1 mod / … mul.

82 Reductions and building blocks In dense linear algebra
Sequential Matrix Multiplication (i5-3320M at 2.6 GHz with AVX)
[Plot: n^3/time/10^9 (Gfops equiv.) vs matrix dimension; curves: FFLAS fgemm over Z/83Z, FFLAS fgemm over Z/821Z, OpenBLAS sgemm]
p = 83: 1 mod / … mul.
p = 821: 1 mod / 100 mul.

83 Reductions and building blocks In dense linear algebra
Sequential Matrix Multiplication (i5-3320M at 2.6 GHz with AVX)
[Plot: n^3/time/10^9 (Gfops equiv.) vs matrix dimension; curves: FFLAS fgemm over Z/83Z, FFLAS fgemm over Z/821Z, OpenBLAS sgemm, FFLAS fgemm over Z/…Z, FFLAS fgemm over Z/…Z, OpenBLAS dgemm]
p = 83: 1 mod / … mul.
p = 821: 1 mod / 100 mul.
p = …: 1 mod / … mul.
p = …: 1 mod / 100 mul.

84 Reductions and building blocks In dense linear algebra
Reductions in dense linear algebra
LU decomposition: block recursive algorithm reduces to MatMul, O(n^ω)

n             | …      | …     | …      | …
LAPACK-dgetrf | 0.024s | 2.01s | 14.88s | 48.78s
fflas-ffpack  | 0.058s | 2.46s | 16.08s | 47.47s

Intel Haswell E5-… at … GHz using OpenBLAS

85 Reductions and building blocks In dense linear algebra
Reductions in dense linear algebra
LU decomposition: block recursive algorithm reduces to MatMul, O(n^ω)

n             | …      | …     | …      | …
LAPACK-dgetrf | 0.024s | 2.01s | 14.88s | 48.78s
fflas-ffpack  | 0.058s | 2.46s | 16.08s | 47.47s

Intel Haswell E5-… at … GHz using OpenBLAS

Characteristic polynomial: a new reduction to matrix multiplication in O(n^ω).

n            | …      | …      | …      | …
magma-v…     | …s     | 24.28s | 332.7s | 2497s
fflas-ffpack | 0.532s | 2.936s | 32.71s | 219.2s

Intel Ivy-Bridge i… at … GHz using OpenBLAS


87 Reductions and building blocks In dense linear algebra
The case of Gaussian elimination. Which reduction to MatMul?
Slab iterative (LAPACK)
Slab recursive (FFLAS-FFPACK)
Tile iterative (PLASMA)
Tile recursive (FFLAS-FFPACK)

88 Reductions and building blocks In dense linear algebra
The case of Gaussian elimination. Which reduction to MatMul?
Slab recursive (FFLAS-FFPACK)
Tile recursive (FFLAS-FFPACK)
Sub-cubic complexity: recursive algorithms

89 Reductions and building blocks In dense linear algebra
The case of Gaussian elimination. Which reduction to MatMul?
Tile recursive (FFLAS-FFPACK)
Sub-cubic complexity: recursive algorithms
Data locality

90–106 Reductions and building blocks In dense linear algebra
Block algorithms: Tiled Iterative, Slab Recursive, Tiled Recursive
[Animation frames stepping each variant through its three kernels:]
getrf: A → L, U
trsm: B ← B U^{-1}, B ← L^{-1} B
gemm: C ← C − A·B
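The getrf/trsm/gemm pattern of a right-looking tiled elimination can be sketched over Z/pZ. This is an illustrative sketch, not FFLAS-FFPACK code: it assumes every pivot encountered is nonzero (no pivoting), which the actual algorithms do not.

```python
def block_lu(A, p, b=2):
    """Right-looking tiled LU in place over Z/pZ (p prime, pivots assumed
    nonzero): getrf on the diagonal block, trsm on the panels, gemm on the
    trailing matrix. After the call, A holds L (unit lower) and U."""
    n = len(A)
    for k in range(0, n, b):
        e = min(k + b, n)
        # getrf: factor the diagonal block A[k:e, k:e]
        for c in range(k, e):
            inv = pow(A[c][c], p - 2, p)
            for i in range(c + 1, e):
                A[i][c] = A[i][c] * inv % p
                for j in range(c + 1, e):
                    A[i][j] = (A[i][j] - A[i][c] * A[c][j]) % p
        # trsm: A[k:e, e:n] <- L_kk^{-1} * A[k:e, e:n] (unit lower, forward subst.)
        for j in range(e, n):
            for i in range(k, e):
                s = A[i][j]
                for c in range(k, i):
                    s -= A[i][c] * A[c][j]
                A[i][j] = s % p
        # trsm: A[e:n, k:e] <- A[e:n, k:e] * U_kk^{-1}
        for i in range(e, n):
            for jj in range(k, e):
                s = A[i][jj]
                for c in range(k, jj):
                    s -= A[i][c] * A[c][jj]
                A[i][jj] = s * pow(A[jj][jj], p - 2, p) % p
        # gemm: A[e:n, e:n] <- A[e:n, e:n] - A[e:n, k:e] * A[k:e, e:n]
        for i in range(e, n):
            for j in range(e, n):
                s = A[i][j]
                for c in range(k, e):
                    s -= A[i][c] * A[c][j]
                A[i][j] = s % p
    return A
```

Almost all the work lands in the gemm update, which is exactly why the MatMul building block dominates in practice.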

107 Reductions and building blocks In dense linear algebra
Counting modular reductions (block size k ≥ 1):
Tiled Iter. Right looking: n^3/(3k) + O(n^2)
Tiled Iter. Left looking:  (2 − 1/(2k)) n^2 + O(kn)
Tiled Iter. Crout:         (5/2 − 1/(2k)) n^2 + O(kn)

108 Reductions and building blocks In dense linear algebra
Counting modular reductions (block size k ≥ 1, and the unblocked case k = 1):
Tiled Iter. Right looking: n^3/(3k) + O(n^2)
Tiled Iter. Left looking:  (2 − 1/(2k)) n^2 + O(kn)
Tiled Iter. Crout:         (5/2 − 1/(2k)) n^2 + O(kn)
k = 1:
Iter. Right looking: n^3/3 + O(n^2)
Iter. Left looking:  (3/2) n^2 + O(n)
Iter. Crout:         2 n^2 + O(n)

109 Reductions and building blocks In dense linear algebra
Counting modular reductions (block size k ≥ 1, unblocked case k = 1, and recursive variants):
Tiled Iter. Right looking: n^3/(3k) + O(n^2)
Tiled Iter. Left looking:  (2 − 1/(2k)) n^2 + O(kn)
Tiled Iter. Crout:         (5/2 − 1/(2k)) n^2 + O(kn)
k = 1:
Iter. Right looking: n^3/3 + O(n^2)
Iter. Left looking:  (3/2) n^2 + O(n)
Iter. Crout:         2 n^2 + O(n)
Tiled Recursive: 2 n^2 − n log_2 n + O(n)
Slab Recursive:  O(n^2 log n)

110 Reductions and building blocks In dense linear algebra
Impact in practice: sequential LU decomposition variants on one core
[Plot: (2/3)n^3/time/10^9 vs matrix dimension; curves: Right(k=212), Left(k=212), Crout(k=212), Tiled-Rec, Slab-Rec]
As anticipated: Right-looking < Crout < Left-looking

111 Reductions and building blocks In dense linear algebra
Impact in practice: sequential LU decomposition variants on one core
[Plot: (2/3)n^3/time/10^9 vs matrix dimension; curves: Right(k=n/3), Left(k=n/3), Crout(k=n/3), Tiled-Rec, Slab-Rec]
As anticipated: Right-looking < Crout < Left-looking

112 Reductions and building blocks In dense linear algebra
Dealing with rank deficiencies and computing rank profiles
Rank profiles: indices of the first linearly independent columns (resp. rows)
major invariant of a matrix (echelon form)
Gröbner basis computations (Macaulay matrix)
Krylov methods
Gaussian elimination revealing echelon forms:
[Ibarra, Moran and Hui 82]: A = LSP
[Keller-Gehrig 85]: XA = R
[Jeannerod, P. and Storjohann 13]: A = PLE

113 Reductions and building blocks In dense linear algebra
Computing rank profiles
Lessons learned (or what we thought was necessary):
treat rows in order
exhaust all columns before considering the next row
slab block splitting required (recursive or iterative)
similar to partial pivoting

114 Reductions and building blocks In dense linear algebra
Computing rank profiles
Lessons learned (or what we thought was necessary):
treat rows in order
exhaust all columns before considering the next row
slab block splitting required (recursive or iterative)
similar to partial pivoting
Tiled recursive PLUQ [Dumas, P. and Sultan 13,15]:
1 Generalized to handle rank deficiency: 4 recursive calls, in-place computation
2 Pivoting strategies exist to recover rank profiles and echelon forms

115–129 Reductions and building blocks In dense linear algebra
A tiled recursive algorithm [Dumas, P. and Sultan 13]
[Animation frames:]
1 2×2 block splitting
2 Recursive call
3 TRSM: B ← B U^{-1}; TRSM: B ← L^{-1} B
4 MatMul: C ← C − A·B (three updates)
5 2 independent recursive calls
6 TRSM: B ← B U^{-1}; TRSM: B ← L^{-1} B; MatMul: C ← C − A·B (three updates)
7 Recursive call
8 Puzzle game (block cyclic rotations)

130 Reductions and building blocks In dense linear algebra
A tiled recursive algorithm [Dumas, P. and Sultan 13]
O(mn r^{ω−2}) (degenerating to (2/3)n^3 in the full-rank case)
computes col. and row rank profiles of all leading sub-matrices
fewer modular reductions than slab algorithms
rank deficiency introduces parallelism

131 Reductions and building blocks In dense linear algebra
Computing all rank profiles at once [Dumas, P. and Sultan, ISSAC 15 (Thursday 3PM)]
Definition (Rank Profile matrix): the unique R_A ∈ {0,1}^{m×n} such that every (i,j)-leading sub-matrix of R_A has the same rank as the corresponding (i,j)-leading sub-matrix of A.

132–134 Reductions and building blocks In dense linear algebra
Computing all rank profiles at once [Dumas, P. and Sultan, ISSAC 15 (Thursday 3PM)]
Definition (Rank Profile matrix): the unique R_A ∈ {0,1}^{m×n} such that every (i,j)-leading sub-matrix of R_A has the same rank as the corresponding (i,j)-leading sub-matrix of A.
Theorem: RowRP and ColRP can be read directly on R(A); the same holds for any (i,j)-leading submatrix.
[Examples on successive leading submatrices of the same A: RowRP = {1}, ColRP = {1}; RowRP = {1,2}, ColRP = {1,3}; RowRP = {1,4}, ColRP = {1,2}]

135–138 Reductions and building blocks In dense linear algebra
Computing all rank profiles at once [Dumas, P. and Sultan, ISSAC 15 (Thursday 3PM)]
Definition (Rank Profile matrix): the unique R_A ∈ {0,1}^{m×n} such that every (i,j)-leading sub-matrix of R_A has the same rank as the corresponding (i,j)-leading sub-matrix of A.
Theorem: RowRP and ColRP can be read directly on R(A); the same holds for any (i,j)-leading submatrix.
A = PLUQ = P [L 0; M I_{m−r}] [I_r 0; 0 0] [U V; 0 I_{n−r}] Q
  = (P [L 0; M I_{m−r}] P^T) (P [I_r 0; 0 0] Q) (Q^T [U V; 0 I_{n−r}] Q)
  = L · Π_{P,Q} · U
With appropriate pivoting: Π_{P,Q} = R(A).
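The defining property of R(A) can be checked by brute force: compute the rank of every leading submatrix over Q and take rank differences by inclusion-exclusion. This is an illustrative sketch (far from the O(mn r^{ω−2}) algorithm of the paper); the function names are ours.

```python
from fractions import Fraction

def rank(M):
    """Rank over Q by exact Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for j in range(len(M[0]) if M else 0):
        piv = next((i for i in range(r, len(M)) if M[i][j]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][j] / M[r][j]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def rank_profile_matrix(A):
    """Brute-force R(A): entry (i,j) is the rank jump of the leading
    submatrices, r(i+1,j+1) - r(i,j+1) - r(i+1,j) + r(i,j) in {0,1}."""
    m, n = len(A), len(A[0])
    r = [[rank([row[:j] for row in A[:i]]) for j in range(n + 1)]
         for i in range(m + 1)]
    return [[r[i+1][j+1] - r[i][j+1] - r[i+1][j] + r[i][j]
             for j in range(n)] for i in range(m)]
```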

139–142 Reductions and building blocks In blackbox linear algebra
Reductions in black box linear algebra: A ∈ K^{n×n}, v ∈ K^n ↦ Av ∈ K^n
Matrix-vector product: building block, costs E(n)
Minimal polynomial: [Wiedemann 86], iterative Krylov/Lanczos methods, O(nE(n) + n^2)
Rank, Det, Solve: [Chen et al. 02], reduces to MinPoly + preconditioners, O~(nE(n) + n^2)
Characteristic polynomial: [Dumas, P. and Saunders 09], reduces to MinPoly, Rank, ...
[Reduction diagram: CharPoly(Z), MinPoly(Z) [Dumas, P. and Saunders 09]; CharPoly(Z_p) [Dumas, P. and Saunders 09], Det(Z_p), LinSys(Z_p), Rank(Z_p) [Chen et al. 02; Kaltofen, Saunders 91]; MinPoly(Z_p) [Wiedemann 86]; MatVecProd(Z_p)]
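Wiedemann's reduction can be sketched: project the Krylov sequence u^T A^i v and recover its minimal linear recurrence with Berlekamp-Massey; with random u, v this is the minimal polynomial of A with high probability. Illustrative Python sketch (our function names; deterministic u, v may be supplied for testing).

```python
import random

def berlekamp_massey(s, p):
    """Shortest linear recurrence of s mod p: returns C = [1, c1, ..., cL]
    with sum_j C[j]*s[i-j] == 0 (mod p) for all i >= L."""
    C, B, L, m, b = [1], [1], 0, 1, 1
    for i in range(len(s)):
        d = sum(C[j] * s[i - j] for j in range(L + 1)) % p
        if d == 0:
            m += 1
            continue
        T = C[:]
        coef = d * pow(b, p - 2, p) % p
        if len(B) + m > len(C):
            C = C + [0] * (len(B) + m - len(C))
        for j in range(len(B)):
            C[j + m] = (C[j + m] - coef * B[j]) % p
        if 2 * L <= i:
            L, B, b, m = i + 1 - L, T, d, 1
        else:
            m += 1
    return C[:L + 1]

def wiedemann_minpoly(apply_A, n, p, u=None, v=None):
    """Minimal generating polynomial of the projected Krylov sequence
    u^T A^i v (w.h.p. the minimal polynomial of the blackbox apply_A).
    Returns monic coefficients, low-to-high."""
    rng = random.Random(42)
    u = u or [rng.randrange(1, p) for _ in range(n)]
    v = v or [rng.randrange(1, p) for _ in range(n)]
    s, w = [], v
    for _ in range(2 * n):          # 2n terms suffice for degree <= n
        s.append(sum(a * b for a, b in zip(u, w)) % p)
        w = apply_A(w)              # the only access to the matrix
    return list(reversed(berlekamp_massey(s, p)))
```

The matrix is only ever touched through `apply_A`, which is the whole point of the blackbox model: the cost is 2n products plus O(n^2).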

143 Size dimension trade-offs Outline 1 Choosing the underlying arithmetic Using boolean arithmetic Using machine word arithmetic Larger field sizes 2 Reductions and building blocks In dense linear algebra In blackbox linear algebra 3 Size dimension trade-offs Hermite normal form Frobenius normal form 4 Parallel exact linear algebra Ingredients for the parallelization Parallel dense linear algebra mod p C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

144 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Multimodular methods:
over K[X]: evaluation–interpolation
over Z, Q: Chinese Remainder Theorem
Cost = algebraic cost × size(output)
avoids coefficient blow-up
uniform (worst case) cost for all arithmetic ops

145 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Multimodular methods:
over K[X]: evaluation–interpolation
over Z, Q: Chinese Remainder Theorem
Cost = algebraic cost × size(output)
avoids coefficient blow-up
uniform (worst case) cost for all arithmetic ops
Example. Hadamard's bound: |det(A)| ≤ (||A|| √n)^n.
LinSys_Z(n) = O(n^ω · n(log n + log ||A||))

146 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Multimodular methods:
over K[X]: evaluation–interpolation
over Z, Q: Chinese Remainder Theorem
Cost = algebraic cost × size(output)
avoids coefficient blow-up
uniform (worst case) cost for all arithmetic ops
Example. Hadamard's bound: |det(A)| ≤ (||A|| √n)^n.
LinSys_Z(n) = O(n^ω · n(log n + log ||A||)) = O~(n^{ω+1} log ||A||)
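The CRT approach can be sketched for the determinant: compute det A modulo several word-size primes, then recombine once the product of the primes exceeds twice the Hadamard bound. Illustrative Python sketch (our function names; prime selection is left to the caller).

```python
from math import isqrt, prod

def det_mod_p(A, p):
    """Determinant mod p by Gaussian elimination over Z/pZ (p prime)."""
    A = [[x % p for x in row] for row in A]
    n, det = len(A), 1
    for j in range(n):
        piv = next((i for i in range(j, n) if A[i][j]), None)
        if piv is None:
            return 0
        if piv != j:
            A[j], A[piv] = A[piv], A[j]
            det = -det                     # a row swap flips the sign
        det = det * A[j][j] % p
        inv = pow(A[j][j], p - 2, p)
        for i in range(j + 1, n):
            f = A[i][j] * inv % p
            for k in range(j, n):
                A[i][k] = (A[i][k] - f * A[j][k]) % p
    return det % p

def det_crt(A, primes):
    """det(A) over Z by CRT; the primes must multiply past 2x Hadamard."""
    H = prod(isqrt(sum(x * x for x in row)) + 1 for row in A)  # Hadamard bound
    M = prod(primes)
    assert M > 2 * H, "not enough primes for the Hadamard bound"
    d, m = 0, 1
    for p in primes:                       # incremental CRT recombination
        t = (det_mod_p(A, p) - d) * pow(m, -1, p) % p
        d, m = d + m * t, m * p
    return d if d <= M // 2 else d - M     # lift to the symmetric range
```

This is the "Cost = algebraic cost × size(output)" pattern: one O(n^3) elimination per prime, with the number of primes dictated by the output size bound.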

147 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Lifting techniques. p-adic lifting [Moenck & Carter 79, Dixon 82]:
one computation over Z_p
iterative lifting of the solution to Z, Q
Example: LinSys_Z(n) = O(n^3 log^{1+ε} ||A||)

148 Size dimension trade-offs
Size–dimension trade-offs. Computing with coefficients of varying size: Z, Q, K[X], ...
Lifting techniques. p-adic lifting [Moenck & Carter 79, Dixon 82]:
one computation over Z_p
iterative lifting of the solution to Z, Q
High order lifting [Storjohann 02,03]:
fewer iteration steps, larger dimension in the lifting
Example: LinSys_Z(n) = O~(n^ω log ||A||)
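The p-adic lifting loop can be sketched: one matrix inversion mod p up front, then k cheap correction steps, each solving mod p and dividing the residual by p exactly. Illustrative Python sketch (our function names); the real Dixon solver would finish with rational reconstruction of x ∈ Q^n from its p-adic expansion, which is omitted here.

```python
def mat_inv_mod(A, p):
    """Inverse of A mod p by Gauss-Jordan (A assumed invertible mod p)."""
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for j in range(n):
        piv = next(i for i in range(j, n) if M[i][j])
        M[j], M[piv] = M[piv], M[j]
        inv = pow(M[j][j], p - 2, p)
        M[j] = [x * inv % p for x in M[j]]
        for i in range(n):
            if i != j and M[i][j]:
                f = M[i][j]
                M[i] = [(a - f * b) % p for a, b in zip(M[i], M[j])]
    return [row[n:] for row in M]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dixon_lift(A, b, p, k):
    """p-adic lifting: returns x with A x = b (mod p^k), using one inversion
    mod p and k correction steps of cost O(n^2) arithmetic each."""
    C = mat_inv_mod(A, p)
    x, r, pk = [0] * len(b), b[:], 1
    for _ in range(k):
        xi = [v % p for v in matvec(C, r)]            # next p-adic digit
        x = [xj + pk * d for xj, d in zip(x, xi)]
        r = [(rj - aj) // p for rj, aj in zip(r, matvec(A, xi))]  # exact division
        pk *= p
    return x
```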

149 Size dimension trade-offs
Improving time complexities
Matrix multiplication: door to fast linear algebra over Z:
O(n^ω M(log ||A||)) = O~(n^ω log ||A||)

150 Size dimension trade-offs
Improving time complexities
Matrix multiplication: door to fast linear algebra over Z:
O(n^ω M(log ||A||)) = O~(n^ω log ||A||)
Equivalence over Z or K[X]: Hermite normal form
[Kannan & Bachem 79]: polynomial time
[Chou & Collins 82]: O~(n^6 log ||A||)
[Domich et al. 87], [Iliopoulos 89]: O~(n^4 log ||A||)
[Micciancio & Warinschi 01]: O~(n^5 log^2 ||A||), O~(n^3 log ||A||) heuristic
[Storjohann & Labahn 96]: O~(n^{ω+1} log ||A||)
[Gupta & Storjohann 11]: O~(n^3 log ||A||)

151 Size dimension trade-offs
Improving time complexities
Matrix multiplication: door to fast linear algebra over Z:
O(n^ω M(log ||A||)) = O~(n^ω log ||A||)
Equivalence over Z or K[X]: Hermite normal form
[Kannan & Bachem 79]: polynomial time
[Chou & Collins 82]: O~(n^6 log ||A||)
[Domich et al. 87], [Iliopoulos 89]: O~(n^4 log ||A||)
[Micciancio & Warinschi 01]: O~(n^5 log^2 ||A||), O~(n^3 log ||A||) heuristic
[Storjohann & Labahn 96]: O~(n^{ω+1} log ||A||)
[Gupta & Storjohann 11]: O~(n^3 log ||A||)
Similarity over a field: Frobenius normal form
[Giesbrecht 93]: O~(n^ω) probabilistic
[Storjohann 00]: O~(n^ω) deterministic
[P. & Storjohann 07]: O(n^ω) probabilistic


153 Size dimension trade-offs
Building blocks and reductions, in brief
Reductions to a building block:
Matrix Mult, block recursive: Σ_{i=1}^{log n} n (n/2^i)^{ω−1} = O(n^ω) (Gauss, REF)
Matrix Mult, iterative: Σ_{k=1}^{n} k (n/k)^ω = O(n^ω) (Frobenius)
Linear Sys over Z (Hermite normal form)
Size/dimension compromises:
Hermite normal form: rank 1 updates reducing the determinant
Frobenius normal form: degree k, dimension n/k, for k = 1 ... n

154 Size dimension trade-offs Hermite normal form
Hermite normal form: naive algorithm
Reduced echelon form over a ring:
H = [p_1 x_{1,2} x_{1,3}; 0 p_2 x_{2,3}; 0 0 p_3], with 0 ≤ x_{i,j} < p_j.
for i = 1 ... n do
  (g, t_i, ..., t_n) = xgcd(A_{i,i}, A_{i+1,i}, ..., A_{n,i})
  L_i ← Σ_{j=i}^{n} t_j L_j
  for j = i+1 ... n do L_j ← L_j − (A_{j,i}/g) L_i   (eliminate)
  for j = 1 ... i−1 do L_j ← L_j − ⌊A_{j,i}/g⌋ L_i   (reduce)
end for
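The naive algorithm can be sketched with pairwise extended gcds instead of the slide's n-ary xgcd (equivalent up to unimodular row operations). Illustrative Python sketch, assuming A nonsingular over Z; the function names are ours.

```python
def xgcd(a, b):
    """Extended gcd: (g, s, t) with s*a + t*b = g."""
    s0, s1, t0, t1 = 1, 0, 0, 1
    while b:
        q, a, b = a // b, b, a % b
        s0, s1 = s1, s0 - q * s1
        t0, t1 = t1, t0 - q * t1
    return a, s0, t0

def hnf(A):
    """Row Hermite form of a nonsingular integer matrix, naive algorithm:
    xgcd-combine rows to create each pivot, zero below, reduce above."""
    A = [row[:] for row in A]
    n = len(A)
    for i in range(n):
        for j in range(i + 1, n):           # pairwise xgcd with rows below
            if A[j][i]:
                g, s, t = xgcd(A[i][i], A[j][i])
                u, v = A[i][i] // g, A[j][i] // g
                # [[s, t], [-v, u]] has determinant 1, so rows stay a basis
                A[i], A[j] = ([s*a + t*b for a, b in zip(A[i], A[j])],
                              [-v*a + u*b for a, b in zip(A[i], A[j])])
        if A[i][i] < 0:                     # normalize the pivot sign
            A[i] = [-x for x in A[i]]
        for j in range(i):                  # reduce entries above the pivot
            q = A[j][i] // A[i][i]
            A[j] = [a - q * b for a, b in zip(A[j], A[i])]
    return A
```

The intermediate entries can blow up badly, which is exactly what the modulo-determinant techniques on the following slides are designed to control.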

155 Size dimension trade-offs Hermite normal form
Computing modulo the determinant [Domich et al. 87]
Property: for A non-singular, max_{i,j} H_{i,j} ≤ det H = |det A|.
Example: A = …, H = …, det A = 1944.

156 Size dimension trade-offs Hermite normal form
Computing modulo the determinant [Domich et al. 87]
Property: for A non-singular, max_{i,j} H_{i,j} ≤ det H = |det A|.
Example: A = …, H = …, det A = 1944.
Every computation can be done modulo d = det A: H is recovered from the Hermite form of [A; d I_n].
O(n^3) operations of cost M(n(log n + log ||A||)) = O~(n^5 log^2 ||A||)

157 Size dimension trade-offs Hermite normal form
Computing modulo the determinant
Pessimistic estimate on the arithmetic size: d is large, but most coefficients of H are small
On average: only the last few columns are large
Compute H′ close to H but with small determinant

158 Size dimension trade-offs Hermite normal form
Computing modulo the determinant
Pessimistic estimate on the arithmetic size: d is large, but most coefficients of H are small
On average: only the last few columns are large
Compute H′ close to H but with small determinant
[Micciancio & Warinschi 01]
A = [B b; c^T a_{n−1,n}; d^T a_{n,n}]
d_1 = det([B; c^T]), d_2 = det([B; d^T]), g = gcd(d_1, d_2) = s d_1 + t d_2
Then det([B; s c^T + t d^T]) = g.

159 Size dimension trade-offs Hermite normal form
Micciancio & Warinschi algorithm
1 Compute d_1 = det([B; c^T]), d_2 = det([B; d^T]), (g, s, t) = xgcd(d_1, d_2)   (DoubleDet)
2 Compute H_1, the HNF of [B; s c^T + t d^T], mod g   (Modular HNF)
3 Recover H_2, the HNF of [B b; s c^T + t d^T, s a_{n−1,n} + t a_{n,n}]   (AddCol)
4 Recover H_3, the HNF of [B b; c^T a_{n−1,n}; d^T a_{n,n}]   (AddRow)


161 Size dimension trade-offs Hermite normal form
Double Determinant
First approach: LU mod p_1, ..., p_k + CRT
only one elimination for the n−2 first rows
2 updates for the last rows (triangular back substitution)
k large, such that Π_{i=1}^{k} p_i > n^{n/2} ||A||^n

162 Size dimension trade-offs Hermite normal form
Double Determinant
First approach: LU mod p_1, ..., p_k + CRT
only one elimination for the n−2 first rows
2 updates for the last rows (triangular back substitution)
k large, such that Π_{i=1}^{k} p_i > n^{n/2} ||A||^n
Second approach [Abbott, Bronstein, Mulders 99]:
Solve Ax = b; let δ = lcm(q_1, ..., q_n) where x_i = p_i/q_i
Then δ is a large divisor of D = det A
Compute D/δ by LU mod p_1, ..., p_k + CRT
k small, such that Π_{i=1}^{k} p_i > n^{n/2} ||A||^n / δ

163 Size dimension trade-offs Hermite normal form
Double Determinant: improved
Property: let x = [x_1, ..., x_n] be the solution of [A c] x = d. Then y = [−x_1/x_n, ..., −x_{n−1}/x_n, 1/x_n] is the solution of [A d] y = c.
1 system solve, 1 LU for each p_i
d_1, d_2 computed at about the cost of 1 determinant
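The property can be verified with exact rational arithmetic: divide [A c]x = d by x_n and rearrange to get the stated solution of [A d]y = c (note the minus signs on the first n−1 entries). Illustrative Python sketch; `solve` and `second_solution` are our names.

```python
from fractions import Fraction

def solve(M, rhs):
    """Dense linear solve over Q (Gauss-Jordan, M assumed nonsingular)."""
    n = len(M)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(M, rhs)]
    for j in range(n):
        piv = next(i for i in range(j, n) if M[i][j])
        M[j], M[piv] = M[piv], M[j]
        M[j] = [x / M[j][j] for x in M[j]]
        for i in range(n):
            if i != j and M[i][j]:
                f = M[i][j]
                M[i] = [a - f * b for a, b in zip(M[i], M[j])]
    return [row[n] for row in M]

def second_solution(x):
    """From x solving [A|c] x = d, build y solving [A|d] y = c (x_n != 0)."""
    xn = x[-1]
    return [-xi / xn for xi in x[:-1]] + [1 / xn]
```

One solve therefore yields both determinants' systems, which is how d_1 and d_2 come out at about the cost of a single determinant.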

164 Size dimension trade-offs Hermite normal form
AddCol
Problem: find a vector e such that
[H_1 e] = U [B b; s c^T + t d^T, s a_{n−1,n} + t a_{n,n}]
e = U [b; s a_{n−1,n} + t a_{n,n}], where U = H_1 [B; s c^T + t d^T]^{−1}
Solve a system:
the n−1 first rows are small
the last row is large

165 Size dimension trade-offs Hermite normal form
AddCol
Idea: replace the last (large) row by a random small one w^T:
[B; w^T] y = [b; a_{n−1,n}]
Let {k} be a basis of the kernel of B. Then x = y + α k, where
α = (a_{n−1,n} − (s c^T + t d^T) y) / ((s c^T + t d^T) k)
limits the expensive arithmetic to a few dot products

166 Size dimension trade-offs Frobenius normal form
Computing the Frobenius normal form
Definition: the unique F = U^{−1} A U = Diag(C_{f_0}, ..., C_{f_k}) with f_k | f_{k−1} | ... | f_0.
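The companion blocks C_{f_i} can be built explicitly. A minimal Python sketch, using one common convention (ones on the subdiagonal, negated coefficients in the last column); the function names are ours.

```python
def companion(f):
    """Companion matrix of the monic polynomial
    x^d + f[d-1] x^{d-1} + ... + f[0], with f listing the low-order
    coefficients; last column holds the negated coefficients."""
    d = len(f)
    C = [[0] * d for _ in range(d)]
    for i in range(1, d):
        C[i][i - 1] = 1           # subdiagonal of ones
    for i in range(d):
        C[i][d - 1] = -f[i]       # last column: -f[0], ..., -f[d-1]
    return C

def frobenius_from_invariants(factors):
    """Block-diagonal Diag(C_f0, ..., C_fk): the Frobenius normal form for
    invariant factors f_k | f_{k-1} | ... | f_0 (divisibility not checked)."""
    blocks = [companion(f) for f in factors]
    n = sum(len(b) for b in blocks)
    F = [[0] * n for _ in range(n)]
    off = 0
    for b in blocks:
        for i, row in enumerate(b):
            F[off + i][off:off + len(b)] = row
        off += len(b)
    return F
```

Every matrix is similar to exactly one such block-diagonal form, which is what makes it a canonical form for similarity.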

167–170 Size dimension trade-offs Frobenius normal form
Computing the Frobenius normal form [P. & Storjohann 07]
[Figures: k-shifted form → (k+1)-shifted form → Hessenberg polycyclic form]
From k-shifted to (k+1)-shifted in O(k (n/k)^ω)
Compute iteratively from a 1-shifted form
Invariant factors appear by increasing degree
Until the Hessenberg polycyclic form

171 Size dimension trade-offs Frobenius normal form
Computing the Frobenius normal form [P. & Storjohann 07]
From k-shifted to (k+1)-shifted in O(k (n/k)^ω)
Compute iteratively from a 1-shifted form; invariant factors appear by increasing degree
Until the Hessenberg polycyclic form
Total: Σ_{k=1}^{n} (1/k)^{ω−1} n^ω ≤ ζ(ω−1) n^ω = O(n^ω)

172 Size dimension trade-offs Frobenius normal form Computing the Frobenius normal form [P. & Storjohann 07] Hessenberg polycyclic form (diagram, diagonal blocks of degrees d_1, ..., d_l). From k-shifted to (k+1)-shifted in O(k (n/k)^ω). Compute iteratively from a 1-shifted form. Invariant factors appear by increasing degree. Until the Hessenberg polycyclic form. Total cost: Σ_{k=1}^{n} k (n/k)^ω = n^ω Σ_{k=1}^{n} (1/k)^{ω−1} ≤ ζ(ω−1) n^ω = O(n^ω). Generalized to the Frobenius form as well. Transformation matrix in O(n^ω log log n). C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

173 Size dimension trade-offs Frobenius normal form A new type of size dimension trade-off: xI_n − A has dimension n and degree 1; det(xI_n − A) has dimension 1 and degree n. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

174 Size dimension trade-offs Frobenius normal form A new type of size dimension trade-off: xI_n − A: dimension n, degree 1. Keller-Gehrig: dimension n/2^i, degree 2^i. det(xI_n − A): dimension 1, degree n. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

175 Size dimension trade-offs Frobenius normal form A new type of size dimension trade-off: xI_n − A: dimension n, degree 1. Keller-Gehrig: dimension n/2^i, degree 2^i. New algorithm: dimension n/k, degree k. det(xI_n − A): dimension 1, degree n. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

176 Parallel exact linear algebra Outline 1 Choosing the underlying arithmetic Using boolean arithmetic Using machine word arithmetic Larger field sizes 2 Reductions and building blocks In dense linear algebra In blackbox linear algebra 3 Size dimension trade-offs Hermite normal form Frobenius normal form 4 Parallel exact linear algebra Ingredients for the parallelization Parallel dense linear algebra mod p C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

177 Parallel exact linear algebra Ingredients for the parallelization Parallelization Parallel numerical linear algebra: cost invariant w.r.t. the splitting, O(n^3); fine grain; block iterative algorithms; regular task load; numerical stability constraints. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

178 Parallel exact linear algebra Ingredients for the parallelization Parallelization Parallel numerical linear algebra: cost invariant w.r.t. the splitting, O(n^3); fine grain; block iterative algorithms; regular task load; numerical stability constraints. Exact linear algebra specificities: cost affected by the splitting, O(n^ω) for ω < 3, modular reductions; coarse grain; recursive algorithms; rank deficiencies; unbalanced task loads. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

179 Parallel exact linear algebra Ingredients for the parallelization Ingredients for the parallelization Criteria: good performance; portability across architectures; abstraction for simplicity. Challenging key point: scheduling as a plugin. Program: only describes where the parallelism lies. Runtime: scheduling & mapping, depending on the context of execution. 3 main models: 1 Parallel loop [data parallelism] 2 Fork-Join (independent tasks) [task parallelism] 3 Dependent tasks with data flow dependencies [task parallelism] C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

180 Parallel exact linear algebra Ingredients for the parallelization Data Parallelism OMP
for (int step = 0; step < 2; ++step) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        A[i] = (B[i+1] + B[i-1] + 2.0*B[i]) * 0.25;
}
Limitation: very inefficient with recursive parallel regions. Limited to iterative algorithms. No composition of routines. C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

181 Parallel exact linear algebra Ingredients for the parallelization Task parallelism with fork-join Task based program: spawn + sync. Especially suited for recursive programs. OMP (since v3)
void fibonacci(long *result, long n) {
    if (n < 2)
        *result = n;
    else {
        long x, y;
        #pragma omp task shared(x) /* x must be shared: task-local variables are firstprivate by default */
        fibonacci(&x, n-1);
        fibonacci(&y, n-2);
        #pragma omp taskwait
        *result = x + y;
    }
}
C. Pernet Exact Linear Algebra Algorithmic July 6, / 73

182 Parallel exact linear algebra Ingredients for the parallelization Tasks with dataflow dependencies Task based model avoiding synchronizations: infer the synchronizations from the read/write specifications. A task is ready for execution when all its input variables are ready. A variable is ready when it has been written. Recently supported: Athapascan [96], Kaapi [06], StarSs [07], StarPU [08], Quark [10], OMP since v4 [14]... C. Pernet Exact Linear Algebra Algorithmic July 6, / 73
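A minimal sketch of this read/write specification style using OpenMP 4 depend clauses: two producer tasks write x and y, and the consumer task is inferred to be ready only once both have been written. The function name is illustrative, and when compiled without OpenMP the pragmas are ignored and the result is unchanged:

```c
/* Three tasks: two producers, one consumer. With data-flow dependencies
 * the runtime infers that the consumer must wait for both producers. */
int dataflow_sum(void) {
    int x = 0, y = 0, z = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(x) depend(out: x)
        x = 1;                                   /* writes x */
        #pragma omp task shared(y) depend(out: y)
        y = 2;                                   /* writes y */
        #pragma omp task shared(x, y, z) depend(in: x, y) depend(out: z)
        z = x + y;                               /* ready once x and y are written */
        #pragma omp taskwait
    }
    return z;
}
```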


Fast reversion of formal power series

Fast reversion of formal power series Fast reversion of formal power series Fredrik Johansson LFANT, INRIA Bordeaux RAIM, 2016-06-29, Banyuls-sur-mer 1 / 30 Reversion of power series F = exp(x) 1 = x + x 2 2! + x 3 3! + x 4 G = log(1 + x)

More information

Parallel Integer Polynomial Multiplication Changbo Chen, Svyatoslav Parallel Integer Covanov, Polynomial FarnamMultiplication

Parallel Integer Polynomial Multiplication Changbo Chen, Svyatoslav Parallel Integer Covanov, Polynomial FarnamMultiplication Parallel Integer Polynomial Multiplication Parallel Integer Polynomial Multiplication Changbo Chen 1 Svyatoslav Covanov 2,3 Farnam Mansouri 2 Marc Moreno Maza 2 Ning Xie 2 Yuzhen Xie 2 1 Chinese Academy

More information

HPMPC - A new software package with efficient solvers for Model Predictive Control

HPMPC - A new software package with efficient solvers for Model Predictive Control - A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model

More information

GF(2 m ) arithmetic: summary

GF(2 m ) arithmetic: summary GF(2 m ) arithmetic: summary EE 387, Notes 18, Handout #32 Addition/subtraction: bitwise XOR (m gates/ops) Multiplication: bit serial (shift and add) bit parallel (combinational) subfield representation

More information

Mapping Sparse Matrix-Vector Multiplication on FPGAs

Mapping Sparse Matrix-Vector Multiplication on FPGAs Mapping Sparse Matrix-Vector Multiplication on FPGAs Junqing Sun 1, Gregory Peterson 1, Olaf Storaasli 2 1 University of Tennessee, Knoxville 2 Oak Ridge National Laboratory July 20, 2007 Outline Introduction

More information

BlockMatrixComputations and the Singular Value Decomposition. ATaleofTwoIdeas

BlockMatrixComputations and the Singular Value Decomposition. ATaleofTwoIdeas BlockMatrixComputations and the Singular Value Decomposition ATaleofTwoIdeas Charles F. Van Loan Department of Computer Science Cornell University Supported in part by the NSF contract CCR-9901988. Block

More information

Chapter 2. Divide-and-conquer. 2.1 Strassen s algorithm

Chapter 2. Divide-and-conquer. 2.1 Strassen s algorithm Chapter 2 Divide-and-conquer This chapter revisits the divide-and-conquer paradigms and explains how to solve recurrences, in particular, with the use of the master theorem. We first illustrate the concept

More information

This ensures that we walk downhill. For fixed λ not even this may be the case.

This ensures that we walk downhill. For fixed λ not even this may be the case. Gradient Descent Objective Function Some differentiable function f : R n R. Gradient Descent Start with some x 0, i = 0 and learning rate λ repeat x i+1 = x i λ f(x i ) until f(x i+1 ) ɛ Line Search Variant

More information

Polynomial multiplication and division using heap.

Polynomial multiplication and division using heap. Polynomial multiplication and division using heap. Michael Monagan and Roman Pearce Department of Mathematics, Simon Fraser University. Abstract We report on new code for sparse multivariate polynomial

More information

CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication

CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication March, 2006 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform

More information

Homomorphic Evaluation of the AES Circuit

Homomorphic Evaluation of the AES Circuit Homomorphic Evaluation of the AES Circuit IBM Research and University Of Bristol. August 22, 2012 Homomorphic Evaluation of the AES Circuit Slide 1 Executive Summary We present a working implementation

More information

Modeling and Tuning Parallel Performance in Dense Linear Algebra

Modeling and Tuning Parallel Performance in Dense Linear Algebra Modeling and Tuning Parallel Performance in Dense Linear Algebra Initial Experiences with the Tile QR Factorization on a Multi Core System CScADS Workshop on Automatic Tuning for Petascale Systems Snowbird,

More information

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic. Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEMORY INPUT-OUTPUT CONTROL DATAPATH

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 02, 03 May 2016 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 53 Most Essential Assumptions for Real-Time Systems Upper

More information

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT

More information