Exact Linear Algebra Algorithmic: Theory and Practice
1 Exact Linear Algebra Algorithmic: Theory and Practice. ISSAC'15 Tutorial. Clément Pernet, Université Grenoble Alpes, Inria, LIP-AriC. July 6, 2015.
5 Exact linear algebra
Matrices can be:
- Dense: store all coefficients
- Sparse: store only the non-zero coefficients
- Black-box: no access to the storage, only apply to a vector
Coefficient domains:
- Word size: integers with a priori bounds; Z/pZ for p of 32 bits
- Multi-precision: Z/pZ for p of 100, 200, 1000, 2000, ... bits
- Arbitrary precision: Z, Q
- Polynomials: K[X] for K any of the above
Several implementations for the same domain: one better fits FFT, another linear algebra, etc.
Need to structure the design.
7 Exact linear algebra: Motivations
- Computational number theory: CharPoly, LinSys, Echelon, over Z, Q, Z/pZ; dense
- Graph theory: MatMul, CharPoly, Det, over Z; sparse
- Discrete logarithm: LinSys, over Z/pZ, p of about 120 bits; sparse
- Integer factorization: NullSpace, over Z/2Z; sparse
- Algebraic attacks: Echelon, LinSys, over Z/pZ, p of about 20 bits; sparse & dense
- List decoding of RS codes: lattice reduction, over GF(q)[X]; structured
Need for high performance.
8 Content
The scope of this presentation:
- not an exhaustive overview of linear algebra algorithmics and complexity improvements
- a few guidelines for the use and design of exact linear algebra in practice
- bridging theoretical algorithmic developments and practical efficiency concerns
9 Outline
1 Choosing the underlying arithmetic: using boolean arithmetic; using machine word arithmetic; larger field sizes
2 Reductions and building blocks: in dense linear algebra; in blackbox linear algebra
3 Size dimension trade-offs: Hermite normal form; Frobenius normal form
4 Parallel exact linear algebra: ingredients for the parallelization; parallel dense linear algebra mod p
11 Choosing the underlying arithmetic: achieving high practical efficiency
Most linear algebra operations boil down to (a lot of) y ← y ± a * b:
- dot-product
- matrix-matrix multiplication
- rank-1 update in Gaussian elimination
- Schur complements, ...
Efficiency relies on fast arithmetic and fast memory accesses. Here: focus on dense linear algebra.
15 Choosing the underlying arithmetic: which computer arithmetic?
Many base fields/rings to support:
- Z_2 (1 bit): bit-packing
- Z_3, Z_5, Z_7 (2-3 bits): bit-slicing, bit-packing
- Z_p (4-26 bits): CPU arithmetic
- Z, Q (> 32 bits): multiprecision ints, big ints, CRT, lifting
- Z_p (> 32 bits): multiprecision ints, big ints, CRT
- GF(p^k) = Z_p[X]/(Q): polynomial, Kronecker, Zech log, ...
Available CPU arithmetic: boolean, integer (fixed size), floating point, and their vectorization.
16 Choosing the underlying arithmetic: using boolean arithmetic
Dense linear algebra over Z_2: bit-packing. One uint64_t holds (Z_2)^64.
Dot-product (a, b), with ^ the bit-wise XOR (+ mod 2) and & the bit-wise AND (* mod 2):

  uint64_t t = 0;
  for (int k = 0; k < N/64; ++k)
      t ^= a[k] & b[k];
  c = parity(t);

where parity(x) folds x by halves: if x is a single bit, return it; otherwise return parity(High(x) ^ Low(x)).
Can be parallelized on 64 instances.
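A minimal runnable version of this packed dot-product; as an assumption of this sketch, the recursive parity is replaced by the GCC/Clang builtin __builtin_popcountll, and nwords is the number of 64-bit words (N/64 above):

```c
#include <stdint.h>

/* Dot-product over Z_2 of two bit-packed vectors of 64*nwords entries.
   a[k] & b[k] multiplies 64 entries mod 2 at once; XOR accumulates the
   sums mod 2; the parity of the accumulator is the final dot-product. */
static int dot_gf2(const uint64_t *a, const uint64_t *b, int nwords) {
    uint64_t t = 0;
    for (int k = 0; k < nwords; ++k)
        t ^= a[k] & b[k];
    return __builtin_popcountll(t) & 1; /* parity of t */
}
```

This is correct because parity(x ^ y) = parity(x) ^ parity(y), so XOR-ing the per-word products loses no information about the final sum mod 2.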
19 Choosing the underlying arithmetic: using boolean arithmetic
Tabulation: avoid computing parities; balance computation vs communication; (slight) complexity improvement possible.
The Four Russians method [Arlazarov, Dinic, Kronrod, Faradzev 70]:
1 compute all 2^k linear combinations of k rows of B. Gray code: each new line costs 1 vector add (vs k/2)
2 multiply chunks of length k of A by table look-up
(Table for k = 3, in Gray-code order: 0, B3, B2+B3, B2, B1+B2, B1+B2+B3, B1+B3, B1.)
For k = log n: O(n^3 / log n). In practice: choose k so that the table fits in the L2 cache.
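A sketch of step 1, assuming 64-bit packed rows (k and the row type are illustrative): the table of all 2^k combinations is filled in standard reflected-Gray-code order, so each entry is one XOR away from the previous one.

```c
#include <stdint.h>

/* Build the 2^k linear combinations of the k packed rows B[0..k-1]
   (one 64-bit word per row). gray(i) = i ^ (i >> 1); consecutive Gray
   codes differ in a single bit, so each new entry costs one vector add
   (XOR). table[g] holds the XOR of the rows selected by the bits of g. */
static void build_gray_table(const uint64_t *B, int k, uint64_t *table) {
    table[0] = 0;
    unsigned prev = 0;
    for (unsigned i = 1; i < (1u << k); ++i) {
        unsigned g = i ^ (i >> 1);  /* Gray code of i */
        unsigned bit = g ^ prev;    /* the single toggled bit */
        int r = 0;                  /* index of the toggled row */
        while (!((bit >> r) & 1)) ++r;
        table[g] = table[prev] ^ B[r];
        prev = g;
    }
}
```

Step 2 is then a plain look-up: each k-bit chunk of a row of A indexes table[] directly.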
20 Choosing the underlying arithmetic: using boolean arithmetic
Dense linear algebra over Z_2: the M4RI library [Albrecht, Bard, Hart 10]:
- bit-packing
- Method of the Four Russians
- SIMD vectorization of boolean instructions (128-bit registers)
- cache optimization
- Strassen's O(n^2.81) algorithm
Timings (matrix product, n x n by n x n, on an i5 SandyBridge 2.6 GHz), for three growing dimensions n: SIMD bool arithmetic: 2.109s, ...; SIMD + 4 Russians: 0.256s, 2.829s, 29.28s; SIMD + 4 Russians + Strassen: 0.257s, 2.001s, ...
26 Choosing the underlying arithmetic: using boolean arithmetic
Dense linear algebra over Z_3, Z_5 [Boothby & Bradshaw 09]
Z_3 = {0, 1, -1}:
- encoded as {00, 01, 10}: add/sub in 7 bool ops
- encoded as {00, 10, 11}: add/sub in 6 bool ops
Bit-slicing: the vector (-1, 0, 1, 0, 1, -1, 1, 0) in Z_3^8, i.e. (11, 00, 10, 00, 10, 11, 10, 00), is stored as 2 words: (1,0,1,0,1,1,1,0) and (1,0,0,0,0,1,0,0).
Then y ← y + x·b for x in Z_3 and y, b in Z_3^64 takes 6 boolean word ops.
Recipe for Z_5:
- use redundant representations on 3 bits + bit-slicing
- integer add + bool operations
- pseudo-reduction mod 5 (4 bits → 3 bits) in 8 bool ops, found by computer-assisted search.
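The bit-slicing idea can be sketched as follows. This is not the optimized 6-op circuit of Boothby & Bradshaw, just a straightforward correct one (my own derivation, about 12 word ops), using two bitplanes p and n marking the +1 and -1 entries:

```c
#include <stdint.h>

typedef struct { uint64_t p, n; } sliced3;
/* 64 entries of Z_3 = {-1, 0, 1}: bit i of p set means entry i is +1,
   bit i of n set means entry i is -1; p & n == 0 always. */

/* Entry-wise addition mod 3 on all 64 packed entries at once.
   Result is +1 where exactly one operand is +1 and none is -1,
   or where both are -1 (since -1 + -1 = 1 mod 3); symmetrically for -1. */
static sliced3 add3(sliced3 a, sliced3 b) {
    sliced3 c;
    c.p = ((a.p ^ b.p) & ~(a.n | b.n)) | (a.n & b.n);
    c.n = ((a.n ^ b.n) & ~(a.p | b.p)) | (a.p & b.p);
    return c;
}
```

All 64 additions happen in a handful of word operations, which is the whole point of slicing.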
30 Choosing the underlying arithmetic: using machine word arithmetic
Dense linear algebra over Z_p for word-size p.
Delayed modular reductions:
1 compute using integer arithmetic
2 reduce modulo p only when necessary
When to reduce? Bound the values of all intermediate computations. A priori, for the representation {0, ..., p-1} of Z_p (resp. {-(p-1)/2, ..., (p-1)/2}):
- scalar product, classic MatMul: n(p-1)^2, resp. n((p-1)/2)^2
- Strassen-Winograd MatMul (l recursive levels): ((1+3^l)/2)^2 (n/2^l) (p-1)^2, resp. 9^l (n/2^l) ((p-1)/2)^2
Or: maintain locally a bounding interval on all matrices computed.
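The classic-MatMul bound translates directly into a maximal dot-product length before any reduction is needed. A minimal sketch, assuming the centered representation and a 2^53 double mantissa as in the text:

```c
#include <stdint.h>

/* Largest n such that n * ((p-1)/2)^2 < 2^53: with a centered
   representation of Z_p, a dot product of up to n terms accumulated in
   doubles is exact, so a single reduction mod p at the end suffices. */
static uint64_t max_delayed_terms(uint64_t p) {
    uint64_t h = (p - 1) / 2;
    if (h == 0) return UINT64_MAX;  /* p = 2: all products are 0 or 1 */
    return ((UINT64_C(1) << 53) - 1) / (h * h);
}
```

For half-word primes this allows millions of accumulations between reductions, which is why delaying them pays off.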
36 Choosing the underlying arithmetic: using machine word arithmetic
Computing over fixed size integers: how to compute with (machine word size) integers efficiently?
1 use the CPU's integer arithmetic units + vectorization. y += a * b is correct if ab + y < 2^63, e.g. a, b < 2^31.
  Scalar: movq (%rax,%rdx,8), %rax; imulq -56(%rbp), %rax; addq %rax, %rcx; movq -80(%rbp), %rax
  Vectorized: vpmuludq %xmm3, %xmm0, %xmm0; vpaddq %xmm2, %xmm0, %xmm0; vpsllq $32, %xmm0, %xmm0
2 use the CPU's floating point units + vectorization. y += a * b is correct if ab + y < 2^53, e.g. a, b < 2^26.
  Scalar: movsd (%rax,%rdx,8), %xmm0; mulsd -56(%rbp), %xmm0; addsd %xmm0, %xmm1; movq %xmm1, %rax
  Vectorized: vinsertf128 $0x1, 16(%rcx,%rax), %ymm0; vmulpd %ymm1, %ymm0, %ymm0; vaddpd (%rsi,%rax), %ymm0, %ymm0; vmovapd %ymm0, (%rsi,%rax)
39 Choosing the underlying arithmetic: using machine word arithmetic
Exploiting in-core parallelism. Ingredients:
- SIMD (Single Instruction Multiple Data): 1 arithmetic unit acting on a vector of data. MMX: 64 bits; SSE: 128 bits; AVX: 256 bits; AVX-512: 512 bits. E.g. on 4 x 64 = 256 bits: (x[0], x[1], x[2], x[3]) + (y[0], y[1], y[2], y[3]) = (x[0]+y[0], x[1]+y[1], x[2]+y[2], x[3]+y[3]).
- Pipeline: amortize the latency of an operation when used repeatedly; throughput of 1 op/cycle for all arithmetic ops considered here.
- Execution unit parallelism: multiple arithmetic units acting simultaneously on distinct registers.
40 Choosing the underlying arithmetic: using machine word arithmetic
SIMD and vectorization: Intel Sandybridge micro-architecture. Scheduler ports 0, 1, 5 feed FMUL, Int MUL, FADD and Int ADD units; floating-point units are 256-bit (4 x 64), integer units 128-bit (2 x 64).
Performs at every clock cycle: 1 floating pt. mul (x4) and 1 floating pt. add (x4); or 1 integer mul (x2) and 2 integer adds (x2).
41 Choosing the underlying arithmetic: using machine word arithmetic
SIMD and vectorization: Intel Haswell micro-architecture. Scheduler ports 0, 1, 5 feed two FMA units, Int MUL and Int ADD units, all 256-bit (4 x 64).
Performs at every clock cycle: 2 floating pt. mul & add (x4); or 1 integer mul (x4) and 2 integer adds (x4).
FMA: fused multiply and accumulate, c += a * b.
42 Choosing the underlying arithmetic: using machine word arithmetic
SIMD and vectorization: AMD Bulldozer micro-architecture. Scheduler ports 0-3 feed two FMA units, Int MUL and Int ADD units, all 128-bit (2 x 64).
Performs at every clock cycle: 2 floating pt. mul & add (x2); or 1 integer mul (x2) and 2 integer adds (x2).
FMA: fused multiply and accumulate, c += a * b.
43 Choosing the underlying arithmetic: using machine word arithmetic
SIMD and vectorization: Intel Nehalem micro-architecture. Scheduler ports 0, 1, 5 feed FMUL, Int MUL, FADD and Int ADD units, all 128-bit (2 x 64).
Performs at every clock cycle: 1 floating pt. mul (x2) and 1 floating pt. add (x2); or 1 integer mul (x2) and 2 integer adds (x2).
53 Choosing the underlying arithmetic: using machine word arithmetic
Summary: 64-bit AXPY throughput. For each CPU (Intel Haswell with AVX2, Intel Sandybridge with AVX1, AMD Bulldozer with FMA4, Intel Nehalem with SSE4, AMD K10 with SSE4a), the table compares the integer (INT) and floating-point (FP) paths on: register size, # adders, # multipliers, # FMA units, # daxpy per cycle, CPU frequency (GHz), speed of light (Gfops) and speed in practice (Gfops).
Speed of light: CPU freq x (# daxpy / cycle) x 2.
54 Choosing the underlying arithmetic: using machine word arithmetic
Computing over fixed size integers: resources.
Micro-architecture bible: Agner Fog's software optimization resources.
Experiments: dgemm (double): OpenBLAS; igemm (int64_t): Eigen & FFLAS-FFPACK [linalg.org/projects/fflas-ffpack].
55 Choosing the underlying arithmetic: using machine word arithmetic
Integer packing: 32 bits gives half the precision, twice the speed; a register holding 4 doubles holds 8 floats. Bar plot: Gfops for double, float, int64_t and int32_t on Intel SandyBridge, Intel Haswell and AMD Bulldozer.
57 Choosing the underlying arithmetic: using machine word arithmetic
Computing over fixed size integers. Plots: fgemm C = A x B, n = 2000, speed (Gfops) and bit speed (Gbops) vs bitsize of the modulo p, for float mod p, double mod p, int64 mod p; SandyBridge i5-3320M at 3.3 GHz.
Take-home message:
- floating pt. arithmetic delivers the highest speed (except in corner cases)
- 32 bits is twice as fast as 64 bits
- best bit-computation throughput for double precision floating point.
59 Choosing the underlying arithmetic: larger field sizes
Larger finite fields: log2 p >= 32. As before:
1 use adequate integer arithmetic
2 reduce modulo p only when necessary
Which integer arithmetic?
1 big integers (GMP)
2 fixed-size multiprecision (Givaro-RecInt)
3 Residue Number Systems (Chinese Remainder Theorem), using moduli delivering optimum bitspeed
Timings (n = 1000, on an Intel SandyBridge), for three growing values of log2 p: GMP: 58.1s, 94.6s, 140s; RecInt: 5.7s, 28.6s, 837s; RNS: 0.785s, 1.42s, 1.88s.
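A minimal sketch of the RNS idea: compute independently modulo several word-size primes, then reconstruct the result by the Chinese Remainder Theorem. The two moduli below are illustrative; a real RNS basis uses many primes chosen for optimum bitspeed.

```c
#include <stdint.h>

/* Modular inverse by the extended Euclidean algorithm (gcd(a, m) = 1). */
static int64_t inv_mod(int64_t a, int64_t m) {
    int64_t t = 0, nt = 1, r = m, nr = a % m;
    while (nr) {
        int64_t q = r / nr, tmp;
        tmp = t - q * nt; t = nt; nt = tmp;
        tmp = r - q * nr; r = nr; nr = tmp;
    }
    return t < 0 ? t + m : t;
}

/* Reconstruct x in [0, m1*m2) from r1 = x mod m1 and r2 = x mod m2,
   m1 and m2 coprime: x = r1 + m1 * ((r2 - r1) * m1^{-1} mod m2). */
static int64_t crt2(int64_t r1, int64_t m1, int64_t r2, int64_t m2) {
    int64_t d = (r2 - r1) % m2;
    if (d < 0) d += m2;
    return r1 + m1 * ((d * inv_mod(m1 % m2, m2)) % m2);
}
```

For instance, a product whose true value is below m1*m2 can be computed as two independent word-size products mod m1 and mod m2, then recombined once at the end.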
62 Choosing the underlying arithmetic: larger field sizes, in practice
Plots: fgemm C = A x B, n = 2000, speed (Gfops) and bit speed (Gbops) vs bitsize of the modulo p, for float mod p, double mod p, int64 mod p, RNS mod p and RecInt mod p.
64 Reductions and building blocks
Reductions to building blocks: there is a huge number of algorithmic variants for a given computation in O(n^3). Need to structure the design of the set of routines:
- focus the tuning effort on a single routine
- some operations deliver better efficiency: in practice, memory access pattern; in theory, better algorithms
67 Reductions and building blocks, in dense linear algebra: memory access pattern
The memory wall: communication speed improves more slowly than arithmetic (CPU arithmetic throughput vs memory speed over time). Deep memory hierarchy: RAM, L3, L2, L1, CPU. Need to overlap communications with computation.
Design of BLAS 3 [Dongarra et al. 87]: group all operations into matrix products (gemm): work O(n^3) >> data O(n^2).
MatMul has become a building block in practice.
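The BLAS-3 point can be illustrated with a cache-blocked product: each b x b tile of A and B is reused many times once loaded, so O(n^3) work rides on O(n^2) memory traffic. A minimal sketch; the block size b is a tuning parameter, and real BLAS kernels add packing and SIMD micro-kernels on top of this.

```c
/* C += A * B on n x n row-major double matrices, processed by b x b
   tiles so that the working set of the three current tiles fits in
   cache; the i-k-j inner order streams rows of B and C contiguously. */
static void gemm_blocked(int n, int b,
                         const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += b)
        for (int kk = 0; kk < n; kk += b)
            for (int jj = 0; jj < n; jj += b)
                for (int i = ii; i < ii + b && i < n; ++i)
                    for (int k = kk; k < kk + b && k < n; ++k) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + b && j < n; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```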
71 Reductions and building blocks, in dense linear algebra: sub-cubic linear algebra
Before 1969: O(n^3) for everyone (Gauss, Householder, Danilevskii, etc).
Matrix multiplication in O(n^ω): [Strassen 69] O(n^2.807); [Schönhage 81] O(n^2.52); [Coppersmith, Winograd 90] O(n^2.376); [Le Gall 14] O(n^2.3729).
Other operations: [Strassen 69] inverse in O(n^ω); [Schönhage 72] QR in O(n^ω); [Bunch, Hopcroft 74] LU in O(n^ω); [Ibarra et al. 82] rank in O(n^ω); [Keller-Gehrig 85] CharPoly in O(n^ω log n).
MatMul has become a building block in theoretical reductions.
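A compact illustration of the recursion behind Strassen's O(n^2.807) bound: 7 half-size products instead of 8. This is plain Strassen, not the Winograd variant with fewer additions; n is assumed to be a power of two, the cutoff is illustrative, and sub-blocks are copied out for clarity rather than handled in place.

```c
#include <stdlib.h>

static void vadd(int len, const double *x, const double *y, double *z) {
    for (int i = 0; i < len; ++i) z[i] = x[i] + y[i];
}
static void vsub(int len, const double *x, const double *y, double *z) {
    for (int i = 0; i < len; ++i) z[i] = x[i] - y[i];
}

/* Cubic base case: C = A*B, n x n contiguous row-major doubles. */
static void mm_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0;
            for (int k = 0; k < n; ++k) s += A[i*n+k] * B[k*n+j];
            C[i*n+j] = s;
        }
}

/* Copy the (r,c) half-size block of the n x n matrix M into Out. */
static void get_block(int n, const double *M, int r, int c, double *Out) {
    int h = n / 2;
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j)
            Out[i*h+j] = M[(r*h+i)*n + c*h+j];
}

/* C = A*B by plain Strassen; n must be a power of two. */
static void strassen(int n, const double *A, const double *B, double *C) {
    if (n <= 2) { mm_naive(n, A, B, C); return; }
    int h = n / 2, sz = h * h;
    double *w = malloc(sizeof(double) * sz * 17);
    double *A11 = w,        *A12 = w + sz,   *A21 = w + 2*sz, *A22 = w + 3*sz;
    double *B11 = w + 4*sz, *B12 = w + 5*sz, *B21 = w + 6*sz, *B22 = w + 7*sz;
    double *t1 = w + 8*sz, *t2 = w + 9*sz;
    double *M1 = w + 10*sz, *M2 = w + 11*sz, *M3 = w + 12*sz, *M4 = w + 13*sz,
           *M5 = w + 14*sz, *M6 = w + 15*sz, *M7 = w + 16*sz;
    get_block(n, A, 0, 0, A11); get_block(n, A, 0, 1, A12);
    get_block(n, A, 1, 0, A21); get_block(n, A, 1, 1, A22);
    get_block(n, B, 0, 0, B11); get_block(n, B, 0, 1, B12);
    get_block(n, B, 1, 0, B21); get_block(n, B, 1, 1, B22);
    /* the 7 recursive products */
    vadd(sz, A11, A22, t1); vadd(sz, B11, B22, t2); strassen(h, t1, t2, M1);
    vadd(sz, A21, A22, t1); strassen(h, t1, B11, M2);
    vsub(sz, B12, B22, t2); strassen(h, A11, t2, M3);
    vsub(sz, B21, B11, t2); strassen(h, A22, t2, M4);
    vadd(sz, A11, A12, t1); strassen(h, t1, B22, M5);
    vsub(sz, A21, A11, t1); vadd(sz, B11, B12, t2); strassen(h, t1, t2, M6);
    vsub(sz, A12, A22, t1); vadd(sz, B21, B22, t2); strassen(h, t1, t2, M7);
    /* recombine the quadrants of C */
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
            int q = i*h + j;
            C[i*n + j]         = M1[q] + M4[q] - M5[q] + M7[q];
            C[i*n + j + h]     = M3[q] + M5[q];
            C[(i+h)*n + j]     = M2[q] + M4[q];
            C[(i+h)*n + j + h] = M1[q] - M2[q] + M3[q] + M6[q];
        }
    free(w);
}
```

Counting the recursion gives T(n) = 7 T(n/2) + O(n^2) = O(n^{log2 7}) = O(n^2.807).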
75 Reductions and building blocks, in dense linear algebra: reductions, theory and practice
Reduction tree: HNF(Z), SNF(Z), Det(Z) and LinSys(Z) reduce to MM(Z); LinSys(Z_p), Det(Z_p), MinPoly(Z_p) and CharPoly(Z_p) reduce through LU(Z_p) and TRSM(Z_p) to MM(Z_p).
Common mistrust: fast linear algebra is never faster, and is numerically unstable.
Lucky coincidence: the same building block serves in theory and in practice; the reduction trees are still relevant.
Road map towards efficiency in practice:
1 tune the MatMul building block
2 tune the reductions
3 new reductions.
76 Reductions and building blocks: In dense linear algebra
Putting it together: the MatMul building block over Z/pZ
Ingredients [FFLAS-FFPACK library]:
Compute over Z and delay modular reductions: exact as long as k ((p-1)/2)^2 < 2^mantissa.
Fastest integer arithmetic: double; cache optimizations: numerical BLAS.
Strassen-Winograd with l recursive levels: 9^l (k/2^l) ((p-1)/2)^2 < 2^53; leading constant 6n^2.81, with memory-efficient schedules [Boyer, Dumas, P. and Zhou 09].
Tradeoffs: extra memory / overwriting input / leading constant; fully in-place in 7.2n^2.81.
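The delayed-reduction condition above can be illustrated with a small sketch (hypothetical helper name, not the FFLAS-FFPACK API): a dot product over Z/pZ accumulated exactly in double precision, with a single modular reduction at the end.

```python
# Illustrative sketch of delayed modular reduction: accumulate a dot
# product of length k exactly inside float64 and reduce only once.
# With classical representatives in [0, p) this is exact whenever
# k * (p - 1)^2 < 2^53 (the double mantissa).
import numpy as np

def dot_mod_p(u, v, p):
    k = len(u)
    # every partial sum must fit in the 53-bit double mantissa
    assert k * (p - 1) ** 2 < 2 ** 53, "delayed reduction would overflow"
    s = np.dot(np.asarray(u, dtype=np.float64), np.asarray(v, dtype=np.float64))
    return int(s) % p  # single modular reduction at the end

p = 821
rng = np.random.default_rng(0)
u = rng.integers(0, p, size=1000)
v = rng.integers(0, p, size=1000)
assert dot_mod_p(u, v, p) == sum(int(a) * int(b) for a, b in zip(u, v)) % p
```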
80 Reductions and building blocks: In dense linear algebra
Sequential matrix multiplication, i5-3320M at 2.6GHz with AVX
[Plot: n^3/time/10^9 (Gfops equiv.) vs. matrix dimension, comparing OpenBLAS sgemm and dgemm with FFLAS fgemm over Z/83Z, Z/821Z and two multi-word primes; the larger the prime, the more frequent the modular reductions (1 mod / 100 mul. for p = 821), yet fgemm tracks the numerical BLAS closely.]
84 Reductions and building blocks: In dense linear algebra
Reductions in dense linear algebra
LU decomposition: the block recursive algorithm reduces to MatMul, in O(n^ω).
LAPACK-dgetrf: 0.024s, 2.01s, 14.88s, 48.78s; fflas-ffpack: 0.058s, 2.46s, 16.08s, 47.47s (for increasing n, on an Intel Haswell at 2.90GHz using OpenBLAS).
Characteristic polynomial: a new reduction to matrix multiplication in O(n^ω).
magma: 24.28s, 332.7s, 2497s (three largest sizes); fflas-ffpack: 0.532s, 2.936s, 32.71s, 219.2s (for increasing n, on an Intel Ivy-Bridge at 3.40GHz using OpenBLAS).
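The block recursive reduction of LU to MatMul can be sketched as follows. This is a simplified model assuming no pivoting is needed (every leading minor invertible mod p, which the test matrix guarantees by construction); the actual fflas-ffpack routines handle pivoting and rank deficiency.

```python
import numpy as np

def solve_unit_lower(L, B, p):
    # X with L @ X = B (mod p), L unit lower triangular
    X = np.array(B) % p
    for i in range(L.shape[0]):
        for j in range(i):
            X[i] = (X[i] - L[i, j] * X[j]) % p
    return X

def solve_upper_right(B, U, p):
    # X with X @ U = B (mod p), U upper triangular, invertible diagonal
    X = np.array(B) % p
    for j in range(U.shape[0]):
        for k in range(j):
            X[:, j] = (X[:, j] - X[:, k] * U[k, j]) % p
        X[:, j] = (X[:, j] * pow(int(U[j, j]), -1, p)) % p
    return X

def lu_rec(A, p):
    # one recursive call on A11, two trsm, one gemm (Schur complement),
    # one recursive call on the complement: LU reduces to MatMul
    n = A.shape[0]
    if n == 1:
        return np.ones((1, 1), dtype=np.int64), A % p
    m = n // 2
    L11, U11 = lu_rec(A[:m, :m], p)
    U12 = solve_unit_lower(L11, A[:m, m:], p)       # trsm
    L21 = solve_upper_right(A[m:, :m], U11, p)      # trsm
    S = (A[m:, m:] - L21 @ U12) % p                 # gemm
    L22, U22 = lu_rec(S, p)
    Z = np.zeros((m, n - m), dtype=np.int64)
    L = np.block([[L11, Z], [L21, L22]])
    U = np.block([[U11, U12], [Z.T, U22]])
    return L, U

# build A = L0 @ U0 mod p so that all leading minors are invertible
p, n = 101, 8
rng = np.random.default_rng(0)
L0 = np.tril(rng.integers(0, p, (n, n)), -1) + np.eye(n, dtype=np.int64)
U0 = np.triu(rng.integers(0, p, (n, n)), 1) + np.diag(rng.integers(1, p, n))
A = (L0 @ U0) % p
L, U = lu_rec(A, p)
assert np.array_equal((L @ U) % p, A)
```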
87 Reductions and building blocks: In dense linear algebra
The case of Gaussian elimination: which reduction to MatMul?
Slab iterative (LAPACK), slab recursive (FFLAS-FFPACK), tile iterative (PLASMA), tile recursive (FFLAS-FFPACK).
Sub-cubic complexity requires recursive algorithms; data locality favors tiles: hence the tile recursive variant.
90 Reductions and building blocks: In dense linear algebra
Block algorithms: tiled iterative, slab recursive, tiled recursive
[Animation: each variant sweeps the matrix by interleaving the three kernels getrf (A -> L, U), trsm (B <- BU^-1 or B <- L^-1 B) and gemm (C <- C - A B); the iterative variant repeats the getrf/trsm/gemm cycle tile by tile, the recursive variants apply it to halves.]
107 Reductions and building blocks: In dense linear algebra
Counting modular reductions
[Table: number of modular reductions of the tiled iterative right-looking, left-looking and Crout variants, as polynomials in n and the number of tiles k: right-looking has a leading term in n^3/(3k), left-looking and Crout only quadratic terms. The k = 1 column gives the classical iterative variants; the last rows give the tiled and slab recursive algorithms, whose counts are in O(n^2) and O(n^2 log n) respectively.]
110 Reductions and building blocks: In dense linear algebra
Impact in practice
[Plots: (2/3)n^3/time/10^9 vs. matrix dimension for the sequential LU decomposition variants on one core: Right, Left and Crout (with k = 212, resp. k = n/3), Tiled-Rec and Slab-Rec.]
As anticipated: Right-looking < Crout < Left-looking.
112 Reductions and building blocks: In dense linear algebra
Dealing with rank deficiencies and computing rank profiles
Rank profiles: the first linearly independent columns. A major invariant of a matrix (echelon form), used in Gröbner basis computations (Macaulay matrix) and Krylov methods.
Gaussian elimination revealing echelon forms: [Ibarra, Moran and Hui 82]: A = LSP. [Keller-Gehrig 85]: XA = R. [Jeannerod, P. and Storjohann 13]: A = PLE.
113 Reductions and building blocks: In dense linear algebra
Computing rank profiles
Lessons learned (or what we thought was necessary): treat rows in order; exhaust all columns before considering the next row; slab block splitting required (recursive or iterative); similar to partial pivoting.
Tiled recursive PLUQ [Dumas, P. and Sultan 13, 15]:
1 Generalized to handle rank deficiency: 4 recursive calls necessary, in-place computation.
2 Pivoting strategies exist to recover rank profiles and echelon forms.
115 Reductions and building blocks: In dense linear algebra
A tiled recursive algorithm [Dumas, P. and Sultan 13]
[Animation, step by step: 2x2 block splitting; recursive call on the top-left block; TRSM (B <- BU^-1); TRSM (B <- L^-1 B); three MatMul updates (C <- C - A B); 2 independent recursive calls; TRSM (B <- BU^-1); TRSM (B <- L^-1 B); three MatMul updates; a last recursive call; a final puzzle game (block cyclic rotations).]
130 Reductions and building blocks: In dense linear algebra
A tiled recursive algorithm [Dumas, P. and Sultan 13]
O(mnr^(ω-2)) (degenerating to (2/3)n^3); computes the column and row rank profiles of all leading sub-matrices; fewer modular reductions than slab algorithms; rank deficiency introduces parallelism.
131 Reductions and building blocks: In dense linear algebra
Computing all rank profiles at once [Dumas, P. and Sultan, ISSAC 15 (Thursday 3PM)]
Definition (Rank profile matrix): the unique R_A in {0,1}^(m x n) such that any pair of (i,j)-leading sub-matrices of R_A and of A have the same rank.
Theorem: RowRP and ColRP can be read directly off R(A), and the same holds for any (i,j)-leading submatrix. [Examples: matrices A with RowRP = {1}, ColRP = {1}; RowRP = {1,2}, ColRP = {1,3}; RowRP = {1,4}, ColRP = {1,2}, shown with their rank profile matrices R.]
From a PLUQ decomposition:
A = P [[L, 0], [M, I_(m-r)]] [[I_r, 0], [0, 0]] [[U, V], [0, I_(n-r)]] Q
  = (P [[L, 0], [M, I_(m-r)]] P^T) (P [[I_r, 0], [0, 0]] Q) (Q^T [[U, V], [0, I_(n-r)]] Q).
With appropriate pivoting, the middle factor Pi_(P,Q) = R(A).
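The definition can be cross-checked naively (a direct, inefficient translation of the definition, not the algorithm of the paper): compute the rank of every leading submatrix over Z/pZ and take finite differences.

```python
def rank_mod_p(rows, p):
    # Gaussian elimination rank over Z/pZ
    M = [[x % p for x in r] for r in rows]
    m, n = len(M), len(M[0]) if M else 0
    r = 0
    for j in range(n):
        piv = next((i for i in range(r, m) if M[i][j]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][j], -1, p)
        for i in range(r + 1, m):
            f = M[i][j] * inv % p
            if f:
                M[i] = [(a - f * b) % p for a, b in zip(M[i], M[r])]
        r += 1
    return r

def rank_profile_matrix(A, p):
    # R[i][j] = r(i,j) - r(i-1,j) - r(i,j-1) + r(i-1,j-1), where r(i,j)
    # is the rank of the (i,j)-leading submatrix: the inclusion-exclusion
    # form of the definition (the paper obtains R from a single PLUQ)
    m, n = len(A), len(A[0])
    r = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r[i][j] = rank_mod_p([row[:j] for row in A[:i]], p)
    return [[r[i][j] - r[i - 1][j] - r[i][j - 1] + r[i - 1][j - 1]
             for j in range(1, n + 1)] for i in range(1, m + 1)]

p = 7
A = [[0, 1, 2],
     [0, 0, 0],
     [1, 0, 0]]
R = rank_profile_matrix(A, p)
assert R == [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
# RowRP / ColRP are the indices of the nonzero rows / columns of R
assert [i + 1 for i, row in enumerate(R) if any(row)] == [1, 3]
assert [j + 1 for j in range(3) if any(R[i][j] for i in range(3))] == [1, 2]
```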
139 Reductions and building blocks: In blackbox linear algebra
Reductions in black box linear algebra: A in K^(n x n), v in K^n -> Av in K^n
Matrix-vector product: the building block, costs E(n).
Minimal polynomial [Wiedemann 86]: iterative Krylov/Lanczos methods, O(nE(n) + n^2).
Rank, Det, Solve [Chen & al. 02]: reduce to MinPoly + preconditioners, O~(nE(n) + n^2).
Characteristic polynomial [Dumas, P. and Saunders 09]: reduces to MinPoly, Rank, ...
[Reduction graph: CharPoly(Z) -> MinPoly(Z), CharPoly(Zp) [Dumas, P. and Saunders 09]; Det(Zp), LinSys(Zp), Rank(Zp) -> MinPoly(Zp) [Wiedemann 86], [Kaltofen, Saunders 91], [Chen & al. 02]; MinPoly(Zp) -> MatVecProd(Zp) [Wiedemann 86].]
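A minimal sketch of scalar Wiedemann minimal-polynomial computation: project the Krylov sequence u^T A^i v and find its shortest linear recurrence with Berlekamp-Massey. The projections u, v are passed explicitly here; the Monte Carlo algorithm would draw them at random, yielding the true minimal polynomial with high probability.

```python
def berlekamp_massey(s, p):
    # shortest linear recurrence c over GF(p):
    # s[n] = c[0]*s[n-1] + ... + c[L-1]*s[n-L]  (mod p)
    n = len(s)
    C, B = [0] * n, [0] * n
    C[0] = B[0] = 1
    L, m, b = 0, 0, 1
    for i in range(n):
        m += 1
        d = s[i] % p
        for j in range(1, L + 1):
            d = (d + C[j] * s[i - j]) % p
        if d == 0:
            continue
        T, coef = C[:], d * pow(b, -1, p) % p
        for j in range(m, n):
            C[j] = (C[j] - coef * B[j - m]) % p
        if 2 * L <= i:
            L, B, b, m = i + 1 - L, T, d, 0
    return [(-x) % p for x in C[1:L + 1]]

def wiedemann_minpoly(A, u, v, p):
    # projected Krylov sequence s_i = u^T A^i v for i = 0..2n-1,
    # i.e. 2n black-box applications of A, then one Berlekamp-Massey
    n = len(A)
    s, w = [], v[:]
    for _ in range(2 * n):
        s.append(sum(ui * wi for ui, wi in zip(u, w)) % p)
        w = [sum(row[j] * w[j] for j in range(n)) % p for row in A]
    c = berlekamp_massey(s, p)
    # monic minimal polynomial, coefficients from low to high degree
    return [(-ci) % p for ci in reversed(c)] + [1]

p = 101
A = [[1, 0, 0], [0, 2, 0], [0, 0, 3]]          # minpoly (x-1)(x-2)(x-3)
mp = wiedemann_minpoly(A, [1, 1, 1], [1, 1, 1], p)
assert mp == [(-6) % p, 11, (-6) % p, 1]       # x^3 - 6x^2 + 11x - 6
```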
143 Size dimension trade-offs Outline 1 Choosing the underlying arithmetic Using boolean arithmetic Using machine word arithmetic Larger field sizes 2 Reductions and building blocks In dense linear algebra In blackbox linear algebra 3 Size dimension trade-offs Hermite normal form Frobenius normal form 4 Parallel exact linear algebra Ingredients for the parallelization Parallel dense linear algebra mod p C. Pernet Exact Linear Algebra Algorithmic July 6, / 73
144 Size dimension trade-offs
Size dimension trade-offs: computing with coefficients of varying size (Z, Q, K[X], ...)
Multimodular methods: over K[X], evaluation-interpolation; over Z and Q, the Chinese Remainder Theorem.
Cost = algebraic cost x size(output); avoids coefficient blow-up; uniform (worst case) cost for all arithmetic ops.
Example: Hadamard's bound |det(A)| <= (||A|| sqrt(n))^n, hence
LinSys_Z(n) = O(n^ω n(log n + log ||A||)) = O~(n^(ω+1) log ||A||).
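A small sketch of the CRT scheme for the determinant over Z, with the Hadamard bound deciding how many primes are needed (naive Gaussian elimination mod p stands in for the actual building block):

```python
def det_mod_p(A, p):
    # determinant over Z/pZ by Gaussian elimination
    M = [[x % p for x in row] for row in A]
    n, det = len(A), 1
    for j in range(n):
        piv = next((i for i in range(j, n) if M[i][j]), None)
        if piv is None:
            return 0
        if piv != j:
            M[j], M[piv] = M[piv], M[j]
            det = -det
        det = det * M[j][j] % p
        inv = pow(M[j][j], -1, p)
        for i in range(j + 1, n):
            f = M[i][j] * inv % p
            if f:
                M[i] = [(a - f * b) % p for a, b in zip(M[i], M[j])]
    return det % p

def det_crt(A, primes):
    # Hadamard: |det A| <= (||A|| sqrt(n))^n; take enough primes so their
    # product exceeds twice the bound, then CRT + symmetric lift
    n = len(A)
    norm = max((abs(x) for row in A for x in row), default=0)
    bound = norm ** n * n ** ((n + 1) // 2)   # integer bound >= (norm sqrt(n))^n
    r, M = 0, 1
    for p in primes:
        if M > 2 * bound:
            break
        d = det_mod_p(A, p)
        r = r + M * ((d - r) * pow(M, -1, p) % p)   # incremental CRT
        M *= p
    assert M > 2 * bound, "not enough primes"
    return r - M if r > M // 2 else r              # symmetric range

assert det_crt([[1, 2], [3, 4]], [10007, 10009, 10037]) == -2
```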
147 Size dimension trade-offs
Lifting techniques
p-adic lifting [Moenck & Carter 79, Dixon 82]: one computation over Z_p, then iterative lifting of the solution to Z, Q. Example: LinSys_Z(n) = O(n^3 (log ||A||)^(1+ε)).
High-order lifting [Storjohann 02, 03]: fewer iteration steps, a larger dimension in the lifting. Example: LinSys_Z(n) = O~(n^ω log ||A||).
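Dixon's p-adic lifting can be sketched as follows: one matrix inverse mod p, after which each lifting step costs only matrix-vector products. Rational reconstruction is omitted here; the sketch recovers integer solutions from the symmetric residue instead.

```python
def inv_mat_mod_p(A, p):
    # invert A over Z/pZ by Gauss-Jordan on [A | I]
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for j in range(n):
        piv = next(i for i in range(j, n) if M[i][j])
        M[j], M[piv] = M[piv], M[j]
        inv = pow(M[j][j], -1, p)
        M[j] = [x * inv % p for x in M[j]]
        for i in range(n):
            if i != j and M[i][j]:
                f = M[i][j]
                M[i] = [(a - f * b) % p for a, b in zip(M[i], M[j])]
    return [row[n:] for row in M]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dixon_solve(A, b, p, steps):
    # p-adic lifting: after `steps` steps, x = A^{-1} b mod p^steps
    C = inv_mat_mod_p(A, p)
    x, r, pk = [0] * len(b), b[:], 1
    for _ in range(steps):
        xi = [v % p for v in matvec(C, r)]                       # digit mod p
        x = [xj + pk * xij for xj, xij in zip(x, xi)]
        r = [(rj - aj) // p for rj, aj in zip(r, matvec(A, xi))]  # exact division
        pk *= p
    # symmetric lift: recovers the solution when it is an integer
    # vector with entries of absolute value < p^steps / 2
    return [v - pk if v > pk // 2 else v for v in x]

A, x_true = [[2, 1], [1, 3]], [3, -4]
b = matvec(A, x_true)                    # [2, -9]
assert dixon_solve(A, b, 10007, 4) == x_true
```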
149 Size dimension trade-offs
Improving time complexities
Matrix multiplication, the door to fast linear algebra over Z: O(n^ω M(log ||A||)) = O~(n^ω log ||A||).
Equivalence over Z or K[X]: Hermite normal form.
[Kannan & Bachem 79]: polynomial time. [Chou & Collins 82]: O~(n^6 log ||A||). [Domich & al. 87], [Iliopoulos 89]: O~(n^4 log ||A||). [Micciancio & Warinschi 01]: O~(n^5 log^2 ||A||), O~(n^3 log ||A||) heuristic. [Storjohann & Labahn 96]: O~(n^(ω+1) log ||A||). [Gupta & Storjohann 11]: O~(n^3 log ||A||).
Similarity over a field: Frobenius normal form.
[Giesbrecht 93]: O~(n^ω) probabilistic. [Storjohann 00]: O~(n^ω) deterministic. [P. & Storjohann 07]: O(n^ω) probabilistic.
153 Size dimension trade-offs
Building blocks and reductions, in brief
Reductions to a building block:
Matrix Mult, block recursive: sum_(i=1)^(log n) 2^i (n/2^i)^ω = O(n^ω) (Gauss, REF).
Matrix Mult, iterative: sum_(k=1)^n k (n/k)^ω = O(n^ω) (Frobenius).
Linear systems over Z (Hermite normal form).
Size/dimension compromises:
Hermite normal form: rank 1 updates reducing the determinant.
Frobenius normal form: degree k, dimension n/k, for k = 1 ... n.
154 Size dimension trade-offs: Hermite normal form
Hermite normal form: naive algorithm
Reduced echelon form over a ring, with pivots p_j and 0 <= x_(.,j) < p_j above each pivot:
for i = 1 ... n do
  (g, t_i, ..., t_n) = xgcd(A_(i,i), A_(i+1,i), ..., A_(n,i))
  L_i <- sum_(j=i)^n t_j L_j
  for j = i+1 ... n do L_j <- L_j - (A_(j,i)/g) L_i   {eliminate}
  for j = 1 ... i-1 do L_j <- L_j - floor(A_(j,i)/g) L_i   {reduce}
end for
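The xgcd-based elimination above can be sketched for a nonsingular integer matrix (row-style Hermite form; no care is taken here about coefficient growth, which is precisely what the modular techniques on the following slides address):

```python
def xgcd(a, b):
    # extended gcd: returns (g, s, t) with s*a + t*b = g
    old_r, r, old_s, s, old_t, t = a, b, 1, 0, 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t

def hnf(A):
    # row-style HNF of a nonsingular integer matrix by unimodular row
    # operations: create each pivot by an xgcd combination of two rows,
    # eliminate below it, then reduce the entries above it
    A = [row[:] for row in A]
    n = len(A)
    for i in range(n):
        for k in range(i + 1, n):
            if A[k][i] == 0:
                continue
            g, s, t = xgcd(A[i][i], A[k][i])
            u, v = A[i][i] // g, A[k][i] // g
            # [[s, t], [-v, u]] has determinant 1: unimodular update
            A[i], A[k] = ([s * x + t * y for x, y in zip(A[i], A[k])],
                          [u * y - v * x for x, y in zip(A[i], A[k])])
        if A[i][i] < 0:                       # normalize to a positive pivot
            A[i] = [-x for x in A[i]]
        for k in range(i):                    # reduce entries above the pivot
            q = A[k][i] // A[i][i]
            A[k] = [x - q * y for x, y in zip(A[k], A[i])]
    return A

assert hnf([[2, 4], [1, 3]]) == [[1, 1], [0, 2]]
```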
155 Size dimension trade-offs: Hermite normal form
Computing modulo the determinant [Domich & al. 87]
Property: for A non-singular, max_(i,j) H_(i,j) <= det H = det A.
Every computation can therefore be done modulo d = det A, via the HNF of the stacked matrix: [[U, *], [*, *]] [[A], [d I_n]] = [[H], [0]].
[Example: a 3x3 matrix A with det A = 1944 and its Hermite form H.]
Cost: O(n^3) M(n(log n + log ||A||)) = O~(n^5 log^2 ||A||).
157 Size dimension trade-offs: Hermite normal form
Computing modulo the determinant
Pessimistic estimate on the arithmetic size: d is large, but most coefficients of H are small; on average only the last few columns are large. Idea: compute an H close to H but with a small determinant.
[Micciancio & Warinschi 01]: split off the last two rows and the last column,
A = [[B, b], [c^T, a_(n-1,n)], [d^T, a_(n,n)]];
d_1 = det([[B], [c^T]]), d_2 = det([[B], [d^T]]), g = gcd(d_1, d_2) = s d_1 + t d_2.
Then det([[B], [s c^T + t d^T]]) = g.
159 Size dimension trade-offs: Hermite normal form
Micciancio & Warinschi algorithm
1 Compute d_1 = det([[B], [c^T]]), d_2 = det([[B], [d^T]])   (Double Det)
2 (g, s, t) = xgcd(d_1, d_2)
3 Compute H_1, the HNF of [[B], [s c^T + t d^T]] mod g   (Modular HNF)
4 Recover H_2, the HNF of [[B, b], [s c^T + t d^T, s a_(n-1,n) + t a_(n,n)]]   (AddCol)
5 Recover H_3, the HNF of [[B, b], [c^T, a_(n-1,n)], [d^T, a_(n,n)]]   (AddRow)
161 Size dimension trade-offs: Hermite normal form
Double determinant
First approach: LU mod p_1, ..., p_k + CRT. Only one elimination for the n-2 first rows; 2 updates for the last rows (triangular back substitution); k large, such that prod_(i=1)^k p_i > n^(n/2) ||A||^n.
Second approach [Abbott, Bronstein and Mulders 99]: solve Ax = b; let δ = lcm(q_1, ..., q_n) where x_i = p_i/q_i. Then δ is a large divisor of D = det A. Compute D/δ by LU mod p_1, ..., p_k + CRT, with k small, such that prod_(i=1)^k p_i > n^(n/2) ||A||^n / δ.
163 Size dimension trade-offs: Hermite normal form
Double determinant: improved
Property: let x = [x_1, ..., x_n] be the solution of [A | c] x = d. Then y = [-x_1/x_n, ..., -x_(n-1)/x_n, 1/x_n] is the solution of [A | d] y = c.
Hence 1 system solve + 1 LU for each p_i: d_1 and d_2 are computed at about the cost of one determinant.
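The property can be checked exactly with rationals on a tiny 2x2 instance (hypothetical data; note the minus signs on the first n-1 entries of y, which the derivation requires):

```python
from fractions import Fraction

def solve2(M, rhs):
    # exact 2x2 solve by Cramer's rule
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [Fraction(rhs[0] * M[1][1] - rhs[1] * M[0][1], det),
            Fraction(M[0][0] * rhs[1] - M[1][0] * rhs[0], det)]

# A is n x (n-1); [A | c] and [A | d] are the two bordered square systems
A = [[2], [3]]
c, d = [1, 1], [5, 7]

x = solve2([[A[0][0], c[0]], [A[1][0], c[1]]], d)   # solve [A | c] x = d
assert x == [2, 1]
# y = [-x_1/x_n, ..., -x_(n-1)/x_n, 1/x_n] then solves [A | d] y = c
y = [-x[0] / x[1], 1 / x[1]]
lhs = [A[i][0] * y[0] + d[i] * y[1] for i in range(2)]
assert lhs == [Fraction(v) for v in c]
```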
164 Size dimension trade-offs: Hermite normal form
AddCol
Problem: find the vector e such that [H_1 | e] = U [[B, b], [s c^T + t d^T, s a_(n-1,n) + t a_(n,n)]], i.e.
e = U [[b], [s a_(n-1,n) + t a_(n,n)]] = H_1 [[B], [s c^T + t d^T]]^(-1) [[b], [s a_(n-1,n) + t a_(n,n)]].
Solve a system: the n-1 first rows are small, but the last row is large.
Idea: replace the last row by a random small one, w^T, and solve [[B], [w^T]] y = [[b], [a]].
Let {k} be a basis of the kernel of B. Then x = y + α k, where
α = (s a_(n-1,n) + t a_(n,n) - (s c^T + t d^T) y) / ((s c^T + t d^T) k).
This limits the expensive arithmetic to a few dot products.
166 Size dimension trade-offs: Frobenius normal form
Computing the Frobenius normal form
Definition: the unique F = U^(-1) A U = Diag(C_(f_0), ..., C_(f_k)) with f_k | f_(k-1) | ... | f_0.
167 Size dimension trade-offs Frobenius normal form Computing the Frobenius normal form [P. & Storjohann 07] k-shifted form: k k k C. Pernet Exact Linear Algebra Algorithmic July 6, / 73
168 Size dimension trade-offs Frobenius normal form Computing the Frobenius normal form [P. & Storjohann 07] k + 1-shifted form: k + 1 k + 1 d l C. Pernet Exact Linear Algebra Algorithmic July 6, / 73
169 Size dimension trade-offs Frobenius normal form Computing the Frobenius normal form [P. & Storjohann 07] 0 1 k + 1-shifted form: k + 1 k + 1 d l From k to k + 1-shifted in O(k( n k )ω ) Compute iteratively from a 1-shifted form Invariant factors appear by increasing degree C. Pernet Exact Linear Algebra Algorithmic July 6, / 73
170 Size dimension trade-offs Frobenius normal form Computing the Frobenius normal form [P. & Storjohann 07] 0 1 Hessenberg polycyclic: d 1 d l From k to k + 1-shifted in O(k( n k )ω ) d 2 1 Compute iteratively from a 1-shifted form Invariant factors appear by increasing degree Until the Hessenberg polycyclic form C. Pernet Exact Linear Algebra Algorithmic July 6, / 73
171 Size dimension trade-offs Frobenius normal form Computing the Frobenius normal form [P. & Storjohann 07] 0 1 Hessenberg polycyclic: d 1 d l From k to k + 1-shifted in O(k( n k )ω ) d 2 1 Compute iteratively from a 1-shifted form Invariant factors appear by increasing degree Until the Hessenberg polycyclic form n ( ) 1 ω 1 ζ(ω 1)n ω = O(n ω ) k n ω k= C. Pernet Exact Linear Algebra Algorithmic July 6, / 73
172 Size dimension trade-offs: Frobenius normal form

Computing the Frobenius normal form [P. & Storjohann 07]

[diagram: Hessenberg polycyclic form]

- From a k-shifted to a (k+1)-shifted form in O(k (n/k)^ω)
- Compute iteratively from a 1-shifted form
- Invariant factors appear by increasing degree
- Until the Hessenberg polycyclic form

Overall cost: Σ_{k=1}^{n} k (n/k)^ω ≤ ζ(ω-1) n^ω = O(n^ω)

- Generalized to the Frobenius form as well
- Transformation matrix in O(n^ω log log n)
173 Size dimension trade-offs: Frobenius normal form

A new type of size/dimension trade-off

- xI_n - A: dimension = n, degree = 1
- det(xI_n - A): dimension = 1, degree = n
174 Size dimension trade-offs: Frobenius normal form

A new type of size/dimension trade-off

- xI_n - A: dimension = n, degree = 1
- Keller-Gehrig: dimension = n/2^i, degree = 2^i
- det(xI_n - A): dimension = 1, degree = n
175 Size dimension trade-offs: Frobenius normal form

A new type of size/dimension trade-off

- xI_n - A: dimension = n, degree = 1
- Keller-Gehrig: dimension = n/2^i, degree = 2^i
- New algorithm: dimension = n/k, degree = k
- det(xI_n - A): dimension = 1, degree = n
176 Parallel exact linear algebra

Outline

1 Choosing the underlying arithmetic
  - Using boolean arithmetic
  - Using machine word arithmetic
  - Larger field sizes
2 Reductions and building blocks
  - In dense linear algebra
  - In blackbox linear algebra
3 Size dimension trade-offs
  - Hermite normal form
  - Frobenius normal form
4 Parallel exact linear algebra
  - Ingredients for the parallelization
  - Parallel dense linear algebra mod p
177 Parallel exact linear algebra: Ingredients for the parallelization

Parallelization

Parallel numerical linear algebra:
- cost invariant wrt. the splitting: O(n^3)
- fine grain
- block iterative algorithms
- regular task load
- numerical stability constraints
178 Parallel exact linear algebra: Ingredients for the parallelization

Parallelization

Parallel numerical linear algebra:
- cost invariant wrt. the splitting: O(n^3)
- fine grain
- block iterative algorithms
- regular task load
- numerical stability constraints

Exact linear algebra specificities:
- cost affected by the splitting: O(n^ω) for ω < 3
- modular reductions
- coarse grain
- recursive algorithms
- rank deficiencies: unbalanced task loads
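One of the exact-specific costs above, modular reduction, is commonly mitigated by delaying reductions: accumulate a dot product over Z in a machine word and reduce mod p once at the end. A simplified sketch (plain C; the overflow bound stated in the comment is an assumption, not something taken from the slides):

```c
#include <stdint.h>
#include <stddef.h>

/* Dot product over Z/pZ with a single, delayed modular reduction.
 * Assumes 0 <= a[i], b[i] < p and n * (p-1)^2 < 2^64 (e.g. p < 2^16
 * and n < 2^32), so the unsigned accumulator cannot overflow. */
uint64_t dot_mod(const uint64_t *a, const uint64_t *b, size_t n, uint64_t p) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];   /* no reduction inside the loop */
    return acc % p;           /* one reduction at the end */
}
```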
179 Parallel exact linear algebra: Ingredients for the parallelization

Ingredients for the parallelization

Criteria:
- good performance
- portability across architectures
- abstraction for simplicity

Challenging key point: scheduling as a plugin
- Program: only describes where the parallelism lies
- Runtime: scheduling & mapping, depending on the context of execution

3 main models:
1 Parallel loop [data parallelism]
2 Fork-join (independent tasks) [task parallelism]
3 Dependent tasks with data flow dependencies [task parallelism]
180 Parallel exact linear algebra: Ingredients for the parallelization

Data parallelism (OpenMP)

    for (int step = 0; step < 2; ++step) {
        #pragma omp parallel for
        for (int i = 0; i < count; ++i)
            A[i] = (B[i+1] + B[i-1] + 2.0*B[i]) * 0.25;
    }

Limitations:
- very inefficient with recursive parallel regions
- limited to iterative algorithms
- no composition of routines
181 Parallel exact linear algebra: Ingredients for the parallelization

Task parallelism with fork-join

Task-based program: spawn + sync. Especially suited for recursive programs.

OpenMP (since v3):

    void fibonacci(long *result, long n) {
        if (n < 2)
            *result = n;
        else {
            long x, y;
            #pragma omp task shared(x)  /* shared(x): the child task writes the parent's x */
            fibonacci(&x, n - 1);
            fibonacci(&y, n - 2);
            #pragma omp taskwait
            *result = x + y;
        }
    }
182 Parallel exact linear algebra: Ingredients for the parallelization

Tasks with dataflow dependencies

- Task-based model avoiding synchronizations
- Infer synchronizations from the read/write specifications
- A task is ready for execution when all its input variables are ready
- A variable is ready when it has been written

Recently supported: Athapascan [96], Kaapi [06], StarSs [07], StarPU [08], Quark [10], OpenMP since v4 [14], ...
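In OpenMP 4, these read/write specifications are expressed with `depend` clauses, and the runtime infers the ordering. A minimal sketch (an illustration, not from the slides; if OpenMP is disabled the pragmas are ignored and the code runs serially with the same result):

```c
/* Chain x -> y -> z through dataflow dependencies: each task becomes
 * ready once its inputs (depend(in:)) have been written by the task
 * that declared them as outputs (depend(out:)). */
int pipeline(int start) {
    int x = 0, y = 0, z = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)
        x = start + 1;
        #pragma omp task depend(in: x) depend(out: y)
        y = 2 * x;
        #pragma omp task depend(in: y) depend(out: z)
        z = y + 3;
        #pragma omp taskwait
    }
    return z;
}
```

No explicit synchronization is written between the three tasks: the in/out declarations alone force the execution order x, then y, then z.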