Fast Matrix Multiplication Over GF3
1 Fast Matrix Multiplication Over GF3 Matthew Lambert Advised by Dr. B. David Saunders & Dr. Stephen Siegel February 13, 2015
2 Matrix Multiplication
6 Outline: Fast Matrix Multiplication
Strassen's algorithm: well-studied O(n^2.81) divide-and-conquer algorithm.
Method of the Four Russians (Arlazarov et al., 1970): O(n³/log n) algorithm that is effective over small finite fields (GF2, GF3, GF5). Very well studied for GF2 and anecdotally good for GF3.
Aim: find thresholds for each algorithm over GF3.
9 Arithmetic over GF3
Simply arithmetic mod 3.
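The slide above can be made concrete with the addition and multiplication tables, which are just integer arithmetic reduced mod 3:

```python
# Addition and multiplication tables for GF3 = Z/3Z.
add = [[(a + b) % 3 for b in range(3)] for a in range(3)]
mul = [[(a * b) % 3 for b in range(3)] for a in range(3)]

# e.g. 1 + 2 = 0 and 2 * 2 = 1 in GF3
print(add[1][2], mul[2][2])
```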
14 Representation of GF3
With only three possible values, we need two bits for each element.
Packed storage: elements stored consecutively. To account for a possible carry when adding, we actually need 3 bits for each element; 5 bit ops to add two 21-element words.
Bitsliced storage: low bits and high bits stored separately (Boothby & Bradshaw, 2009). Store 2 as 11₂ instead of 10₂. 6 bit ops to add or subtract two 64-element SlicedUnits; 1 bit op to negate.
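A small sketch of the bitsliced layout with the 2 = 11₂ encoding (the helper names are illustrative, not from the thesis implementation; Python ints stand in for 64-bit words):

```python
def to_sliced(elems):
    """Pack a list of GF3 elements (0, 1, 2) into a (low, high) pair of bit words.
    Encoding: 0 -> 00, 1 -> 01, 2 -> 11, so the high bit is set only for 2."""
    low = high = 0
    for i, e in enumerate(elems):
        if e:             # 1 and 2 both set the low bit
            low |= 1 << i
        if e == 2:        # 2 additionally sets the high bit
            high |= 1 << i
    return low, high

def from_sliced(low, high, n):
    """Decode n elements: with this encoding, the value is low bit + high bit."""
    return [((low >> i) & 1) + ((high >> i) & 1) for i in range(n)]

def neg_sliced(low, high):
    """Negation (x -> -x mod 3) in one bit operation: swaps 1 and 2, fixes 0."""
    return low, low ^ high
```

Every element of a word is negated with the single xor in `neg_sliced`, which is the "1 bit op to negate" claim above.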
19 Matrix Multiplication over GF2
Row i of A determines row i of C: C_i = sum_{j=0}^{n-1} A_{i,j} B_j.
Over GF2, row addition is an xor operation.
We perform up to n² row additions, yielding O(n³) running time.
We have a speedup if we do not have to do O(n²) row additions.
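The row-oriented view above can be sketched in a few lines of Python, with each row stored as a bit-packed integer (bit j of a row is column j), so a row addition is one xor:

```python
def gf2_matmul_rows(A, B, n):
    """C_i = sum_j A_{i,j} * B_j over GF2; A and B are lists of bit-packed row ints."""
    C = []
    for i in range(n):
        row = 0
        for j in range(n):
            if (A[i] >> j) & 1:   # A_{i,j} = 1: row j of B contributes to row i of C
                row ^= B[j]       # row addition over GF2 is a single xor
        C.append(row)
    return C
```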
22 Four Russians Multiplication over GF2
Instead of indexing one bit at a time, we index with t = 2 bits at a time, yielding n²/t additions and thus O(n³/t) running time.
With multiple bits as index, we are adding multiple rows at once.
We need to quickly compute the linear combinations of the t = 2 rows we are adding.
26 Four Russians Multiplication over GF2
Example table for t = 2, where bit i of the index selects row r_{i+1}:
index  contents
00     0
01     r1
10     r2
11     r1 + r2
32 Four Russians Multiplication over GF2: Fast table creation
We perform n²/t row additions using n/t tables.
The additions are performed in O(n³/t) time.
The tables can be constructed in O(2^t n²/t) time: i.e., one vector addition for each of the 2^t rows in each of the n/t tables.
index  description
000    0
001    r1
010    r2
011    r1 + r2
100    r3
101    r1 + r3
110    r2 + r3
111    r1 + r2 + r3
35 Four Russians Multiplication over GF2: Algorithm
Data: A, B, C, n x n matrices
Result: C ← A · B
for i ← 0 to n/t do
    T ← table of the 2^t combinations of rows [i·t, (i+1)·t) of B
    for j ← 0 to n do
        a ← bits [i·t, (i+1)·t) of row j of A
        C_j ← C_j + T[a]
    end
end
Running time: n²/t vector adds, totaling O(n³/t) time; n/t tables created, totaling O(2^t n²/t) time.
Let t = log₂(n); then the additions take O(n³/log₂(n)) time and the tables take O(n³/log₂(n)) time to create.
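A direct, unoptimized sketch of the GF2 algorithm above, again with rows as bit-packed integers; each table entry is built with one xor from a previously computed entry, matching the fast-table-creation argument:

```python
def four_russians_gf2(A, B, n, t=4):
    """C = A * B over GF2. Rows are ints; bit j of a row int is column j."""
    C = [0] * n
    for i in range(0, n, t):
        w = min(t, n - i)                  # width of this block of rows of B
        T = [0] * (1 << w)                 # T[a] = xor of rows of B selected by bits of a
        for a in range(1, 1 << w):
            lsb = a & -a                   # one new row xored onto a smaller entry
            T[a] = T[a ^ lsb] ^ B[i + lsb.bit_length() - 1]
        for j in range(n):
            a = (A[j] >> i) & ((1 << w) - 1)   # bits [i, i+w) of row j of A
            C[j] ^= T[a]                   # one table lookup replaces w row additions
    return C
```

Each row of A now triggers n/t lookups instead of up to n row additions, which is exactly the O(n³/t) count from the slide.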
41 Four Russians Multiplication over GF3: Considerations
In GF2, we had 2^t row combinations to compute. In GF3, we have to compute 3^t combinations of rows.
In GF2, extracting bits for indices was easy; with the bitsliced representation, this is problematic for GF3.
How do we use the bitsliced bits of a combination such as r2 − r3 as an index, when the pattern read as a binary number does not fall in the range 0 to 3^t?
If t = 3, there are only 27 combinations of rows, but the low bits and the high bits each range between 000 and 111.
It is too expensive to map an index into the range 0 to 3^t directly, so we concatenate the high and low bits to give an index between 0 and 4^t.
How do our tables look with these indices?
45 Four Russians Multiplication over GF3: Tables
Three approaches were considered to address the creation of row combinations and the indexing problem:
Use one table of 2^t rows and add with it twice.
Use one table of 4^t rows and add with it once.
Use one table of 3^t rows plus a table of 4^t indices, and add with it once.
47 Four Russians Multiplication over GF3: 2^t approach
First approach: create a table of 2^t combinations of rows as in the GF2 case. Index once into the table with the low t bits and once with the high t bits. The 2 = 11₂ representation used in the bitsliced storage is advantageous here: an element's value equals its low bit plus its high bit, so the two lookups together contribute each row the correct number of times.
index  contents
00     0
01     r1
10     r2
11     r1 + r2
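A sketch of why indexing the same 2^t table twice works under the 2 = 11₂ encoding, shown on small GF3 row vectors rather than bitsliced words (the helper names are illustrative):

```python
def subset_table(rows, t):
    """T[a] = GF3 sum of the rows selected by the bits of a (2^t entries)."""
    n = len(rows[0])
    T = [[0] * n]
    for a in range(1, 1 << t):
        lsb = a & -a
        prev, r = T[a ^ lsb], rows[lsb.bit_length() - 1]
        T.append([(p + x) % 3 for p, x in zip(prev, r)])
    return T

def combine(rows, coeffs):
    """sum_i coeffs[i] * rows[i] mod 3 via two lookups into the 2^t table.
    Since 1 = 01 and 2 = 11, each coefficient is its low bit plus its high bit."""
    t = len(coeffs)
    low = sum((1 << i) for i, c in enumerate(coeffs) if c >= 1)
    high = sum((1 << i) for i, c in enumerate(coeffs) if c == 2)
    T = subset_table(rows, t)
    return [(a + b) % 3 for a, b in zip(T[low], T[high])]
```

A coefficient of 2 sets both the low and high index bits, so its row is looked up twice, i.e. added twice, which is adding 2·r over GF3.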
49 Four Russians Multiplication over GF3: 4^t approach
Second approach: create a table of 4^t rows containing all 3^t combinations of rows and some unused rows. We index into the rows directly. For t = 2, with the index formed as high bits followed by low bits:
index  contents          index  contents
0000   0                 1010   −r2
0001   r1                1011   r1 − r2
0010   r2                1111   −r1 − r2
0011   r1 + r2
0101   −r1               (indices with a high bit set
0111   −r1 + r2          but not the low bit are unused)
51 Four Russians Multiplication over GF3: 3^t approach
Third approach: create a table of 3^t rows and a table of 4^t indices (or pointers) to map each bitsliced index to a combination. For t = 2, the row table holds the nine combinations 0, r1, r2, r1 + r2, −r1, −r2, r1 − r2, −r1 + r2, −r1 − r2, and the index table maps each of the 16 concatenated bit patterns (the 9 valid ones plus 7 unused) to the position of its combination in the row table.
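The supplemental index table for the 3^t approach can be sketched as follows: enumerate all 4^t concatenated patterns once and map each valid one to the rank of its combination in the 3^t-row table (a sketch; the names and the base-3 slot numbering are illustrative):

```python
def build_index_map(t):
    """Map each 2t-bit index (high bits << t | low bits) to a slot in 0..3^t - 1.
    A pattern is valid when every position has (high, low) in {00, 01, 11};
    the encoded coefficient of row i is its low bit plus its high bit.
    Invalid patterns map to None."""
    idx = [None] * (1 << (2 * t))
    for pattern in range(1 << (2 * t)):
        high, low = pattern >> t, pattern & ((1 << t) - 1)
        if high & ~low:              # a high bit set without its low bit: unused
            continue
        slot = 0
        for i in range(t):           # coefficient of row i becomes a base-3 digit
            c = ((low >> i) & 1) + ((high >> i) & 1)
            slot += c * 3 ** i
        idx[pattern] = slot
    return idx
```

Only one such map is needed per value of t, which is the "supplemental table" point on the next slide.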
54 Four Russians Multiplication over GF3: Table Summary
method  memory cost              element access  adds per t columns
2^t     2^t rows                 direct          2
3^t     3^t rows + 4^t indices   indirect        1
4^t     4^t rows                 direct          1
Because the 3^t and 4^t approaches contain ± each combination of rows, we only need the six-operation addition to construct half of the elements; the one-operation negation constructs the other half.
Only one supplemental index table for the 3^t approach needs to be created for each value of t used.
57 Four Russians Multiplication over GF3: Results
Initial development showed the 4^t method to be considerably slower, so it was abandoned in favor of optimizing the other two approaches.
Table: running times for various n (columns: n, 3^t time, 2^t time, classical time).
63 Four Russians Multiplication over GF3: Conclusions
Reasonable speedup over classical: 2.5x (and improving).
The 3^t approach is not as successful as it theoretically should be.
Larger tables do not fit into L1 cache as well as 2^t-sized tables?
More complicated access increases cost?
Some precedent for multiple additions: M4RI (Albrecht, 2010) increases the number of additions if it means smaller tables, so multiple tables can fit into L1 cache.
Preliminary testing with multiple tables sees the 3^t approach beat the 2^t approach's 1.01 seconds.
65 Other Four Russians Thoughts
We assume we can operate on full 64-bit words; performance decreases if we must not modify unused bits.
If t = log(n), then the table of row combinations is the same size as the matrix, so there are definite practical limitations.
67 Strassen's Algorithm vs Classical Divide-and-Conquer
Classical divide-and-conquer multiplication was compared against Strassen multiplication on top of the 3^t approach of the Method of the Four Russians. Strassen was faster at all tested dimensions.
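A minimal Strassen sketch over GF3 on plain nested lists (n a power of 2, with a classical base case; the thesis implementation instead recurses down to bitsliced Four Russians base cases):

```python
def strassen_gf3(A, B, base=2):
    """Multiply two n x n matrices over GF3 (n a power of 2) using Strassen's
    seven recursive products instead of the classical eight."""
    n = len(A)
    if n <= base:  # classical O(n^3) base case, reduced mod 3
        return [[sum(A[i][k] * B[k][j] for k in range(n)) % 3 for j in range(n)]
                for i in range(n)]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[(a + b) % 3 for a, b in zip(x, y)] for x, y in zip(X, Y)]
    sub = lambda X, Y: [[(a - b) % 3 for a, b in zip(x, y)] for x, y in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    M1 = strassen_gf3(add(A11, A22), add(B11, B22), base)
    M2 = strassen_gf3(add(A21, A22), B11, base)
    M3 = strassen_gf3(A11, sub(B12, B22), base)
    M4 = strassen_gf3(A22, sub(B21, B11), base)
    M5 = strassen_gf3(add(A11, A12), B22, base)
    M6 = strassen_gf3(sub(A21, A11), add(B11, B12), base)
    M7 = strassen_gf3(sub(A12, A22), add(B21, B22), base)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return [C11[i] + C12[i] for i in range(h)] + [C21[i] + C22[i] for i in range(h)]
```

Note that GF3 subtraction is cheap here (a mod-3 subtract per entry), mirroring the one-operation bitsliced negation that makes Strassen's additive pre-combinations inexpensive in the real implementation.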
68 Strassen's Algorithm: 2^t vs 3^t as base case
Three levels of recursion were performed. Even with base cases where the 3^t approach should outperform the 2^t approach, the 2^t base case yielded faster results in all cases (though the multiple-tables method outperforms 2^t by about 14%).
71 Strassen's Algorithm: Conclusions and Thresholds
Strassen's algorithm is faster than classical, and due to memory requirements it is at some point faster than the Method of the Four Russians.
When using a base case of the 2^t approach, the threshold to switch from Strassen's algorithm to Four Russians on the machine used is somewhere between 960 and 1728.
Normally Strassen's algorithm works best on matrices with dimensions a power of 2, or a power of 2 times the base-case dimension. With bitsliced or packed storage it is important to keep matrix dimensions multiples of 64 at all levels of recursion, as working with less than one machine word is more expensive.
Further improvements ongoing?
75 Summary
Successfully developed and implemented the Method of the Four Russians over GF3, yielding a noticeable performance gain over classical multiplication.
Implemented Strassen's algorithm on top of the Method of the Four Russians.
Different implementations of the Method of the Four Russians might be preferred depending on the specific problem and machine.
76 Best value of t
For GF2 we can assume all operations take essentially equal time: the only distinct cost is extracting bits; all other costs are 64-bit xors.
For (m x n) · (n x k) = (m x k), we spend mnk/t time adding and nk·2^t/t time creating tables.
Minimize mnk/t + nk·2^t/t.
The partial derivative w.r.t. t is (kn/t²)·(2^t(t·ln 2 − 1) − m).
Setting this equal to zero and solving gives t = (W(m/e) + 1)/ln 2.
Table: dimension vs. optimal t and log₂(dim).
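The closed form above uses the Lambert W function (W(x)·e^{W(x)} = x). A quick numeric sketch, using Newton's method for W rather than any particular library, shows how the optimal t compares with log₂ of the dimension:

```python
import math

def lambert_w(x, iters=50):
    """Principal branch of the Lambert W function (w * e^w = x) for x > 0,
    computed with Newton's method."""
    w = math.log(1 + x)  # reasonable starting point for x > 0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (1 + w))
    return w

def best_t_gf2(m):
    """Optimal table width t for GF2: t = (W(m/e) + 1) / ln 2."""
    return (lambert_w(m / math.e) + 1) / math.log(2)
```

For example, `best_t_gf2(1024)` comes out below log₂(1024) = 10, consistent with the slide's point that the cost-minimizing t is smaller than the textbook t = log₂(n).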
77 Selecting t in GF3
For the 3^t approach: we spend mnk/t time adding and nk·3^t/t time creating tables, giving t = (W(m/e) + 1)/ln 3.
For the 2^t approach: we spend 2mnk/t time adding and nk·2^t/t time creating tables, giving t = (W(2m/e) + 1)/ln 2.
Tables: dimension vs. minimizing t, experimental t, and log₃(dim) for the 3^t approach; dimension vs. minimizing t, experimental t, and log₂(dim) for the 2^t approach.
78 GF3 bitsliced operations
(x0, y0, r0 denote the low-bit words; x1, y1, r1 the high-bit words.)

s ← x0 ⊕ y1
t ← x1 ⊕ y0
r1 ← s ∧ t
r0 ← (x0 ⊕ y0) ∨ (s ⊕ x1)
Figure: GF3 bitsliced addition in six operations: r ← x + y

t ← x0 ⊕ y0
r0 ← t ∨ (x1 ⊕ y1)
r1 ← (t ⊕ y1) ∧ (y0 ⊕ x1)
Figure: GF3 bitsliced subtraction in six operations: r ← x − y

r0 ← x0
r1 ← x0 ⊕ x1
Figure: GF3 bitsliced negation in one operation: r ← −x
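Six-operation bitsliced addition and subtraction circuits for the 2 = 11₂ encoding can be checked exhaustively in a few lines (shown on single bits; on 64-bit words the same expressions apply lane-wise). The exact gate choices here are one valid reconstruction, verified below against mod-3 arithmetic:

```python
def gf3_add_sliced(x0, x1, y0, y1):
    """Six-op bitsliced GF3 addition; (low, high) encoding with 2 = 11."""
    s = x0 ^ y1
    t = x1 ^ y0
    r1 = s & t
    r0 = (x0 ^ y0) | (s ^ x1)
    return r0, r1

def gf3_sub_sliced(x0, x1, y0, y1):
    """Six-op bitsliced GF3 subtraction."""
    t = x0 ^ y0
    r0 = t | (x1 ^ y1)
    r1 = (t ^ y1) & (y0 ^ x1)
    return r0, r1

ENC = {0: (0, 0), 1: (1, 0), 2: (1, 1)}     # value -> (low, high)
DEC = {v: k for k, v in ENC.items()}

# Exhaustive check of both circuits over all nine element pairs.
for x in range(3):
    for y in range(3):
        x0, x1 = ENC[x]
        y0, y1 = ENC[y]
        assert DEC[gf3_add_sliced(x0, x1, y0, y1)] == (x + y) % 3
        assert DEC[gf3_sub_sliced(x0, x1, y0, y1)] == (x - y) % 3
```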
Outline 1 midterm exam on Friday 11 July 2014 policies for the first part 2 questions with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014 Intro
More information1 Short adders. t total_ripple8 = t first + 6*t middle + t last = 4t p + 6*2t p + 2t p = 18t p
UNIVERSITY OF CALIFORNIA College of Engineering Department of Electrical Engineering and Computer Sciences Study Homework: Arithmetic NTU IC54CA (Fall 2004) SOLUTIONS Short adders A The delay of the ripple
More informationParallelism and Machine Models
Parallelism and Machine Models Andrew D Smith University of New Brunswick, Fredericton Faculty of Computer Science Overview Part 1: The Parallel Computation Thesis Part 2: Parallelism of Arithmetic RAMs
More informationLecture 19: The Determinant
Math 108a Professor: Padraic Bartlett Lecture 19: The Determinant Week 10 UCSB 2013 In our last class, we talked about how to calculate volume in n-dimensions Specifically, we defined a parallelotope:
More informationBindel, Fall 2016 Matrix Computations (CS 6210) Notes for
1 Logistics Notes for 2016-08-26 1. Our enrollment is at 50, and there are still a few students who want to get in. We only have 50 seats in the room, and I cannot increase the cap further. So if you are
More informationHardware Design I Chap. 4 Representative combinational logic
Hardware Design I Chap. 4 Representative combinational logic E-mail: shimada@is.naist.jp Already optimized circuits There are many optimized circuits which are well used You can reduce your design workload
More informationReductions, Recursion and Divide and Conquer
Chapter 5 Reductions, Recursion and Divide and Conquer CS 473: Fundamental Algorithms, Fall 2011 September 13, 2011 5.1 Reductions and Recursion 5.1.0.1 Reduction Reducing problem A to problem B: (A) Algorithm
More informationCS 542G: Conditioning, BLAS, LU Factorization
CS 542G: Conditioning, BLAS, LU Factorization Robert Bridson September 22, 2008 1 Why some RBF Kernel Functions Fail We derived some sensible RBF kernel functions, like φ(r) = r 2 log r, from basic principles
More informationCS227-Scientific Computing. Lecture 4: A Crash Course in Linear Algebra
CS227-Scientific Computing Lecture 4: A Crash Course in Linear Algebra Linear Transformation of Variables A common phenomenon: Two sets of quantities linearly related: y = 3x + x 2 4x 3 y 2 = 2.7x 2 x
More informationSOLVING LINEAR SYSTEMS
SOLVING LINEAR SYSTEMS We want to solve the linear system a, x + + a,n x n = b a n, x + + a n,n x n = b n This will be done by the method used in beginning algebra, by successively eliminating unknowns
More informationMultiplicative Complexity Reductions in Cryptography and Cryptanalysis
Multiplicative Complexity Reductions in Cryptography and Cryptanalysis THEODOSIS MOUROUZIS SECURITY OF SYMMETRIC CIPHERS IN NETWORK PROTOCOLS - ICMS - EDINBURGH 25-29 MAY/2015 1 Presentation Overview Linearity
More informationSparse Polynomial Multiplication and Division in Maple 14
Sparse Polynomial Multiplication and Division in Maple 4 Michael Monagan and Roman Pearce Department of Mathematics, Simon Fraser University Burnaby B.C. V5A S6, Canada October 5, 9 Abstract We report
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program
More informationPARALLEL MULTIPLICATION IN F 2
PARALLEL MULTIPLICATION IN F 2 n USING CONDENSED MATRIX REPRESENTATION Christophe Negre Équipe DALI, LP2A, Université de Perpignan avenue P Alduy, 66 000 Perpignan, France christophenegre@univ-perpfr Keywords:
More informationAnalysis of Algorithm Efficiency. Dr. Yingwu Zhu
Analysis of Algorithm Efficiency Dr. Yingwu Zhu Measure Algorithm Efficiency Time efficiency How fast the algorithm runs; amount of time required to accomplish the task Our focus! Space efficiency Amount
More informationWilliam Stallings Copyright 2010
A PPENDIX E B ASIC C ONCEPTS FROM L INEAR A LGEBRA William Stallings Copyright 2010 E.1 OPERATIONS ON VECTORS AND MATRICES...2 Arithmetic...2 Determinants...4 Inverse of a Matrix...5 E.2 LINEAR ALGEBRA
More informationMathematics for Engineers. Numerical mathematics
Mathematics for Engineers Numerical mathematics Integers Determine the largest representable integer with the intmax command. intmax ans = int32 2147483647 2147483647+1 ans = 2.1475e+09 Remark The set
More informationWhat s the best data structure for multivariate polynomials in a world of 64 bit multicore computers?
What s the best data structure for multivariate polynomials in a world of 64 bit multicore computers? Michael Monagan Center for Experimental and Constructive Mathematics Simon Fraser University British
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program
More informationToward High Performance Matrix Multiplication for Exact Computation
Toward High Performance Matrix Multiplication for Exact Computation Pascal Giorgi Joint work with Romain Lebreton (U. Waterloo) Funded by the French ANR project HPAC Séminaire CASYS - LJK, April 2014 Motivations
More informationA Review of Matrix Analysis
Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value
More informationFields in Cryptography. Çetin Kaya Koç Winter / 30
Fields in Cryptography http://koclab.org Çetin Kaya Koç Winter 2017 1 / 30 Field Axioms Fields in Cryptography A field F consists of a set S and two operations which we will call addition and multiplication,
More informationImage Compression. 1. Introduction. Greg Ames Dec 07, 2002
Image Compression Greg Ames Dec 07, 2002 Abstract Digital images require large amounts of memory to store and, when retrieved from the internet, can take a considerable amount of time to download. The
More informationA Divide-and-Conquer Algorithm for Functions of Triangular Matrices
A Divide-and-Conquer Algorithm for Functions of Triangular Matrices Ç. K. Koç Electrical & Computer Engineering Oregon State University Corvallis, Oregon 97331 Technical Report, June 1996 Abstract We propose
More informationMod 2 linear algebra and tabulation of rational eigenforms
Mod 2 linear algebra and tabulation of rational eigenforms Kiran S. Kedlaya Department of Mathematics, University of California, San Diego kedlaya@ucsd.edu http://kskedlaya.org/slides/ (see also this SageMathCloud
More informationSEQUENCES AND SERIES
A sequence is an ordered list of numbers. SEQUENCES AND SERIES Note, in this context, ordered does not mean that the numbers in the list are increasing or decreasing. Instead it means that there is a first
More informationA field F is a set of numbers that includes the two numbers 0 and 1 and satisfies the properties:
Byte multiplication 1 Field arithmetic A field F is a set of numbers that includes the two numbers 0 and 1 and satisfies the properties: F is an abelian group under addition, meaning - F is closed under
More informationPUTTING FÜRER ALGORITHM INTO PRACTICE WITH THE BPAS LIBRARY. (Thesis format: Monograph) Linxiao Wang. Graduate Program in Computer Science
PUTTING FÜRER ALGORITHM INTO PRACTICE WITH THE BPAS LIBRARY. (Thesis format: Monograph) by Linxiao Wang Graduate Program in Computer Science A thesis submitted in partial fulfillment of the requirements
More informationcompare to comparison and pointer based sorting, binary trees
Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:
More informationAMS 209, Fall 2015 Final Project Type A Numerical Linear Algebra: Gaussian Elimination with Pivoting for Solving Linear Systems
AMS 209, Fall 205 Final Project Type A Numerical Linear Algebra: Gaussian Elimination with Pivoting for Solving Linear Systems. Overview We are interested in solving a well-defined linear system given
More informationMcBits: Fast code-based cryptography
McBits: Fast code-based cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands Joint work with Daniel Bernstein, Tung Chou December 17, 2013 IMA International Conference on Cryptography
More information1 The Algebraic Normal Form
1 The Algebraic Normal Form Boolean maps can be expressed by polynomials this is the algebraic normal form (ANF). The degree as a polynomial is a first obvious measure of nonlinearity linear (or affine)
More informationIntroduction to Algorithms 6.046J/18.401J/SMA5503
Introduction to Algorithms 6.046J/8.40J/SMA5503 Lecture 3 Prof. Piotr Indyk The divide-and-conquer design paradigm. Divide the problem (instance) into subproblems. 2. Conquer the subproblems by solving
More informationOn queueing in coded networks queue size follows degrees of freedom
On queueing in coded networks queue size follows degrees of freedom Jay Kumar Sundararajan, Devavrat Shah, Muriel Médard Laboratory for Information and Decision Systems, Massachusetts Institute of Technology,
More informationCMP 334: Seventh Class
CMP 334: Seventh Class Performance HW 5 solution Averages and weighted averages (review) Amdahl's law Ripple-carry adder circuits Binary addition Half-adder circuits Full-adder circuits Subtraction, negative
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationLinear Methods (Math 211) - Lecture 2
Linear Methods (Math 211) - Lecture 2 David Roe September 11, 2013 Recall Last time: Linear Systems Matrices Geometric Perspective Parametric Form Today 1 Row Echelon Form 2 Rank 3 Gaussian Elimination
More informationNew attacks on Keccak-224 and Keccak-256
New attacks on Keccak-224 and Keccak-256 Itai Dinur 1, Orr Dunkelman 1,2 and Adi Shamir 1 1 Computer Science department, The Weizmann Institute, Rehovot, Israel 2 Computer Science Department, University
More informationCMPS 2200 Fall Divide-and-Conquer. Carola Wenk. Slides courtesy of Charles Leiserson with changes and additions by Carola Wenk
CMPS 2200 Fall 2017 Divide-and-Conquer Carola Wenk Slides courtesy of Charles Leiserson with changes and additions by Carola Wenk 1 The divide-and-conquer design paradigm 1. Divide the problem (instance)
More informationData Structures and Algorithms
Data Structures and Algorithms Spring 2017-2018 Outline 1 Sorting Algorithms (contd.) Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Analysis of Quicksort Time to sort array of length
More informationThe Design, Implementation, and Evaluation of a Symmetric Banded Linear Solver for Distributed-Memory Parallel Computers
The Design, Implementation, and Evaluation of a Symmetric Banded Linear Solver for Distributed-Memory Parallel Computers ANSHUL GUPTA and FRED G. GUSTAVSON IBM T. J. Watson Research Center MAHESH JOSHI
More informationAlgorithms (II) Yu Yu. Shanghai Jiaotong University
Algorithms (II) Yu Yu Shanghai Jiaotong University Chapter 1. Algorithms with Numbers Two seemingly similar problems Factoring: Given a number N, express it as a product of its prime factors. Primality:
More informationB. Cyclic Codes. Primitive polynomials are the generator polynomials of cyclic codes.
B. Cyclic Codes A cyclic code is a linear block code with the further property that a shift of a codeword results in another codeword. These are based on polynomials whose elements are coefficients from
More informationB-Spline Interpolation on Lattices
B-Spline Interpolation on Lattices David Eberly, Geometric Tools, Redmond WA 98052 https://www.geometrictools.com/ This work is licensed under the Creative Commons Attribution 4.0 International License.
More informationCPSC 518 Introduction to Computer Algebra Asymptotically Fast Integer Multiplication
CPSC 518 Introduction to Computer Algebra Asymptotically Fast Integer Multiplication 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform polynomial multiplication
More informationCPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication
CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication March, 2006 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform
More informationCSCI Honor seminar in algorithms Homework 2 Solution
CSCI 493.55 Honor seminar in algorithms Homework 2 Solution Saad Mneimneh Visiting Professor Hunter College of CUNY Problem 1: Rabin-Karp string matching Consider a binary string s of length n and another
More informationCSE 421 Algorithms: Divide and Conquer
CSE 42 Algorithms: Divide and Conquer Larry Ruzzo Thanks to Richard Anderson, Paul Beame, Kevin Wayne for some slides Outline: General Idea algorithm design paradigms: divide and conquer Review of Merge
More informationCSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Catie Baker Spring 2015
CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis Catie Baker Spring 2015 Today Registration should be done. Homework 1 due 11:59pm next Wednesday, April 8 th. Review math
More informationDense LU factorization and its error analysis
Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,
More information