Generalizing of a High Performance Parallel Strassen Implementation on Distributed Memory MIMD Architectures

Duc Kien Nguyen (1), Ivan Lavallee (2), Marc Bui (2)
(1) CHArt - Ecole Pratique des Hautes Etudes & Université Paris 8, France. Kien.Duc-Nguyen@univ-paris8.fr
(2) LaISC - Ecole Pratique des Hautes Etudes, France. Ivan.Lavallee@ephe.sorbonne.fr, Marc.Bui@ephe.sorbonne.fr

Abstract: Strassen's algorithm to multiply two $n \times n$ matrices reduces the asymptotic operation count from $O(n^3)$ for the traditional algorithm to $O(n^{2.81})$, so designing efficient parallelizations of this algorithm becomes essential. In this paper, we present our generalization of a parallel Strassen implementation that obtained very good performance on an Intel Paragon: 20% faster for $n \approx 1000$ and more than 100% faster for $n \approx 5000$ in comparison to the parallel traditional algorithms (such as Fox and Cannon). Our method can be applied to all matrix multiplication algorithms on distributed memory computers that use Strassen's algorithm at the system level, and hence gives us the flexibility to find better parallel implementations of Strassen's algorithm.

1 Introduction

Matrix multiplication (MM) is one of the most fundamental operations in linear algebra and serves as the main building block of many different algorithms, including the solution of systems of linear equations, matrix inversion, evaluation of the matrix determinant, and the transitive closure of a graph. In several cases the asymptotic complexity of these algorithms depends directly on the complexity of matrix multiplication, which motivates the study of possibilities to speed up matrix multiplication. Also, the inclusion of matrix multiplication in many benchmarks points at its role as a determining factor for the performance of high speed computations.

Strassen was the first to introduce an algorithm better than the traditional one [Str69], requiring $O(N^{\log_2 7})$ operations instead of $O(N^3)$. Winograd's variant [Win71] of Strassen's algorithm has the same exponent but a slightly lower constant, as the number of additions/subtractions is reduced from 18 to 15. The complexity record, due to Coppersmith and Winograd, is $O(N^{2.376})$, obtained by arithmetic aggregation [CW90]. However, only Winograd's algorithm and Strassen's algorithm offer better performance than the traditional algorithm for matrices of practical size, say, less than $10^{20}$ [LPS92].

The full potential of these algorithms can be realized only on large matrices, which require large machines such as parallel computers. Thus, designing efficient parallel implementations of these algorithms becomes essential.

This research was started when a paper by Chung-Chiang Chou, Yuefan Deng, Gang Li, and Yuan Wang [CDLW95] on parallelizing Strassen's algorithm came to our attention. Their implementation already obtained good performance: in comparison to the parallel traditional algorithms (such as Fox and Cannon) on an Intel Paragon, it is 20% faster for $n \approx 1000$ and more than 100% faster for $n \approx 5000$. The principle of this implementation is to parallelize Strassen's algorithm at the system level, i.e. to stop at recursion level $r$ of the execution tree, the products of sub-matrices then being computed locally by the processors. The most significant point here is to determine the sub-matrices obtained after recursively applying Strassen's formula $r$ times (these sub-matrices correspond to the nodes at level $r$ in the execution tree of Strassen's algorithm), and then to recover the result matrix from these sub-matrices (corresponding to backtracking the execution tree). This problem is simple to solve on a sequential machine, but much harder on a parallel machine. For a fixed value of $r$ it can be done by hand, as in [CDLW95], [LD95], and [GSv96] ($r = 1, 2$), but a solution for the general case has not been found.

In this paper, we present our method to determine all the nodes at an unspecified level $r$ in the execution tree of Strassen's algorithm, and we exhibit the expression relating the result matrix to the sub-matrices at recursion level $r$; this expression allows us to compute the result matrix directly from the sub-matrix products calculated by the parallel matrix multiplication algorithms at the bottom level. By combining this result with the matrix multiplication algorithms at the bottom level, we obtain a generalization of the high performance parallel Strassen implementation of [CDLW95]. It can be applied to all matrix multiplication algorithms on distributed memory computers that use Strassen's algorithm at the system level; moreover, the running time of these algorithms decreases as the recursion level increases, so this general solution gives us the flexibility to find better implementations (each corresponding to a particular value of the recursion level and a particular matrix multiplication algorithm at the bottom level).

2 Background

2.1 Strassen's Algorithm

We start by considering the formation of the matrix product $Q = XY$, where $Q \in \mathbb{R}^{m \times n}$, $X \in \mathbb{R}^{m \times k}$, and $Y \in \mathbb{R}^{k \times n}$. We assume that $m$, $n$, and $k$ are all even integers. Partition

$$X = \begin{pmatrix} X_{00} & X_{01} \\ X_{10} & X_{11} \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_{00} & Y_{01} \\ Y_{10} & Y_{11} \end{pmatrix}, \qquad Q = \begin{pmatrix} Q_{00} & Q_{01} \\ Q_{10} & Q_{11} \end{pmatrix},$$

where $Q_{ij} \in \mathbb{R}^{\frac{m}{2} \times \frac{n}{2}}$, $X_{ij} \in \mathbb{R}^{\frac{m}{2} \times \frac{k}{2}}$, and $Y_{ij} \in \mathbb{R}^{\frac{k}{2} \times \frac{n}{2}}$. It can be shown [Win71, GL89] that the following computations compute $Q = XY$:

$$\begin{aligned}
M_0 &= (X_{00} + X_{11})(Y_{00} + Y_{11}) \\
M_1 &= (X_{10} + X_{11})\,Y_{00} \\
M_2 &= X_{00}\,(Y_{01} - Y_{11}) \\
M_3 &= X_{11}\,(-Y_{00} + Y_{10}) \\
M_4 &= (X_{00} + X_{01})\,Y_{11} \\
M_5 &= (X_{10} - X_{00})(Y_{00} + Y_{01}) \\
M_6 &= (X_{01} - X_{11})(Y_{10} + Y_{11}) \\
Q_{00} &= M_0 + M_3 - M_4 + M_6 \\
Q_{01} &= M_2 + M_4 \\
Q_{10} &= M_1 + M_3 \\
Q_{11} &= M_0 + M_2 - M_1 + M_5
\end{aligned} \tag{1}$$

Strassen's algorithm performs the above computation recursively until one of the dimensions of the matrices is 1.
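For concreteness, the recursion in (1) can be sketched in a few lines of NumPy. The function below is our illustration, not code from the paper; it assumes square matrices whose order is a power of two and falls back to the ordinary product below a cutoff, as is usual in practice.

```python
import numpy as np

def strassen(X, Y, cutoff=64):
    """One step of formula (1), applied recursively.

    Assumes X, Y are n x n with n a power of two; below `cutoff`
    the ordinary O(n^3) product is used.
    """
    n = X.shape[0]
    if n <= cutoff:
        return X @ Y
    h = n // 2
    X00, X01, X10, X11 = X[:h, :h], X[:h, h:], X[h:, :h], X[h:, h:]
    Y00, Y01, Y10, Y11 = Y[:h, :h], Y[:h, h:], Y[h:, :h], Y[h:, h:]
    M0 = strassen(X00 + X11, Y00 + Y11, cutoff)
    M1 = strassen(X10 + X11, Y00, cutoff)
    M2 = strassen(X00, Y01 - Y11, cutoff)
    M3 = strassen(X11, Y10 - Y00, cutoff)
    M4 = strassen(X00 + X01, Y11, cutoff)
    M5 = strassen(X10 - X00, Y00 + Y01, cutoff)
    M6 = strassen(X01 - X11, Y10 + Y11, cutoff)
    Q = np.empty_like(X)
    Q[:h, :h] = M0 + M3 - M4 + M6   # Q00
    Q[:h, h:] = M2 + M4             # Q01
    Q[h:, :h] = M1 + M3             # Q10
    Q[h:, h:] = M0 + M2 - M1 + M5   # Q11
    return Q

X, Y = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(strassen(X, Y), X @ Y)
```

The seven recursive calls replace the eight of the classical block product, which is where the $O(n^{\log_2 7})$ operation count comes from.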

2.2 A High Performance Parallel Strassen Implementation

In this section we recall the principle of the high performance parallel Strassen implementation presented in [CDLW95], which is the foundation of our generalization. First, decompose the matrix $X$ into a $2 \times 2$ grid of sub-matrices $X_{ij}$, $i, j = 0, 1$. Second, decompose each of these four sub-matrices further into $2 \times 2$ blocks, i.e. into a $4 \times 4$ grid of sub-matrices $x_{ij}$, $i, j = 0, 1, 2, 3$:

$$X = \begin{pmatrix} X_{00} & X_{01} \\ X_{10} & X_{11} \end{pmatrix} = \begin{pmatrix} x_{00} & x_{01} & x_{02} & x_{03} \\ x_{10} & x_{11} & x_{12} & x_{13} \\ x_{20} & x_{21} & x_{22} & x_{23} \\ x_{30} & x_{31} & x_{32} & x_{33} \end{pmatrix}$$

Similarly, perform the same decomposition on the matrix $Y$:

$$Y = \begin{pmatrix} Y_{00} & Y_{01} \\ Y_{10} & Y_{11} \end{pmatrix} = \begin{pmatrix} y_{00} & y_{01} & y_{02} & y_{03} \\ y_{10} & y_{11} & y_{12} & y_{13} \\ y_{20} & y_{21} & y_{22} & y_{23} \\ y_{30} & y_{31} & y_{32} & y_{33} \end{pmatrix}$$

Then, applying Strassen's formula to multiply the matrices $X$ and $Y$ gives the seven matrix multiplication expressions

$$\begin{aligned}
M_0 &= (X_{00} + X_{11})(Y_{00} + Y_{11}) \\
M_1 &= (X_{10} + X_{11})\,Y_{00} \\
M_2 &= X_{00}\,(Y_{01} - Y_{11}) \\
M_3 &= X_{11}\,(-Y_{00} + Y_{10}) \\
M_4 &= (X_{00} + X_{01})\,Y_{11} \\
M_5 &= (-X_{00} + X_{10})(Y_{00} + Y_{01}) \\
M_6 &= (X_{01} - X_{11})(Y_{10} + Y_{11})
\end{aligned}$$

Next, apply Strassen's formula to these seven expressions to obtain 49 matrix multiplication expressions on the sub-matrices $x$ and $y$. Taking $M_0$ as an example:

$$\begin{aligned}
M_{00} &= (x_{00} + x_{22} + x_{11} + x_{33})(y_{00} + y_{22} + y_{11} + y_{33}) \\
M_{01} &= (x_{10} + x_{32} + x_{11} + x_{33})(y_{00} + y_{22}) \\
M_{02} &= (x_{00} + x_{22})(y_{01} + y_{23} - y_{11} - y_{33}) \\
M_{03} &= (x_{11} + x_{33})(y_{10} + y_{32} - y_{00} - y_{22}) \\
M_{04} &= (x_{00} + x_{22} + x_{01} + x_{23})(y_{11} + y_{33}) \\
M_{05} &= (x_{10} + x_{32} - x_{00} - x_{22})(y_{00} + y_{22} + y_{01} + y_{23}) \\
M_{06} &= (x_{01} + x_{23} - x_{11} - x_{33})(y_{10} + y_{32} + y_{11} + y_{33})
\end{aligned}$$

Similarly, each of the remaining six matrix multiplication expressions $M_i$, $i = 1, 2, \dots, 6$, can be expanded into its own group of seven matrix multiplications in terms of $x$ and $y$:

$$\begin{aligned}
M_{10} &= (x_{20} + x_{22} + x_{31} + x_{33})(y_{00} + y_{11}) \\
M_{11} &= (x_{30} + x_{32} + x_{31} + x_{33})\,y_{00} \\
M_{12} &= (x_{20} + x_{22})(y_{01} - y_{11}) \\
M_{13} &= (x_{31} + x_{33})(y_{10} - y_{00}) \\
M_{14} &= (x_{20} + x_{22} + x_{21} + x_{23})\,y_{11} \\
M_{15} &= (x_{30} + x_{32} - x_{20} - x_{22})(y_{00} + y_{01}) \\
M_{16} &= (x_{21} + x_{23} - x_{31} - x_{33})(y_{10} + y_{11})
\end{aligned}$$

$$\begin{aligned}
M_{20} &= (x_{00} + x_{11})(y_{02} - y_{22} + y_{13} - y_{33}) \\
M_{21} &= (x_{10} + x_{11})(y_{02} - y_{22}) \\
M_{22} &= x_{00}\,(y_{03} - y_{23} - y_{13} + y_{33}) \\
M_{23} &= x_{11}\,(y_{12} - y_{32} - y_{02} + y_{22}) \\
M_{24} &= (x_{00} + x_{01})(y_{13} - y_{33}) \\
M_{25} &= (x_{10} - x_{00})(y_{02} - y_{22} + y_{03} - y_{23}) \\
M_{26} &= (x_{01} - x_{11})(y_{12} - y_{32} + y_{13} - y_{33})
\end{aligned}$$

$$\begin{aligned}
M_{30} &= (x_{22} + x_{33})(y_{20} - y_{00} + y_{31} - y_{11}) \\
M_{31} &= (x_{32} + x_{33})(y_{20} - y_{00}) \\
M_{32} &= x_{22}\,(y_{21} - y_{01} - y_{31} + y_{11}) \\
M_{33} &= x_{33}\,(y_{30} - y_{10} - y_{20} + y_{00}) \\
M_{34} &= (x_{22} + x_{23})(y_{31} - y_{11}) \\
M_{35} &= (x_{32} - x_{22})(y_{20} - y_{00} + y_{21} - y_{01}) \\
M_{36} &= (x_{23} - x_{33})(y_{30} - y_{10} + y_{31} - y_{11})
\end{aligned}$$

$$\begin{aligned}
M_{40} &= (x_{00} + x_{02} + x_{11} + x_{13})(y_{22} + y_{33}) \\
M_{41} &= (x_{10} + x_{12} + x_{11} + x_{13})\,y_{22} \\
M_{42} &= (x_{00} + x_{02})(y_{23} - y_{33}) \\
M_{43} &= (x_{11} + x_{13})(y_{32} - y_{22}) \\
M_{44} &= (x_{00} + x_{02} + x_{01} + x_{03})\,y_{33} \\
M_{45} &= (x_{10} + x_{12} - x_{00} - x_{02})(y_{22} + y_{23}) \\
M_{46} &= (x_{01} + x_{03} - x_{11} - x_{13})(y_{32} + y_{33})
\end{aligned}$$

$$\begin{aligned}
M_{50} &= (x_{20} - x_{00} + x_{31} - x_{11})(y_{00} + y_{02} + y_{11} + y_{13}) \\
M_{51} &= (x_{30} - x_{10} + x_{31} - x_{11})(y_{00} + y_{02}) \\
M_{52} &= (x_{20} - x_{00})(y_{01} + y_{03} - y_{11} - y_{13}) \\
M_{53} &= (x_{31} - x_{11})(y_{10} + y_{12} - y_{00} - y_{02}) \\
M_{54} &= (x_{20} - x_{00} + x_{21} - x_{01})(y_{11} + y_{13}) \\
M_{55} &= (x_{30} - x_{10} - x_{20} + x_{00})(y_{00} + y_{02} + y_{01} + y_{03}) \\
M_{56} &= (x_{21} - x_{01} - x_{31} + x_{11})(y_{10} + y_{12} + y_{11} + y_{13})
\end{aligned}$$

$$\begin{aligned}
M_{60} &= (x_{02} - x_{22} + x_{13} - x_{33})(y_{20} + y_{22} + y_{31} + y_{33}) \\
M_{61} &= (x_{12} - x_{32} + x_{13} - x_{33})(y_{20} + y_{22}) \\
M_{62} &= (x_{02} - x_{22})(y_{21} + y_{23} - y_{31} - y_{33}) \\
M_{63} &= (x_{13} - x_{33})(y_{30} + y_{32} - y_{20} - y_{22}) \\
M_{64} &= (x_{02} - x_{22} + x_{03} - x_{23})(y_{31} + y_{33}) \\
M_{65} &= (x_{12} - x_{32} - x_{02} + x_{22})(y_{20} + y_{22} + y_{21} + y_{23}) \\
M_{66} &= (x_{03} - x_{23} - x_{13} + x_{33})(y_{30} + y_{32} + y_{31} + y_{33})
\end{aligned}$$
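These expansions are easy to check numerically. As one example, the sketch below (block size and names are ours) verifies that the seven products of the $M_3$ group recombine, with the same pattern as (1), into $M_3 = X_{11}(-Y_{00} + Y_{10})$.

```python
import numpy as np

b = 8                                   # size of one x_ij block (arbitrary)
x = {(i, j): np.random.rand(b, b) for i in range(4) for j in range(4)}
y = {(i, j): np.random.rand(b, b) for i in range(4) for j in range(4)}

# The seven level-2 products of the M_3 group, as expanded above.
M30 = (x[2,2] + x[3,3]) @ (y[2,0] - y[0,0] + y[3,1] - y[1,1])
M31 = (x[3,2] + x[3,3]) @ (y[2,0] - y[0,0])
M32 = x[2,2] @ (y[2,1] - y[0,1] - y[3,1] + y[1,1])
M33 = x[3,3] @ (y[3,0] - y[1,0] - y[2,0] + y[0,0])
M34 = (x[2,2] + x[2,3]) @ (y[3,1] - y[1,1])
M35 = (x[3,2] - x[2,2]) @ (y[2,0] - y[0,0] + y[2,1] - y[0,1])
M36 = (x[2,3] - x[3,3]) @ (y[3,0] - y[1,0] + y[3,1] - y[1,1])

# Recombine the quarters with the pattern of (1) ...
M3 = np.block([[M30 + M33 - M34 + M36, M32 + M34],
               [M31 + M33,             M30 + M32 - M31 + M35]])

# ... and compare against M_3 = X11 (Y10 - Y00) computed directly.
X11 = np.block([[x[2,2], x[2,3]], [x[3,2], x[3,3]]])
Y10mY00 = np.block([[y[2,0] - y[0,0], y[2,1] - y[0,1]],
                    [y[3,0] - y[1,0], y[3,1] - y[1,1]]])
assert np.allclose(M3, X11 @ Y10mY00)
```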

After these 49 matrix multiplications are finished, the resulting $M_{ij}$, $i, j = 0, 1, \dots, 6$, must be combined to form the final product matrix

$$Q = \begin{pmatrix} Q_{00} & Q_{01} \\ Q_{10} & Q_{11} \end{pmatrix} = \begin{pmatrix} q_{00} & q_{01} & q_{02} & q_{03} \\ q_{10} & q_{11} & q_{12} & q_{13} \\ q_{20} & q_{21} & q_{22} & q_{23} \\ q_{30} & q_{31} & q_{32} & q_{33} \end{pmatrix}.$$

First, define the signs

$$\delta_i = \begin{cases} -1 & \text{if } i = 4, \\ 1 & \text{otherwise,} \end{cases} \qquad \gamma_i = \begin{cases} -1 & \text{if } i = 1, \\ 1 & \text{otherwise,} \end{cases}$$

and the index sets $S_1 = \{0, 3, 4, 6\}$, $S_2 = \{1, 3\}$, $S_3 = \{2, 4\}$, $S_4 = \{0, 1, 2, 5\}$, read off the block-level combinations in (1) ($Q_{00} = M_0 + M_3 - M_4 + M_6$, and so on). The $4 \times 4$ grid of sub-matrices forming the product matrix $Q$ can then be written as:

$$\begin{aligned}
q_{00} &= \sum_{i \in S_1} \delta_i (M_{i0} + M_{i3} - M_{i4} + M_{i6}) & q_{01} &= \sum_{i \in S_1} \delta_i (M_{i2} + M_{i4}) \\
q_{02} &= \sum_{i \in S_3} (M_{i0} + M_{i3} - M_{i4} + M_{i6}) & q_{03} &= \sum_{i \in S_3} (M_{i2} + M_{i4}) \\
q_{10} &= \sum_{i \in S_1} \delta_i (M_{i1} + M_{i3}) & q_{11} &= \sum_{i \in S_1} \delta_i (M_{i0} + M_{i2} - M_{i1} + M_{i5}) \\
q_{12} &= \sum_{i \in S_3} (M_{i1} + M_{i3}) & q_{13} &= \sum_{i \in S_3} (M_{i0} + M_{i2} - M_{i1} + M_{i5}) \\
q_{20} &= \sum_{i \in S_2} (M_{i0} + M_{i3} - M_{i4} + M_{i6}) & q_{21} &= \sum_{i \in S_2} (M_{i2} + M_{i4}) \\
q_{22} &= \sum_{i \in S_4} \gamma_i (M_{i0} + M_{i3} - M_{i4} + M_{i6}) & q_{23} &= \sum_{i \in S_4} \gamma_i (M_{i2} + M_{i4}) \\
q_{30} &= \sum_{i \in S_2} (M_{i1} + M_{i3}) & q_{31} &= \sum_{i \in S_2} (M_{i0} + M_{i2} - M_{i1} + M_{i5}) \\
q_{32} &= \sum_{i \in S_4} \gamma_i (M_{i1} + M_{i3}) & q_{33} &= \sum_{i \in S_4} \gamma_i (M_{i0} + M_{i2} - M_{i1} + M_{i5})
\end{aligned}$$
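The full two-level scheme, including this combination step, fits in a short sketch. This is our own code under our own naming, not that of [CDLW95]: the 49 products are generated by applying the operand rule of (1) twice, multiplied (the part each processor would perform locally), and then combined with the index sets $S_1, \dots, S_4$ and the signs $\delta_i$, $\gamma_i$.

```python
import numpy as np

def quarters(A):
    h = A.shape[0] // 2
    return A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]

def strassen_pairs(X, Y):
    """The seven operand pairs of formula (1) for one split of X and Y."""
    X00, X01, X10, X11 = quarters(X)
    Y00, Y01, Y10, Y11 = quarters(Y)
    return [(X00 + X11, Y00 + Y11), (X10 + X11, Y00), (X00, Y01 - Y11),
            (X11, Y10 - Y00), (X00 + X01, Y11), (X10 - X00, Y00 + Y01),
            (X01 - X11, Y10 + Y11)]

n = 64
X, Y = np.random.rand(n, n), np.random.rand(n, n)

# Two levels of splitting give the 49 products M_ij.
M = {}
for i, (A, B) in enumerate(strassen_pairs(X, Y)):
    for j, (a, b) in enumerate(strassen_pairs(A, B)):
        M[i, j] = a @ b              # done locally by a processor

# Index sets and signs of the combination step above, keyed by quadrant.
S = {(0, 0): (0, 3, 4, 6), (0, 1): (2, 4), (1, 0): (1, 3), (1, 1): (0, 1, 2, 5)}
sign = {(0, 0): {4: -1}, (0, 1): {}, (1, 0): {}, (1, 1): {1: -1}}

def inner(i, pos):
    """Second-level Strassen combination of the M_ij at quadrant position pos."""
    return [M[i, 0] + M[i, 3] - M[i, 4] + M[i, 6], M[i, 2] + M[i, 4],
            M[i, 1] + M[i, 3], M[i, 0] + M[i, 2] - M[i, 1] + M[i, 5]][pos]

q = np.empty((4, 4), dtype=object)
for (I, J), SIJ in S.items():
    for pos, (di, dj) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
        q[2*I + di, 2*J + dj] = sum(sign[I, J].get(i, 1) * inner(i, pos)
                                    for i in SIJ)

Q = np.block([[q[i, j] for j in range(4)] for i in range(4)])
assert np.allclose(Q, X @ Y)
```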

Figure 1: Principle of the Strassen parallelization in [CDLW95].

As seen above, the combination step is not very simple even though there are only 49 matrix multiplications. It becomes much more complicated if we want to go further, when there are 343, 2401, or more matrix multiplications.

3 Generalizing of the Parallel Strassen Implementation

The principle of the method presented above is to parallelize Strassen's algorithm at the system level, i.e. to stop at recursion level $r$ of the execution tree, the products of sub-matrices then being computed locally by the processors. The most important point is to determine the sub-matrices obtained after applying Strassen's formula $r$ times, and to recover the result matrix from the products of these sub-matrices. In the preceding works, solutions are given for fixed values of $r$ ($= 1, 2$), but a solution for the general case has not been found. These are the problems with which we are confronted, and whose solution is presented in this section.

3.1 Recursion Removal in Fast Matrix Multiplication

We represent Strassen's formula (1) as

$$m_l = \Big(\sum_{i,j=0,1} x_{ij}\, SX(l,i,j)\Big)\Big(\sum_{i,j=0,1} y_{ij}\, SY(l,i,j)\Big), \quad l = 0, \dots, 6, \qquad
q_{ij} = \sum_{l=0}^{6} m_l\, SQ(l,i,j), \quad i, j = 0, 1. \tag{2}$$

Each of the $7^k$ products at recursion level $k$ can be represented in the same way: with $n = 2^k$,

$$m_l = \Big(\sum_{i,j=0}^{n-1} x_{ij}\, SX_k(l,i,j)\Big)\Big(\sum_{i,j=0}^{n-1} y_{ij}\, SY_k(l,i,j)\Big), \quad l = 0, \dots, 7^k - 1, \qquad
q_{ij} = \sum_{l=0}^{7^k-1} m_l\, SQ_k(l,i,j), \quad i, j = 0, \dots, n-1. \tag{3}$$

In fact, $SX = SX_1$, $SY = SY_1$, $SQ = SQ_1$. We now have to determine the arrays $SX_k$, $SY_k$, and $SQ_k$ from $SX_1$, $SY_1$, and $SQ_1$. To this end, we extend the definition of the tensor product in [KHJS90] to arrays of arbitrary dimension, as follows.

Definition. Let $A$ and $B$ be arrays of the same dimension $l$ and of sizes $m_1 \times m_2 \times \dots \times m_l$ and $n_1 \times n_2 \times \dots \times n_l$ respectively. The tensor product (TP) $P = A \otimes B$ is the array of the same dimension and of size $m_1 n_1 \times m_2 n_2 \times \dots \times m_l n_l$ obtained by replacing each element of $A$ with the product of that element and $B$:

$$P[i_1, i_2, \dots, i_l] = A[k_1, k_2, \dots, k_l] \cdot B[h_1, h_2, \dots, h_l], \qquad i_j = k_j n_j + h_j, \ 1 \le j \le l.$$

Let $P = \bigotimes_{i=1}^{n} A_i = (\dots((A_1 \otimes A_2) \otimes A_3) \otimes \dots) \otimes A_n$, where each $A_i$ is an array of dimension $l$ and size $m_{i1} \times m_{i2} \times \dots \times m_{il}$. The following theorem allows computing the elements of $P$ directly.

Theorem.

$$P[j_1, j_2, \dots, j_l] = \prod_{i=1}^{n} A_i[h_{i1}, h_{i2}, \dots, h_{il}], \qquad \text{where } j_k = \sum_{s=1}^{n} h_{sk} \prod_{r=s+1}^{n} m_{rk}. \tag{4}$$
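The definition translates directly into NumPy (the helper name is ours); for arrays of equal dimension it coincides with NumPy's N-dimensional Kronecker product np.kron, which can serve as a cross-check.

```python
import numpy as np

def tp(A, B):
    """Tensor product of same-dimension arrays per the definition:
    P[i1..il] = A[k1..kl] * B[h1..hl] with i_j = k_j * n_j + h_j."""
    assert A.ndim == B.ndim
    P = np.empty([m * n for m, n in zip(A.shape, B.shape)])
    for k in np.ndindex(*A.shape):
        for h in np.ndindex(*B.shape):
            idx = tuple(kj * nj + hj for kj, nj, hj in zip(k, B.shape, h))
            P[idx] = A[k] * B[h]
    return P

A, B = np.random.rand(2, 3, 4), np.random.rand(3, 2, 2)
assert np.allclose(tp(A, B), np.kron(A, B))
```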

Proof. We prove the theorem by induction on $n$. For $n = 1$ the claim is trivial, and for $n = 2$ it holds by the definition. Suppose it is true for $n - 1$; we show that it is true for $n$. We have

$$P_{n-1}[t_1, t_2, \dots, t_l] = \prod_{i=1}^{n-1} A_i[h_{i1}, h_{i2}, \dots, h_{il}], \qquad t_k = \sum_{s=1}^{n-1} h_{sk} \prod_{r=s+1}^{n-1} m_{rk}, \ 1 \le k \le l,$$

and $P_n = P_{n-1} \otimes A_n$. By the definition,

$$P_n[j_1, j_2, \dots, j_l] = P_{n-1}[p_1, p_2, \dots, p_l] \cdot A_n[h_{n1}, h_{n2}, \dots, h_{nl}] = \prod_{i=1}^{n} A_i[h_{i1}, h_{i2}, \dots, h_{il}],$$

where

$$j_k = p_k m_{nk} + h_{nk} = m_{nk} \sum_{s=1}^{n-1} h_{sk} \prod_{r=s+1}^{n-1} m_{rk} + h_{nk} = \sum_{s=1}^{n-1} h_{sk} \prod_{r=s+1}^{n} m_{rk} + h_{nk} = \sum_{s=1}^{n} h_{sk} \prod_{r=s+1}^{n} m_{rk}.$$

The theorem is proved.

In particular, if all the $A_i$ have the same size $m_1 \times m_2 \times \dots \times m_l$, we have

$$P[j_1, j_2, \dots, j_l] = \prod_{i=1}^{n} A_i[h_{i1}, h_{i2}, \dots, h_{il}], \qquad \text{where } j_k = \sum_{s=1}^{n} h_{sk}\, m_k^{\,n-s}.$$

Remark. $j_k = \sum_{s=1}^{n} h_{sk}\, m_k^{\,n-s}$ is the factorization of $j_k$ in base $m_k$. Writing $a = a_1 a_2 \dots a_{n\,(b)}$ for the factorization of $a$ in base $b$, we thus have $P[j_1, j_2, \dots, j_l] = \prod_{i=1}^{n} A_i[h_{i1}, h_{i2}, \dots, h_{il}]$ with $j_k = h_{1k} h_{2k} \dots h_{nk\,(m_k)}$.
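The remark can be exercised numerically: build $P$ as an iterated Kronecker product and read one element off the digit products. The sizes and digits below are arbitrary illustrative choices.

```python
import numpy as np
from functools import reduce

m1, m2, n = 2, 3, 3                      # common size m1 x m2, n factors
As = [np.random.rand(m1, m2) for _ in range(n)]
P = reduce(np.kron, As)                  # P = A_1 (x) A_2 (x) A_3

h = [(1, 2), (0, 1), (1, 0)]             # chosen digits, h[s] = (h_s1, h_s2)
j1 = sum(h[s][0] * m1 ** (n - 1 - s) for s in range(n))   # j_1 in base m1
j2 = sum(h[s][1] * m2 ** (n - 1 - s) for s in range(n))   # j_2 in base m2
assert np.isclose(P[j1, j2], np.prod([As[s][h[s]] for s in range(n)]))
```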

We now return to our algorithm. We have the following theorem.

Theorem.

$$SX_k = \bigotimes^k SX, \qquad SY_k = \bigotimes^k SY, \qquad SQ_k = \bigotimes^k SQ. \tag{5}$$

Proof. We prove the theorem by induction on $k$. It clearly holds for $k = 1$. Suppose it holds for $k - 1$. The execution tree of the algorithm is balanced, with depth $k$ and degree 7. Thanks to (3), at level $k - 1$ of the tree we have

$$M_l = \Big(\sum_{0 \le i,j \le 2^{k-1}-1} X_{k-1,ij}\, SX_{k-1}(l,i,j)\Big)\Big(\sum_{0 \le i,j \le 2^{k-1}-1} Y_{k-1,ij}\, SY_{k-1}(l,i,j)\Big), \qquad 0 \le l \le 7^{k-1}-1.$$

Then, thanks to (2), at level $k$ we have

$$M_l[l'] = \Big(\sum_{0 \le i,j \le 2^{k-1}-1} \ \sum_{0 \le i',j' \le 1} X_{k-1,ij}[i',j']\, SX_{k-1}(l,i,j)\, SX(l',i',j')\Big)\Big(\sum_{0 \le i,j \le 2^{k-1}-1} \ \sum_{0 \le i',j' \le 1} Y_{k-1,ij}[i',j']\, SY_{k-1}(l,i,j)\, SY(l',i',j')\Big), \tag{6}$$

for $0 \le l \le 7^{k-1}-1$ and $0 \le l' \le 6$, where $X_{k-1,ij}[i',j']$ and $Y_{k-1,ij}[i',j']$ are the sub-matrices obtained by dividing $X_{k-1,ij}$ and $Y_{k-1,ij}$ into four quarters ($i', j'$ indicating the quarter). We write $l, l'$ in base 7 and $i, j, i', j'$ in base 2, and remark that $X_{k-1,ij}[i',j'] = X_k[ii'_{(2)}, jj'_{(2)}]$. Then (6) becomes

$$M[ll'_{(7)}] = \Big(\sum_{0 \le ii'_{(2)},\, jj'_{(2)} \le 2^k-1} X_k[ii'_{(2)}, jj'_{(2)}]\, SX_{k-1}(l,i,j)\, SX(l',i',j')\Big)\Big(\sum_{0 \le ii'_{(2)},\, jj'_{(2)} \le 2^k-1} Y_k[ii'_{(2)}, jj'_{(2)}]\, SY_{k-1}(l,i,j)\, SY(l',i',j')\Big), \tag{7}$$

for $0 \le ll'_{(7)} \le 7^k - 1$. In addition, we have directly from (3):

$$M[ll'_{(7)}] = \Big(\sum_{0 \le ii'_{(2)},\, jj'_{(2)} \le 2^k-1} X_k[ii'_{(2)}, jj'_{(2)}]\, SX_k(ll'_{(7)}, ii'_{(2)}, jj'_{(2)})\Big)\Big(\sum_{0 \le ii'_{(2)},\, jj'_{(2)} \le 2^k-1} Y_k[ii'_{(2)}, jj'_{(2)}]\, SY_k(ll'_{(7)}, ii'_{(2)}, jj'_{(2)})\Big). \tag{8}$$

Comparing (7) and (8), we obtain

$$SX_k(ll'_{(7)}, ii'_{(2)}, jj'_{(2)}) = SX_{k-1}(l,i,j) \cdot SX(l',i',j'), \qquad SY_k(ll'_{(7)}, ii'_{(2)}, jj'_{(2)}) = SY_{k-1}(l,i,j) \cdot SY(l',i',j').$$

By the definition of the tensor product, this means

$$SX_k = SX_{k-1} \otimes SX = \bigotimes^k SX,$$

and similarly $SY_k = SY_{k-1} \otimes SY = \bigotimes^k SY$ and $SQ_k = SQ_{k-1} \otimes SQ = \bigotimes^k SQ$. The theorem is proved.

Thanks to Theorem 3.1 and Remark 3.1, we have

$$SX_k(l,i,j) = \prod_{r=1}^{k} SX(l_r, i_r, j_r), \qquad SY_k(l,i,j) = \prod_{r=1}^{k} SY(l_r, i_r, j_r), \qquad SQ_k(l,i,j) = \prod_{r=1}^{k} SQ(l_r, i_r, j_r), \tag{9}$$

where $l = l_1 l_2 \dots l_{k\,(7)}$, $i = i_1 i_2 \dots i_{k\,(2)}$, and $j = j_1 j_2 \dots j_{k\,(2)}$. Applying (9) in (3), we obtain the leaf nodes $m_l$ and all the elements of the result matrix.

3.2 Generalizing

Now we know how to parallelize Strassen's algorithm in the general case. First we stop at recursion level $r$; thanks to the expressions (9) and (3), we have all the corresponding sub-matrix products:

$$M_l = \Big(\sum_{i,j=0}^{2^r-1} X_{ij} \prod_{t=1}^{r} SX(l_t, i_t, j_t)\Big)\Big(\sum_{i,j=0}^{2^r-1} Y_{ij} \prod_{t=1}^{r} SY(l_t, i_t, j_t)\Big), \qquad l = 0, \dots, 7^r - 1, \tag{10}$$

with

$$X_{ij} = \begin{pmatrix} x_{i 2^{k-r},\, j 2^{k-r}} & \cdots & x_{i 2^{k-r},\, j 2^{k-r} + 2^{k-r} - 1} \\ \vdots & \ddots & \vdots \\ x_{i 2^{k-r} + 2^{k-r} - 1,\, j 2^{k-r}} & \cdots & x_{i 2^{k-r} + 2^{k-r} - 1,\, j 2^{k-r} + 2^{k-r} - 1} \end{pmatrix},$$

$$Y_{ij} = \begin{pmatrix} y_{i 2^{k-r},\, j 2^{k-r}} & \cdots & y_{i 2^{k-r},\, j 2^{k-r} + 2^{k-r} - 1} \\ \vdots & \ddots & \vdots \\ y_{i 2^{k-r} + 2^{k-r} - 1,\, j 2^{k-r}} & \cdots & y_{i 2^{k-r} + 2^{k-r} - 1,\, j 2^{k-r} + 2^{k-r} - 1} \end{pmatrix}, \qquad i, j = 0, \dots, 2^r - 1.$$

The products $M_l$ of the sub-matrices $\sum_{i,j=0}^{2^r-1} X_{ij} \prod_{t=1}^{r} SX(l_t, i_t, j_t)$ and $\sum_{i,j=0}^{2^r-1} Y_{ij} \prod_{t=1}^{r} SY(l_t, i_t, j_t)$ are computed locally on each processor by sequential matrix multiplication algorithms. Finally, thanks to (9) and (3), we obtain the sub-matrix elements of the result matrix directly, by matrix additions, instead of manually backtracking the recursion tree to compute the root as in [LD95], [CDLW95], and [GSv96]:

$$Q_{ij} = \sum_{l=0}^{7^r-1} M_l\, SQ_r(l,i,j) = \sum_{l=0}^{7^r-1} M_l \prod_{t=1}^{r} SQ(l_t, i_t, j_t). \tag{11}$$
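Putting (5), (9), (10), and (11) together gives a recipe that is easy to express in code. The sketch below is our serial illustration (in the distributed setting the loop over $l$ is what gets spread across the processors): the coefficient arrays $SX$, $SY$, $SQ$ encode formula (1), $SX_r = \bigotimes^r SX$ and its analogues are formed with np.kron, the $7^r$ leaf operands of (10) are built and multiplied, and the result is combined via (11).

```python
import numpy as np
from functools import reduce

# Coefficient arrays of formula (1): entry (l, i, j) is the coefficient of
# block (i, j) in the l-th operand (SX, SY) or of M_l in q_ij (SQ).
SX = np.array([[[1, 0], [0, 1]], [[0, 0], [1, 1]], [[1, 0], [0, 0]],
               [[0, 0], [0, 1]], [[1, 1], [0, 0]], [[-1, 0], [1, 0]],
               [[0, 1], [0, -1]]], dtype=float)
SY = np.array([[[1, 0], [0, 1]], [[1, 0], [0, 0]], [[0, 1], [0, -1]],
               [[-1, 0], [1, 0]], [[0, 0], [0, 1]], [[1, 1], [0, 0]],
               [[0, 0], [1, 1]]], dtype=float)
SQ = np.array([[[1, 0], [0, 1]], [[0, 0], [1, -1]], [[0, 1], [0, 1]],
               [[1, 0], [1, 0]], [[-1, 1], [0, 0]], [[0, 0], [0, 1]],
               [[1, 0], [0, 0]]], dtype=float)

def strassen_level_r(X, Y, r):
    """Q = XY with Strassen's recursion stopped at level r, per (10)-(11)."""
    SXr = reduce(np.kron, [SX] * r)      # SX_r = (x)^r SX, by Theorem (5)
    SYr = reduce(np.kron, [SY] * r)
    SQr = reduce(np.kron, [SQ] * r)
    n = X.shape[0]; B = 2 ** r; b = n // B
    # Split X and Y into a B x B grid of b x b blocks.
    Xb = X.reshape(B, b, B, b).transpose(0, 2, 1, 3)
    Yb = Y.reshape(B, b, B, b).transpose(0, 2, 1, 3)
    Q = np.zeros((B, B, b, b))
    for l in range(7 ** r):
        A = np.einsum('ij,ijkl->kl', SXr[l], Xb)   # sum_ij X_ij SX_r(l,i,j)
        C = np.einsum('ij,ijkl->kl', SYr[l], Yb)
        M = A @ C                                  # leaf product: done locally
        Q += SQr[l][:, :, None, None] * M          # combination step (11)
    return Q.transpose(0, 2, 1, 3).reshape(n, n)

X, Y = np.random.rand(64, 64), np.random.rand(64, 64)
for r in (1, 2, 3):
    assert np.allclose(strassen_level_r(X, Y, r), X @ Y)
```

Note that after the $7^r$ local products only matrix additions are needed, exactly as (11) states; no backtracking of the recursion tree is performed.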

4 Conclusion

We have presented a general scalable parallelization for all the matrix multiplication algorithms on distributed memory computers that use Strassen's algorithm at the inter-processor level. The running time of these algorithms decreases as the recursion level increases, so this general solution gives us the flexibility to find better algorithms (each corresponding to a particular value of the recursion level and a particular matrix multiplication algorithm at the bottom level). From a different point of view, we have generalized Strassen's formula to the case where the matrices are divided into $2^k$ parts (the case $k = 2$ gives the original formulas), which opens a whole new direction for parallelizing Strassen's algorithm. In addition, we are applying these ideas to all fast matrix multiplication algorithms.

References

[CDLW95] Chung-Chiang Chou, Yuefan Deng, Gang Li, and Yuan Wang. Parallelizing Strassen's Method for Matrix Multiplication on Distributed Memory MIMD Architectures. Computers and Mathematics with Applications, 30(2):4-9, 1995.

[CW90] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3):251-280, 1990.

[GL89] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 2nd edition, 1989.

[GSv96] Brian Grayson, Ajay Shah, and Robert van de Geijn. A High Performance Parallel Strassen Implementation. Parallel Processing Letters, 6(1):3-12, 1996.

[KHJS90] B. Kumar, Chua-Huang Huang, Rodney W. Johnson, and P. Sadayappan. A tensor product formulation of Strassen's matrix multiplication algorithm. Applied Mathematics Letters, 3(3):67-71, 1990.

[LD95] Qingshan Luo and John B. Drake. A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers. In Proceedings of the 1995 ACM Symposium on Applied Computing, pages 221-226, Nashville, Tennessee, United States, 1995. ACM Press.

[LPS92] J. Laderman, V. Y. Pan, and H. X. Sha. On Practical Algorithms for Accelerated Matrix Multiplication. Linear Algebra and Its Applications, 162:557-588, 1992.

[Str69] Volker Strassen. Gaussian Elimination is not Optimal. Numerische Mathematik, 13:354-356, 1969.

[Win71] Shmuel Winograd. On multiplication of 2 x 2 matrices. Linear Algebra and its Applications, 4:381-388, 1971.