Practicality of Large Scale Fast Matrix Multiplication


1 Practicality of Large Scale Fast Matrix Multiplication.
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz. UC Berkeley. IWASEP 9, June 5, 2012, Napa Valley, CA.
Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. Research is also supported by DOE grants DE-SC, DE-SC, and DE-AC02-05CH11231; the Sofja Kovalevskaja programme of the Alexander von Humboldt Foundation; and by the National Science Foundation under agreement DMS.

2 Introduction. Classical matrix multiplication is nearly ubiquitous, even though asymptotically faster algorithms have been known since 1969. Concerns about fast matrix multiplication: practical speed and stability. This talk addresses both concerns.

3 Outline. Strassen's algorithm. New parallel algorithm: communication optimal, faster in practice. Stability of Strassen: normwise error bound; diagonal scaling and improved error bounds; stability experiments.

4 Recall: Strassen's fast matrix multiplication. Strassen's original algorithm uses 7 multiplies and 18 adds for n = 2; it is applied recursively (blockwise).
M1 = (A11 + A22)(B11 + B22)     C11 = M1 + M4 - M5 + M7
M2 = (A21 + A22) B11            C12 = M3 + M5
M3 = A11 (B12 - B22)            C21 = M2 + M4
M4 = A22 (B21 - B11)            C22 = M1 - M2 + M3 + M6
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)
T(n) = 7 T(n/2) + O(n^2), so T(n) = Θ(n^(log2 7)).
Improved by Winograd to 15 additions.
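As a concrete illustration of the recursion above, here is a minimal NumPy sketch (not the CAPS implementation discussed later in the talk); the power-of-two size assumption and the cutoff parameter are assumptions made for the example.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Recursive Strassen multiply; assumes square matrices whose dimension
    is a power of 2.  Falls back to the classical product below `cutoff`
    (a tuning parameter, not from the talk)."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    C = np.empty_like(M1, shape=(n, n))
    C[:h, :h] = M1 + M4 - M5 + M7   # C11
    C[:h, h:] = M3 + M5             # C12
    C[h:, :h] = M2 + M4             # C21
    C[h:, h:] = M1 - M2 + M3 + M6   # C22
    return C
```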

5 Communication costs. Two kinds of costs: arithmetic (flops) and communication, i.e., moving data between levels of a memory hierarchy (sequential case) or over a network connecting processors (parallel case). Communication is becoming more expensive relative to computation.

6 Communication lower bounds for matrix multiplication (bandwidth cost W; M is the local memory size, P the number of processors):
Strassen, sequential: W = Ω( (n/√M)^(log2 7) · M )
Strassen, parallel: W = Ω( (n/√M)^(log2 7) · M/P ) and W = Ω( n^2 / P^(2/log2 7) )
Classical (cubic), sequential: W = Ω( (n/√M)^(log2 8) · M )
Classical (cubic), parallel: W = Ω( (n/√M)^(log2 8) · M/P ) and W = Ω( n^2 / P^(2/log2 8) )

7 Communication lower bounds for matrix multiplication. Algorithms attaining these bounds?

9 Lessons from lower bounds. Don't use a classical algorithm for the communication: Strassen can communicate less than classical. Make local multiplies as large as possible. Use all available memory, up to O(n^2 / P^(2/log2 7)); the communication bound decreases with increased memory. Send memory-sized messages to minimize latency.

10 Main idea of the CAPS algorithm. At each level of the recursion tree, choose either breadth-first or depth-first traversal of the recursion tree.
Breadth-First-Search (BFS) step: runs all 7 multiplies in parallel, each using P/7 processors; requires 7/4 as much extra memory; requires communication, but an all-BFS traversal minimizes communication if memory permits.
Depth-First-Search (DFS) step: runs all 7 multiplies sequentially, each using all P processors; requires only 1/4 as much extra memory; no immediate communication, but increases bandwidth by a factor of 7/4 and latency by a factor of 7.

11 Main idea of the CAPS algorithm. At each level of the recursion tree, choose either a breadth-first (BFS) or a depth-first (DFS) traversal step:
CAPS: if enough memory and P >= 7 then BFS step else DFS step end if
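The following is a purely illustrative, sequential sketch of how that rule plays out down the recursion tree; the function name, the memory estimate, and the assumption that P is a power of 7 are ours, not the talk's MPI implementation.

```python
def caps_schedule(n, P, M):
    """Illustrative (non-distributed) sketch of how CAPS interleaves BFS and
    DFS steps.  Assumes P is a power of 7 and n a power of 2.
    n: matrix dimension, P: processors, M: words of memory per processor."""
    steps = []
    while P > 1:
        extra = 3 * (n * n) / P            # crude per-processor footprint
        # BFS: all 7 subproblems at once, on P/7 processors each; needs
        #      roughly 7/4 extra memory but minimizes communication.
        # DFS: the 7 subproblems one after another on all P processors;
        #      needs only ~1/4 extra memory but communicates more.
        if 7 * extra / 4 <= M:
            steps.append("BFS")
            P //= 7
        else:
            steps.append("DFS")
        n //= 2
    return steps

# e.g. caps_schedule(n=2**17, P=7**3, M=2**27) yields a prefix of DFS steps
# while memory is tight, followed by BFS steps, as described above.
```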

12 Asymptotic cost analysis (ω0 = log2 7; l is the number of Strassen steps taken by Strassen-2D):
Strassen-based algorithms:
  Lower bound:  flops n^ω0 / P;             bandwidth max{ n^ω0 / (P M^(ω0/2 - 1)), n^2 / P^(2/ω0) }
  2D-Strassen:  flops n^ω0 / P^((ω0-1)/2);  bandwidth n^2 / P^(1/2)
  Strassen-2D:  flops (7/8)^l n^3 / P;      bandwidth (7/4)^l n^2 / P^(1/2)
  CAPS:         flops n^ω0 / P;             bandwidth max{ n^ω0 / (P M^(ω0/2 - 1)), n^2 / P^(2/ω0) }
Classical algorithms:
  Lower bound:  flops n^3 / P;              bandwidth max{ n^3 / (P M^(1/2)), n^2 / P^(2/3) }
  2D:           flops n^3 / P;              bandwidth n^2 / P^(1/2)
  2.5D:         flops n^3 / P;              bandwidth max{ n^3 / (P M^(1/2)), n^2 / P^(2/3) }

13 Performance of CAPS. Strong scaling on Franklin (Cray XT4): up to 184% faster than previous Strassen-based algorithms, and 51%-84% faster than the best classical algorithm.

14 CAPS summary. The CAPS matrix multiplication algorithm:
is communication optimal: it matches the communication lower bounds and moves asymptotically less data than all existing algorithms;
is faster, asymptotically and in practice: faster than any parallel classical algorithm can be, and faster than any parallel Strassen-based algorithm we are aware of;
applies to other fast matrix multiplication algorithms, but there might not be any other practical ones.

15 Stability. CAPS has the same stability properties as any other Strassen (Strassen-Winograd) algorithm: a weaker stability guarantee than the classical algorithm, but still normwise stable. This can be improved through diagonal scaling. The two best scaling schemes give incomparable bounds, but one can check which bound is better in O(n^2) time. The improved error bounds match those of matrix factorizations such as classical LU and QR.

16 Diagonal scaling. Outside scaling: D^A C D^B = (D^A A)(B D^B). Scale so each row of A and each column of B has unit norm. Explicitly:
Let D^A_ii = ||A(i,:)||^(-1) and D^B_jj = ||B(:,j)||^(-1).
Scale A' = D^A A and B' = B D^B.
Use Strassen for the product C' = A' B'.
Unscale: C = (D^A)^(-1) C' (D^B)^(-1).

17 Diagonal scaling (continued). Inside scaling: C = (AD)(D^(-1) B). Scale so each column of A has the same norm as the corresponding row of B. Explicitly:
Let D_ii = ( ||A(:,i)|| / ||B(i,:)|| )^(-1/2).
Scale A' = AD and B' = D^(-1) B.
Use Strassen for the product C = A' B'.
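A minimal NumPy sketch of the two scalings above, wrapped around an arbitrary multiplication routine; the choice of the 2-norm and the guard against zero rows or columns are assumptions, since the slides do not specify them.

```python
import numpy as np

def outside_scaling(A, B, multiply=np.matmul, norm=np.linalg.norm):
    """Outside scaling: D^A C D^B = (D^A A)(B D^B).  Rows of A and columns
    of B are scaled to unit norm, the multiply is applied, and the result
    is unscaled."""
    dA = 1.0 / np.maximum(norm(A, axis=1), np.finfo(float).tiny)
    dB = 1.0 / np.maximum(norm(B, axis=0), np.finfo(float).tiny)
    C_scaled = multiply(dA[:, None] * A, B * dB[None, :])
    return C_scaled / dA[:, None] / dB[None, :]

def inside_scaling(A, B, multiply=np.matmul, norm=np.linalg.norm):
    """Inside scaling: C = (A D)(D^{-1} B), with D chosen so that column k
    of A and row k of B have equal norms; the product is unchanged, so no
    unscaling is needed."""
    d = np.sqrt(np.maximum(norm(B, axis=1), np.finfo(float).tiny) /
                np.maximum(norm(A, axis=0), np.finfo(float).tiny))
    return multiply(A * d[None, :], B / d[:, None])

# e.g. outside_scaling(A, B, multiply=strassen) with the Strassen sketch
# from the earlier slide, or any other fast multiplication routine.
```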

18 Error bounds. Each variant satisfies an elementwise bound of the form |C_ij - Ĉ_ij| <= O(ε) f(n) × (quantity below):
Classical: ( |A| |B| )_ij
Outside-Inside: ||A(i,:)|| ||B(:,j)|| ||D^A A|| ||B D^B||
Inside-Outside: ||(AD)(i,:)|| ||(D^(-1) B)(:,j)||
Outside: ||A(i,:)|| ||B(:,j)||
Inside: ||A|| ||B||
No scaling: ||A|| ||B||
X ≼ Y means that bound X is stronger than bound Y.

19 Scaling example: easy case. [Plot: max_ij |Ĉ_ij - C_ij| / (|A| |B|)_ij versus the number of Strassen steps, n = 1024, for no scaling, outer, inner, outer-inner, and inner-outer scaling of a 2x2 example.]

20 Scaling example: needs outer scaling. [Same plot as slide 19 for a different 2x2 example.]

21 Scaling example: needs inner and outer. [Same plot for a different 2x2 example.]

22 Scaling example: inner-outer better than outer-inner. [Same plot for a different 2x2 example.]

23 Scaling example: outer-inner better than inner-outer. [Same plot for a different 2x2 example.]

24 Scaling example: problems scaling can't fix. [Same plot for a different 2x2 example.]

25 Error bounds (recap). The same bounds as on slide 18, repeated before the summary.

26 Stability summary. Scaling improves the error bound of Strassen to be comparable to many other algorithms, but it is still not as good as the classical algorithm. The same approach applies to other fast matrix multiplication algorithms. Inner-then-outer or outer-then-inner scaling are best; one can choose between them by evaluating their bounds. Open problem: simultaneously attain the inner and outer scaling bounds?

27 Practicality of Large Scale Fast Matrix Multiplication. Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz. Thank you! (Funding acknowledgments as on the title slide.)

28 References.
G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algorithm for Strassen's matrix multiplication. Technical Report, EECS, UC Berkeley, March 2012. To appear in SPAA 2012.
G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. Technical Report, EECS, UC Berkeley, March 2012. To appear in SPAA 2012.
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11), pages 1-12, New York, NY, USA, 2011. ACM.
B. Lipshitz, G. Ballard, O. Schwartz, and J. Demmel. Communication-avoiding parallel Strassen: Implementation and performance. Technical Report, EECS, UC Berkeley, May 2012. Submitted to SC 2012.

29 Extra slides: 1. Previous Parallel Strassen; 2. Data Layout; 3. Strassen-Winograd Algorithm; 4. Hardware scaling; 5. Time breakdown; 6. Small problems.

30-32 Previous parallel Strassen-based algorithms.
2D-Strassen [Luo, Drake 95]: run classical 2D inter-processor (same communication costs as classical 2D), then run Strassen locally. Can't use Strassen on the full matrix size.
Strassen-2D [Luo, Drake 95; Grayson, Shah, van de Geijn 95]: run Strassen inter-processor (this part can be done without communication), then run classical 2D. Communication costs grow exponentially with the number of Strassen steps.
Neither is communication optimal.

33 Data Layout. [Figure only.]

34 Strassen-Winograd algorithm. C = [[C11, C12], [C21, C22]] = A B = [[A11, A12], [A21, A22]] [[B11, B12], [B21, B22]], with Q_i = S_i T_i:
S1 = A11          T1 = B11
S2 = A12          T2 = B21
S3 = A21 + A22    T3 = B12 - B11
S4 = S3 - A11     T4 = B22 - T3
S5 = A11 - A21    T5 = B22 - B12
S6 = A12 - S4     T6 = B22
S7 = A22          T7 = T4 - B21
U1 = Q1 + Q4,  U2 = U1 + Q5,  U3 = U1 + Q3
C11 = Q1 + Q2,  C12 = U3 + Q6,  C21 = U2 - Q7,  C22 = U2 + Q3
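A non-recursive sketch of one Strassen-Winograd step as laid out above, useful as a sanity check against the classical product; the helper name and the single-level structure are ours, not the talk's implementation.

```python
import numpy as np

def strassen_winograd_step(A, B):
    """One step of the Strassen-Winograd recursion (7 block multiplies,
    15 block additions/subtractions); a real implementation would recurse
    on the Q_i products instead of calling @ on the blocks."""
    n = A.shape[0]
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    S3 = A21 + A22;  T3 = B12 - B11
    S4 = S3 - A11;   T4 = B22 - T3
    S5 = A11 - A21;  T5 = B22 - B12
    S6 = A12 - S4;   T7 = T4 - B21

    Q1 = A11 @ B11;  Q2 = A12 @ B21;  Q3 = S3 @ T3;  Q4 = S4 @ T4
    Q5 = S5 @ T5;    Q6 = S6 @ B22;   Q7 = A22 @ T7

    U1 = Q1 + Q4;  U2 = U1 + Q5;  U3 = U1 + Q3

    C = np.empty((n, n), dtype=np.result_type(A, B))
    C[:h, :h] = Q1 + Q2    # C11
    C[:h, h:] = U3 + Q6    # C12
    C[h:, :h] = U2 - Q7    # C21
    C[h:, h:] = U2 + Q3    # C22
    return C

# Sanity check against the classical product:
# A = np.random.rand(8, 8); B = np.random.rand(8, 8)
# assert np.allclose(strassen_winograd_step(A, B), A @ B)
```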

35 Implications for hardware scaling. Requirements so that dense matrix multiplication is computation bound:
            Bandwidth requirement    Latency requirement
  Classic:  γ M^(1/2) ≥ β            γ M^(3/2) ≥ α
  Strassen: γ M^(ω0/2 - 1) ≥ β       γ M^(ω0/2) ≥ α
Strassen performs fewer flops and less communication, but is more demanding on the hardware.
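A toy evaluator of these balance conditions for concrete machine parameters (the function and constant names are hypothetical, and the constants hidden in the asymptotic analysis are ignored):

```python
from math import log2

OMEGA0 = log2(7)

def compute_bound(gamma, beta, alpha, M):
    """Check the balance conditions on the slide above for a machine with
    time-per-flop gamma, time-per-word beta, time-per-message alpha, and
    M words of local memory.  Returns (bandwidth_ok, latency_ok) pairs."""
    return {
        "classic":  (gamma * M**0.5 >= beta,
                     gamma * M**1.5 >= alpha),
        "strassen": (gamma * M**(OMEGA0 / 2 - 1) >= beta,
                     gamma * M**(OMEGA0 / 2) >= alpha),
    }

# e.g. compute_bound(gamma=1e-10, beta=1e-9, alpha=1e-6, M=2**27)
# shows a machine that is compute bound for the classical algorithm but
# not (or only barely) for Strassen, illustrating the slide's point.
```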

36 CAPS time breakdown on Franklin. [Plot: wall time (seconds) versus P = 49, 343, 2401, broken down into local multiplications, communication, local additions, re-arrangement, and other, with a perfect-scaling reference line; the strong-scaling range is marked.]

37 Performance on small problems, n = 3136, on Franklin. [Plot: execution time (seconds) versus number of cores, 1e1 to 1e4.]

38 Sequential recursive Strassen is communication optimal. Run the Strassen algorithm recursively; when the blocks are small enough, they fit in local memory, so there is no further bandwidth cost:
W(n, M) = 7 W(n/2, M) + O(n^2)  if 3n^2 > M
W(n, M) = O(n^2)                otherwise
The solution is W(n, M) = O( n^ω0 / M^(ω0/2 - 1) ).
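A few lines suffice to evaluate the recurrence numerically and compare it against the closed form (all hidden constants set to 1, an assumption made only for illustration):

```python
from math import log2

def strassen_words(n, M):
    """Evaluate the bandwidth recurrence on the slide above:
    W(n, M) = 7 W(n/2, M) + n^2 while 3 n^2 > M, and n^2 once the
    problem fits in local memory (constants taken to be 1)."""
    if 3 * n * n <= M:
        return n * n
    return 7 * strassen_words(n // 2, M) + n * n

# The closed form predicts W = Theta(n^w0 / M^(w0/2 - 1)) with w0 = log2(7);
# e.g. compare strassen_words(2**14, 2**20) against
# (2**14)**log2(7) / (2**20)**(log2(7)/2 - 1) to see agreement up to a
# constant factor.
```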

