Practicality of Large Scale Fast Matrix Multiplication
1 Practicality of Large Scale Fast Matrix Multiplication
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz
UC Berkeley. IWASEP, June 5, 2012, Napa Valley, CA.
Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG ). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. Research is also supported by DOE grants DE-SC , DE-SC , and DE-AC02-05CH11231; the Sofja Kovalevskaja programme of the Alexander von Humboldt Foundation; and by the National Science Foundation under agreement DMS .
2 Introduction
Classical matrix multiplication is nearly ubiquitous, even though asymptotically faster algorithms have been known since 1969.
Concerns about fast matrix multiplication: practical speed and stability. This talk addresses both concerns.
3 Outline
- Strassen's algorithm
- New parallel algorithm: communication optimal, faster in practice
- Stability of Strassen: normwise error bound; diagonal scaling and improved error bounds; stability experiments
4 Recall: Strassen's fast matrix multiplication
Strassen's original algorithm uses 7 multiplies and 18 adds for n = 2. It is applied recursively (blockwise).
M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 − B22)
M4 = A22 (B21 − B11)
M5 = (A11 + A12) B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22)
C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
T(n) = 7 T(n/2) + O(n^2), so T(n) = Θ(n^{log2 7}).
Improved by Winograd to 15 additions.
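As a concrete illustration (not code from the talk), here is a minimal recursive Strassen in Python/NumPy; the function name, the power-of-two restriction, and the `cutoff` fallback to the classical product are assumptions of this sketch.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices A and B with Strassen's recursion.

    Assumes n is a power of 2; below `cutoff` it falls back to the
    classical product (both the name and the cutoff are illustrative).
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven recursive products from the slide.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Quick check against the classical product.
A = np.random.randn(256, 256); B = np.random.randn(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```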
5 Communication costs
Two kinds of costs: arithmetic (FLOPs) and communication, i.e. moving data between levels of a memory hierarchy (sequential case) or over a network connecting processors (parallel case).
Communication is becoming more expensive relative to computation.
6 Communication lower bounds for matrix multiplication
Algorithms attaining these bounds?
Strassen: sequential bandwidth Ω( (n/√M)^{log2 7} · M ); parallel bandwidth Ω( (n/√M)^{log2 7} · M/P ), and memory-independent Ω( n^2 / P^{2/log2 7} ).
Classic (cubic): sequential bandwidth Ω( (n/√M)^{log2 8} · M ); parallel bandwidth Ω( (n/√M)^{log2 8} · M/P ), and memory-independent Ω( n^2 / P^{2/log2 8} ).
9 Lessons from lower bounds
Don't use a classical algorithm for the communication: Strassen can communicate less than classical.
Make local multiplies as large as possible: use all available memory, up to O(n^2 / P^{2/log2 7}); the communication bound decreases with increased memory.
Send memory-sized messages to minimize latency.
10 Main Idea of CAPS algorithm
At each level of the recursion tree, choose either breadth-first or depth-first traversal of the recursion tree.
Breadth-First-Search (BFS): runs all 7 multiplies in parallel, each on P/7 processors; requires 7/4 as much extra memory; requires communication, but all-BFS minimizes communication if memory allows.
Depth-First-Search (DFS): runs all 7 multiplies sequentially, each on all P processors; requires only 1/4 as much extra memory; no immediate communication, but increases bandwidth by a factor of 7/4 and latency by a factor of 7.
11 Main Idea of CAPS algorithm
At each level of the recursion tree, choose either breadth-first or depth-first traversal; a toy code sketch of this rule follows.
CAPS:
  if enough memory and P ≥ 7 then
    BFS step
  else
    DFS step
  end if
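A toy Python sketch of this traversal rule, modeling only the control flow (no real interprocessor communication); the memory estimate and all names are illustrative assumptions, not the talk's implementation.

```python
def caps_step(n, P, M):
    """Toy sketch of the CAPS traversal choice (control flow only).

    n: current (sub)matrix dimension, P: processors available at this
    level, M: local memory in words.
    """
    if n <= 1 or P == 1:
        return ["local multiply"]
    # Rough extra-memory need of a BFS step: 7/4 of the current working
    # set (three n-by-n blocks), spread over the P processors.
    bfs_extra = (7.0 / 4.0) * (3 * n * n) / P
    if P >= 7 and bfs_extra <= M:
        # BFS: all 7 subproblems proceed in parallel on P/7 processors each.
        return ["BFS"] + caps_step(n // 2, P // 7, M)
    # DFS: the 7 subproblems run one after another on all P processors.
    return ["DFS"] + caps_step(n // 2, P, M)

# Prints the chosen step types, ending in 'local multiply'.
print(caps_step(n=2**16, P=7**3, M=2**22))
```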
12 Asymptotic costs analysis (ω0 = log2 7; ℓ is the number of Strassen steps taken)
Strassen-based:
  Lower bound:  flops n^ω0 / P;  bandwidth max{ n^ω0 / (P M^{ω0/2 − 1}), n^2 / P^{2/ω0} }
  2D-Strassen:  flops n^ω0 / P^{(ω0−1)/2};  bandwidth n^2 / P^{1/2}
  Strassen-2D:  flops (7/8)^ℓ n^3 / P;  bandwidth (7/4)^ℓ n^2 / P^{1/2}
  CAPS:         flops n^ω0 / P;  bandwidth max{ n^ω0 / (P M^{ω0/2 − 1}), n^2 / P^{2/ω0} }
Classical:
  Lower bound:  flops n^3 / P;  bandwidth max{ n^3 / (P M^{1/2}), n^2 / P^{2/3} }
  2D:           flops n^3 / P;  bandwidth n^2 / P^{1/2}
  2.5D:         flops n^3 / P;  bandwidth max{ n^3 / (P M^{1/2}), n^2 / P^{2/3} }
13 Performance of CAPS
Strong scaling on Franklin (Cray XT4). Up to 184% faster than previous Strassen-based algorithms; 51%-84% faster than the best classical algorithm.
14 CAPS Summary
The CAPS matrix multiplication algorithm:
- is communication optimal: it matches the communication lower bounds and moves asymptotically less data than all existing algorithms;
- is faster, asymptotically and in practice: faster than any parallel classical algorithm can be, and faster than any parallel Strassen-based algorithm we are aware of;
- applies to other fast matrix multiplication algorithms, though there might not be any other practical ones.
15 Stability
CAPS has the same stability properties as any other Strassen (Strassen-Winograd) algorithm: a weaker stability guarantee than classical, but still normwise stable.
This can be improved through diagonal scaling. The two best scaling schemes give incomparable bounds; one can check which bound is better in O(n^2) time.
The improved error bounds match those of matrix factorizations such as classical LU and QR.
16 Diagonal Scaling
Outside scaling: D^A C D^B = (D^A A)(B D^B). Scale so each row of A and each column of B has unit norm. Explicitly:
- Let D^A_ii = ‖A(i,:)‖^{-1} and D^B_jj = ‖B(:,j)‖^{-1}.
- Scale A' = D^A A and B' = B D^B.
- Use Strassen for the product C' = A' B'.
- Unscale C = (D^A)^{-1} C' (D^B)^{-1}.
Inside scaling: C = (AD)(D^{-1} B). Scale so each column of A has the same norm as the corresponding row of B. Explicitly:
- Let D_ii = ( ‖A(:,i)‖ / ‖B(i,:)‖ )^{-1/2}.
- Scale A' = AD and B' = D^{-1} B.
- Use Strassen for the product C = A' B'.
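A NumPy sketch of both scalings, wrapped around an ordinary matrix product standing in for Strassen; the function names and the `multiply` parameter are illustrative assumptions, and zero rows or columns would need guarding.

```python
import numpy as np

def outside_scaled_multiply(A, B, multiply=np.matmul):
    """Outside scaling: unit-norm rows of A and columns of B.

    `multiply` stands in for a Strassen routine; it defaults to the
    classical product in this sketch.
    """
    dA = 1.0 / np.linalg.norm(A, axis=1)   # D^A_ii = ||A(i,:)||^{-1}
    dB = 1.0 / np.linalg.norm(B, axis=0)   # D^B_jj = ||B(:,j)||^{-1}
    C_scaled = multiply(dA[:, None] * A, B * dB[None, :])
    return C_scaled / dA[:, None] / dB[None, :]  # unscale

def inside_scaled_multiply(A, B, multiply=np.matmul):
    """Inside scaling: column i of A gets the same norm as row i of B."""
    d = (np.linalg.norm(A, axis=0) / np.linalg.norm(B, axis=1)) ** -0.5
    return multiply(A * d[None, :], B / d[:, None])

A = np.random.randn(128, 128); B = np.random.randn(128, 128)
assert np.allclose(outside_scaled_multiply(A, B), A @ B)
assert np.allclose(inside_scaled_multiply(A, B), A @ B)
```

With an exact `multiply`, scaling followed by unscaling reproduces A·B, which is what the asserts check; the stability benefit only appears when `multiply` is an actual floating-point Strassen implementation.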
18 Error bounds
|C_ij − Ĉ_ij| ≤ O(ε) f(n) · X, with X given by:
Classical: ( |A| · |B| )_ij
Outside-Inside: ‖A(i,:)‖ · ‖B(:,j)‖
Inside-Outside: ‖(AD)(i,:)‖ · ‖(D^{-1}B)(:,j)‖
Outside: ‖A(i,:)‖ · ‖B(:,j)‖ · ‖D^A A‖ · ‖B D^B‖
Inside: ‖AD‖ · ‖D^{-1}B‖
No Scaling: ‖A‖ · ‖B‖
X ≺ Y means that bound X is stronger than bound Y.
19 Scaling example: easy case
[Plot: max_ij |Ĉ_ij − C_ij| / (|A| · |B|)_ij, from 1e-16 up to 1e-06, versus the number of Strassen steps, n = 1024; curves for No scaling, Outer, Inner, Outer-Inner, Inner-Outer. Each of the following example slides shows the same axes for a different pair of 2x2-block test matrices with entries of size 1 and ε, not recoverable from the transcription.]
20 Scaling example: needs outer scaling [plot as above]
21 Scaling example: needs inner and outer [plot as above]
22 Scaling example: inner-outer better than outer-inner [plot as above]
23 Scaling example: outer-inner better than inner-outer [plot as above]
24 Scaling example: problems scaling can't fix [plot as above]
25 Error bounds (repeats slide 18)
26 Stability summary
Scaling improves the error bound of Strassen to be comparable to many other algorithms, but still not as good as the classical algorithm. It applies to other fast matrix multiplication algorithms as well.
Inner-then-outer or outer-then-inner scaling are best; one can choose between them by evaluating their bounds.
Open problem: simultaneously attain the inner and outer scaling bounds?
27 Practicality of Large Scale Fast Matrix Multiplication
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz
Thank You!
(Funding acknowledgments as on the title slide.)
28 References
G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algorithm for Strassen's matrix multiplication. Technical Report, EECS Department, UC Berkeley, March 2012. To appear in SPAA 2012.
G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. Technical Report, EECS Department, UC Berkeley, March 2012. To appear in SPAA 2012.
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '11, pages 1-12, New York, NY, USA, 2011. ACM.
B. Lipshitz, G. Ballard, O. Schwartz, and J. Demmel. Communication-avoiding parallel Strassen: Implementation and performance. Technical Report, EECS Department, UC Berkeley, May 2012. Submitted to SC 2012.
29 Extra slides
1. Previous Parallel Strassen
2. Data Layout
3. Strassen-Winograd Algorithm
4. Hardware scaling
5. Time breakdown
6. Small problems
30 Previous parallel Strassen-based algorithms
2D-Strassen [Luo, Drake '95]: run classical 2D across processors (same communication costs as classical 2D), then run Strassen locally. Can't use Strassen on the full matrix size.
Strassen-2D [Luo, Drake '95; Grayson, Shah, van de Geijn '95]: run Strassen across processors (this part can be done without communication), then run classical 2D. Communication costs grow exponentially with the number of Strassen steps.
Neither is communication optimal.
33 Data Layout
[Figure only; the layout diagram is not recoverable from the transcription.]
34 Strassen-Winograd Algorithm
C = (C11 C12; C21 C22) = A · B = (A11 A12; A21 A22)(B11 B12; B21 B22)
Q_i = S_i · T_i for i = 0, ..., 6, with:
S0 = A11         T0 = B11
S1 = A12         T1 = B21
S2 = A21 + A22   T2 = B12 − B11
S3 = S2 − A11    T3 = B22 − T2
S4 = A11 − A21   T4 = B22 − B12
S5 = A12 − S3    T5 = B22
S6 = A22         T6 = T3 − B21
U1 = Q0 + Q3, U2 = U1 + Q4, U3 = U1 + Q2
C11 = Q0 + Q1, C12 = U3 + Q5, C21 = U2 − Q6, C22 = U2 + Q2
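The formulas above transcribe directly into a one-level NumPy check (an illustrative sketch assuming even n, not the talk's implementation):

```python
import numpy as np

def winograd_step(A, B):
    """One level of the Strassen-Winograd recursion above (n even)."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The S/T combinations (15 additions in total, counting the C updates).
    S2 = A21 + A22; S3 = S2 - A11; S5 = A12 - S3
    T2 = B12 - B11; T3 = B22 - T2; T6 = T3 - B21
    # The seven products Q_i = S_i T_i.
    Q0 = A11 @ B11; Q1 = A12 @ B21; Q2 = S2 @ T2; Q3 = S3 @ T3
    Q4 = (A11 - A21) @ (B22 - B12); Q5 = S5 @ B22; Q6 = A22 @ T6
    U1 = Q0 + Q3; U2 = U1 + Q4; U3 = U1 + Q2
    C = np.empty_like(A)
    C[:h, :h] = Q0 + Q1; C[:h, h:] = U3 + Q5
    C[h:, :h] = U2 - Q6; C[h:, h:] = U2 + Q2
    return C

A = np.random.randn(200, 200); B = np.random.randn(200, 200)
assert np.allclose(winograd_step(A, B), A @ B)
```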
35 Implications for hardware scaling
Requirements so that dense matrix multiplication is computation bound (γ = time per flop, β = time per word of bandwidth, α = time per message, M = local memory in words):
            Bandwidth requirement     Latency requirement
Classic     γ M^{1/2} ≥ β             γ M^{3/2} ≥ α
Strassen    γ M^{ω0/2 − 1} ≥ β        γ M^{ω0/2} ≥ α
Strassen performs fewer flops and less communication, but is more demanding on the hardware.
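As a toy numerical check of these conditions (the machine parameters below are made up, not measurements from the talk):

```python
# Evaluate the computation-bound conditions for a hypothetical machine.
gamma = 1e-11   # seconds per flop
beta  = 4e-10   # seconds per word moved
alpha = 1e-6    # seconds per message
omega0 = 2.81   # roughly log2(7)

for M in (2**16, 2**20, 2**24):  # local memory sizes in words
    classic_bw   = gamma * M**0.5             >= beta
    classic_lat  = gamma * M**1.5             >= alpha
    strassen_bw  = gamma * M**(omega0/2 - 1)  >= beta
    strassen_lat = gamma * M**(omega0/2)      >= alpha
    # Strassen's left-hand sides are smaller, so its conditions are
    # harder to satisfy on the same hardware.
    print(M, classic_bw, classic_lat, strassen_bw, strassen_lat)
```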
36 CAPS time breakdown
[Plot: wall time in seconds versus P ∈ {49, 343, 2401} on Franklin, for a fixed n; components: local multiplications, communication, local additions, re-arrangement, other; a perfect-scaling line is shown and the strong-scaling range is marked.]
37 Performance on small problems
[Plot: execution time in seconds versus number of cores (1e1 to 1e4) on Franklin, n = 3136.]
38 Sequential recursive Strassen is communication optimal
Run the Strassen algorithm recursively. When blocks are small enough, they fit in local memory, so there is no further bandwidth cost:
W(n, M) = 7 W(n/2, M) + O(n^2) if 3n^2 > M, and W(n, M) = O(n^2) otherwise.
The solution is W(n, M) = O( n^{ω0} / M^{ω0/2 − 1} ).
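Unrolling the recurrence gives the stated solution; a short sketch of the standard reasoning (not verbatim from the slide), with ω0 = log2 7:

```latex
% Recurse k times, where 2^k = \Theta(n/\sqrt{M}), until 3(n/2^k)^2 \le M:
W(n,M) \;=\; \sum_{i=0}^{k-1} 7^i\,O\!\Big(\frac{n^2}{4^i}\Big) + 7^k\,O(M)
       \;=\; O\big(7^k M\big)
       \;=\; O\!\Big(\Big(\tfrac{n}{\sqrt{M}}\Big)^{\omega_0} M\Big)
       \;=\; O\!\Big(\frac{n^{\omega_0}}{M^{\omega_0/2-1}}\Big),
% using n^2/4^k = \Theta(M) to absorb the geometric sum into 7^k M.
```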
39 Extra slides (repeats slide 29)