2.5D algorithms for distributed-memory computing

1 2.5D algorithms for distributed-memory computing. UC Berkeley, July.

2 Outline: Introduction; Strong scaling; 2.5D LU factorization.

3 Strong scaling: solving science problems faster
- Parallel computers can solve bigger problems (weak scaling).
- Parallel computers can also solve a fixed problem faster (strong scaling).
- Obstacles to strong scaling: it may increase the relative cost of communication, and it may hurt load balance.
- How to reduce communication and maintain load balance? Reduce (minimize) communication along the critical path, and exploit the network topology.

4 Topology-aware multicasts (BG/P vs Cray)
[Figure: bandwidth (MB/sec) vs #nodes for a 1 MB multicast on BG/P, Cray XT5, and Cray XE6.]

5 2D rectangular multicast trees
[Figure: four edge-disjoint spanning trees on a 4x4 2D torus, shown separately and all combined around the root.]

6 Blocking matrix multiplication
[Figure: blocked decomposition of the matrices A and B for multiplication.]

7 2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]
[Figure: 16 processors in a 4x4 grid, each owning blocks of A and B.]
- O(n^2/p) bytes of memory
- O(n^3/p) flops
- O(n^2/√p) words moved
- O(√p) messages
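The shift-and-multiply structure of this algorithm is easy to see in a serial simulation; a minimal sketch, assuming a q x q grid with q dividing n evenly (the function cannon_mm is illustrative, not the benchmarked implementation):

```python
import numpy as np

def cannon_mm(A, B, q):
    # Simulate Cannon's algorithm on a q x q processor grid: skew A and B,
    # then alternate local block multiplies with single-step circular
    # shifts (A left, B up) for q rounds.
    n = A.shape[0]
    b = n // q
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    C = np.zeros((n, n))
    Ab = [row[i:] + row[:i] for i, row in enumerate(Ab)]             # skew row i of A left by i
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]  # skew column j of B up by j
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ab[i][j] @ Bb[i][j]   # local block product
        Ab = [row[1:] + row[:1] for row in Ab]                       # shift A left by one
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B up by one
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(cannon_mm(A, B, 4), A @ B)
```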

8 3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89], [McColl and Tiskin 99]
[Figure: 64 processors in a 4x4x4 grid.]
- O(n^3/p) flops
- O(n^2/p^(2/3)) words moved
- O(1) messages
- O(n^2/p^(2/3)) bytes of memory
- 4 copies of the matrices

9 2.5D matrix multiplication [McColl and Tiskin 99]
[Figure: 32 processors in a 4x4x2 grid.]
- O(n^3/p) flops
- O(n^2/√(cp)) words moved
- O(√(p/c^3)) messages
- O(c·n^2/p) bytes of memory
- 2 copies of the matrices

10 2.5D strong scaling
n = dimension, p = #processors, c = #copies of data, where c must satisfy 1 ≤ c ≤ p^(1/3).
Special cases: c = 1 yields the 2D algorithm; c = p^(1/3) yields the 3D algorithm.
cost(2.5D MM(p, c)) = O(n^3/p) flops + O(n^2/√(cp)) words moved + O(√(p/c^3)) messages
(ignoring log(p) factors)

11 2.5D strong scaling
With the same setup (1 ≤ c ≤ p^(1/3)):
cost(2D MM(p)) = O(n^3/p) flops + O(n^2/√p) words moved + O(√p) messages = cost(2.5D MM(p, 1))
(ignoring log(p) factors)

12 2.5D strong scaling
cost(2.5D MM(cp, c)) = O(n^3/(cp)) flops + O(n^2/(c√p)) words moved + O(√p/c) messages = cost(2D MM(p))/c
This is perfect strong scaling.
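The perfect-strong-scaling claim can be checked directly from the cost model; a minimal sketch with hypothetical names, ignoring constants and log(p) factors as the slides do:

```python
def cost_25d_mm(n, p, c):
    # asymptotic 2.5D MM costs from the slides (leading terms only)
    assert 1 <= c <= round(p ** (1.0 / 3.0))
    flops = n**3 / p
    words = n**2 / (c * p) ** 0.5
    msgs = (p / c**3) ** 0.5
    return flops, words, msgs

# c = 1 recovers the 2D costs; using c-fold more processors and c copies
# of the data divides every 2D cost term by c: perfect strong scaling.
n, p, c = 65536, 1024, 4
two_d = cost_25d_mm(n, p, 1)
two_half_d = cost_25d_mm(n, c * p, c)
assert all(abs(x - y / c) < 1e-9 * y for x, y in zip(two_half_d, two_d))
```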

13 Strong scaling matrix multiplication
[Figure: percentage of machine peak vs p for 2.5D MM, 2D MM, and ScaLAPACK PDGEMM on BG/P, n=65,536.]

14 2.5D MM on 65,536 cores
[Figure: percentage of flops peak vs n for 2.5D MM and 2D MM on 16,384 nodes of BG/P.]

15 Cost breakdown of MM on 65,536 cores
[Figure: execution time normalized by 2D, split into communication, idle, and computation, for 2D and 2.5D MM on 16,384 nodes of BG/P; 2.5D achieves a 95% reduction in communication.]

16 2.5D recursive LU
A = LU, where L is lower-triangular and U is upper-triangular.
A 2.5D recursive LU algorithm with no pivoting is given by [A. Tiskin 2002]. Tiskin gives the algorithm under the BSP (Bulk Synchronous Parallel) model, which considers communication and synchronization. We give an alternative distributed-memory adaptation and implementation. We also lower-bound the latency cost.
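For reference, a serial sketch of the recursive LU skeleton that the 2.5D algorithm distributes; no pivoting, so the example matrix is made diagonally dominant (this is only the mathematical recursion, not Tiskin's BSP algorithm or our distributed implementation):

```python
import numpy as np

def recursive_lu(A):
    # Recursive LU without pivoting: factor the leading block, solve for
    # the panels, update the Schur complement, recurse on it.
    n = A.shape[0]
    if n == 1:
        return np.array([[1.0]]), A.copy()
    k = n // 2
    L11, U11 = recursive_lu(A[:k, :k])
    U12 = np.linalg.solve(L11, A[:k, k:])        # L11 U12 = A12
    L21 = np.linalg.solve(U11.T, A[k:, :k].T).T  # L21 U11 = A21
    S = A[k:, k:] - L21 @ U12                    # Schur complement update
    L22, U22 = recursive_lu(S)
    L = np.block([[L11, np.zeros((k, n - k))], [L21, L22]])
    U = np.block([[U11, U12], [np.zeros((n - k, k)), U22]])
    return L, U

A = np.random.rand(8, 8) + 8 * np.eye(8)  # diagonally dominant: safe without pivoting
L, U = recursive_lu(A)
assert np.allclose(L @ U, A)
```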

17 2D blocked LU factorization
[Figure: the input matrix A.]

18 2D blocked LU factorization
[Figure: the top-left block is factored into L₀₀ and U₀₀.]

19 2D blocked LU factorization
[Figure: the next step of the blocked factorization.]

20 2D blocked LU factorization
[Figure: Schur complement update S = A - L·U on the trailing matrix.]

21 2D block-cyclic decomposition
[Figure: block-cyclic assignment of matrix blocks to the processor grid.]

22 2D block-cyclic LU factorization
[Figure: factorization of the leading block on the block-cyclic layout.]

23 2D block-cyclic LU factorization
[Figure: the next step of the block-cyclic factorization.]

24 2D block-cyclic LU factorization
[Figure: Schur complement update S = A - L·U on the block-cyclic layout.]

25 A new latency lower bound for LU
Relate volume to surface area to diameter. For block size n/d, LU must perform, along the critical path through the diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1}:
- Ω(n^3/d^2) flops
- Ω(n^2/d) words
- Ω(d) messages
Now pick d (= latency cost): d = Ω(√p) to minimize flops, d = Ω(√(cp)) to minimize words.
More generally, latency × bandwidth = Ω(n^2).
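Written out, the counting argument behind these bounds is short; a sketch in the slide's notation:

```latex
% d diagonal blocks of size n/d lie on the critical path; each must be
% factored (about (n/d)^3 flops), communicated (about (n/d)^2 words),
% and synchronized (at least one message) before the next can start:
F = \Omega\bigl(d \, (n/d)^3\bigr) = \Omega(n^3/d^2), \qquad
W = \Omega\bigl(d \, (n/d)^2\bigr) = \Omega(n^2/d), \qquad
S = \Omega(d).
% Eliminating d yields the latency-bandwidth tradeoff
S \cdot W = \Omega(n^2).
% Matching the flop bound F = O(n^3/p) forces d = \Omega(\sqrt{p});
% matching the 2.5D bandwidth bound W = O(n^2/\sqrt{cp}) forces d = \Omega(\sqrt{cp}).
```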

26 2.5D LU factorization
[Figure, step (A): the top-left block is factored into L₀₀ and U₀₀, which are used to compute the corresponding L and U panel blocks.]

27 2.5D LU factorization
[Figure: steps (A) and (B) of the 2.5D LU algorithm.]

28 2.5D LU factorization
[Figure: steps (A) through (D) of the 2.5D LU algorithm.]

29 2.5D LU factorization
Look at how this update is distributed. What does it remind you of?

30 2.5D LU factorization
Look at how this update is distributed: it is the same 3D update as in matrix multiplication. [Figure.]

31 2.5D LU strong scaling (without pivoting)
[Figure: percentage of machine peak vs p for 2.5D LU (no pivoting) and 2D LU (no pivoting) on BG/P, n=65,536.]

32 2.5D LU with pivoting
A = P·L·U, where P is a permutation matrix.
- 2.5D generic pairwise elimination (neighbor/pairwise pivoting, or Givens rotations for QR) is given by [A. Tiskin 2007].
- Pairwise pivoting does not produce an explicit L.
- Pairwise pivoting may have stability issues for large matrices.
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly.
- We pass up rows of A instead of rows of U to avoid error accumulation.

33 Tournament pivoting (CA-pivoting)
{P, L, U} = CA-pivot(A, n)
if n ≤ b then
    (base case) {P, L, U} = partial-pivot(A)
else
    (recursive case)
    [A₁ᵀ, A₂ᵀ]ᵀ = A
    {P₁, L₁, U₁} = CA-pivot(A₁);  [R₁ᵀ, R₂ᵀ]ᵀ = P₁ᵀ A₁
    {P₂, L₂, U₂} = CA-pivot(A₂);  [S₁ᵀ, S₂ᵀ]ᵀ = P₂ᵀ A₂
    {Pᵣ, L, U} = partial-pivot([R₁ᵀ, S₁ᵀ]ᵀ)
    form P from Pᵣ, P₁, and P₂
end if

34 Tournament pivoting
Partial pivoting is not communication-optimal on a blocked matrix: it requires a message/synchronization for each column, i.e. O(n) messages. Tournament pivoting is communication-optimal: it performs a tournament to determine the best pivot-row candidates, and passes up the best rows of A.
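A serial sketch of the tournament on a tall panel, following the CA-pivot pseudocode above (partial_pivot_perm and tournament_pivot_rows are hypothetical helper names, and only the pivot-row selection is shown, not the distributed reduction tree):

```python
import numpy as np

def partial_pivot_perm(A):
    # Row permutation chosen by Gaussian elimination with partial pivoting.
    A = A.astype(float).copy()
    m, n = A.shape
    perm = np.arange(m)
    for j in range(min(m, n)):
        p = j + np.argmax(np.abs(A[j:, j]))
        A[[j, p]], perm[[j, p]] = A[[p, j]], perm[[p, j]]  # swap pivot row up
        if A[j, j] != 0.0:
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
    return perm

def tournament_pivot_rows(A, b):
    # Recursively reduce candidate pivot rows: each half nominates its
    # best b rows, and a play-off partial pivoting picks the winners.
    if A.shape[0] <= 2 * b:
        return A[partial_pivot_perm(A)[:b]]
    half = A.shape[0] // 2
    R1 = tournament_pivot_rows(A[:half], b)
    R2 = tournament_pivot_rows(A[half:], b)
    stacked = np.vstack([R1, R2])
    return stacked[partial_pivot_perm(stacked)[:b]]

panel = np.random.rand(64, 4)
print(tournament_pivot_rows(panel, 4))  # the 4 selected pivot rows
```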

35 2.5D LU factorization with tournament pivoting
[Figure: tournament pivoting performed over the replicated panels PA₀, PA₁, PA₂, PA₃.]

36 2.5D LU factorization with tournament pivoting
[Figure: the selected pivot rows are applied, and the panel blocks of L and U are computed.]

37 2.5D LU factorization with tournament pivoting
[Figure: the trailing matrix is updated on each copy of the grid.]

38 2.5D LU factorization with tournament pivoting
[Figure: the update completes and the factorization proceeds on the next panel.]

39 Strong scaling of 2.5D LU with tournament pivoting
[Figure: percentage of machine peak vs p for 2.5D LU (CA-pivoting), 2D LU (CA-pivoting), and ScaLAPACK PDGETRF on BG/P, n=65,536.]

40 2.5D LU on 65,536 cores
[Figure: time (sec) on 16,384 nodes of BG/P (n=131,072), split into communication, idle, and compute, for 2D and 2.5D LU with no pivoting and with CA-pivoting; 2.5D LU is 2x faster in both cases.]

41 2.5D QR factorization
A = QR, where Q is orthogonal and R is upper-triangular.
2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]. Tiskin minimizes latency and bandwidth by working on slanted panels. 2.5D QR cannot be done with right-looking updates as in 2.5D LU, due to the non-commutativity of orthogonalization updates.

42 2.5D QR using the YT representation
The YT representation of Householder QR factorization is more work-efficient when computing only R. We give an algorithm that performs 2.5D QR using the YT representation. The algorithm performs left-looking updates on Y. Householder with YT requires less computation (roughly 2x) than Givens rotations. Our approach achieves optimal bandwidth cost, but has O(n) latency.
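To make the YT representation concrete, here is a dense Householder QR that accumulates Y and T so that Q = I - Y·T·Yᵀ; a minimal serial sketch using the standard compact-WY recurrence (not the 2.5D algorithm itself):

```python
import numpy as np

def householder_yt(A):
    # Householder QR in YT form: A = (I - Y T Y^T) R, with T built
    # column by column via the compact-WY update.
    m, n = A.shape
    R = A.astype(float).copy()
    Y = np.zeros((m, n))
    T = np.zeros((n, n))
    for j in range(n):
        v = np.zeros(m)
        v[j:] = R[j:, j]
        v[j] += np.copysign(np.linalg.norm(R[j:, j]), R[j, j])  # avoid cancellation
        v /= np.linalg.norm(v)
        tau = 2.0                      # v is unit-norm, so H = I - 2 v v^T
        R -= tau * np.outer(v, v @ R)  # apply the reflector to R
        T[:j, j] = -tau * T[:j, :j] @ (Y[:, :j].T @ v)  # WY recurrence
        T[j, j] = tau
        Y[:, j] = v
    return Y, T, R

A = np.random.rand(6, 4)
Y, T, R = householder_yt(A)
Q = np.eye(6) - Y @ T @ Y.T
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(6))
```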

43 2.5D QR using the YT representation
[Figure: the 2.5D QR algorithm with the YT representation.]

44 Latency-optimal 2.5D QR
To reduce latency, we can employ the TSQR algorithm:
1. Given an n-by-b panel, partition it into 2b-by-b blocks.
2. Perform QR on each 2b-by-b block.
3. Stack the computed R's into an n/2-by-b panel and recurse.
4. Q is given in a hierarchical representation.
We need the YT representation from the hierarchical Q: take Y to be the first b columns of Q minus the identity, and define T = (YᵀY)⁻¹. This sacrifices the triangular structure of T. Conjecture: this is stable if the diagonal elements of Q are selected to be negative.
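A serial sketch of the R-factor reduction in TSQR, assuming local QRs via np.linalg.qr and omitting the hierarchical Q that the slide then converts to YT form:

```python
import numpy as np

def tsqr_r(A, block_rows):
    # TSQR reduction for the R factor of a tall-skinny matrix:
    # QR-factor each block of rows, stack the small R factors, recurse.
    m, n = A.shape
    if m <= block_rows:
        return np.linalg.qr(A)[1]
    blocks = [A[i:i + block_rows] for i in range(0, m, block_rows)]
    stacked = np.vstack([np.linalg.qr(blk)[1] for blk in blocks])
    return tsqr_r(stacked, block_rows)

A = np.random.rand(64, 4)
R = tsqr_r(A, 8)  # 2b-by-b blocks with b = 4
# R agrees with a direct QR up to the signs of its rows
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1]))
```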

45 All-pairs shortest-paths (APSP)
Given an input graph G = (V, E), find the shortest paths between each pair of nodes vᵢ, vⱼ. The problem reduces to semiring matrix multiplication with a dependency along k, and its computational structure is similar to LU factorization.
Semiring matrix multiplication (SMM): replace scalar multiply with scalar add, and replace scalar add with scalar min. Depending on the processor, this can require more instructions.

46 A recursive algorithm for all-pairs shortest-paths
Known algorithm for recursively computing APSP (each SMM accumulates into its output block with ⊕ = min; see the sketch below):
1. Given the adjacency matrix A of graph G
2. Recurse on block A₁₁
3. Compute SMM: A₁₂ ← A₁₁ · A₁₂
4. Compute SMM: A₂₁ ← A₂₁ · A₁₁
5. Compute SMM: A₂₂ ← A₂₁ · A₁₂
6. Recurse on block A₂₂
7. Compute SMM: A₂₁ ← A₂₂ · A₂₁
8. Compute SMM: A₁₂ ← A₁₂ · A₂₂
9. Compute SMM: A₁₁ ← A₁₂ · A₂₁
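A serial sketch of SMM and of the nine steps above over the (min, +) semiring; it assumes a zero-diagonal adjacency matrix with np.inf marking absent edges, and folds the min-accumulation into each update:

```python
import numpy as np

def smm(A, B):
    # semiring matrix multiplication: + becomes min, * becomes +
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def dc_apsp(A):
    # divide-and-conquer APSP following steps 1-9 on this slide
    n = A.shape[0]
    if n == 1:
        return A
    k = n // 2
    A = A.copy()
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    A11[:] = dc_apsp(A11)
    A12[:] = smm(A11, A12)                  # zero diagonal of A11 keeps old A12
    A21[:] = smm(A21, A11)
    A22[:] = np.minimum(A22, smm(A21, A12))
    A22[:] = dc_apsp(A22)
    A21[:] = smm(A22, A21)
    A12[:] = smm(A12, A22)
    A11[:] = np.minimum(A11, smm(A12, A21))
    return A

INF = np.inf
G = np.array([[0., 3., INF, 7.],
              [8., 0., 2., INF],
              [5., INF, 0., 1.],
              [2., INF, INF, 0.]])
print(dc_apsp(G))  # matrix of shortest-path distances
```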

47 Block-cyclic recursive parallelization
[Figure: cyclic and blocked recursive steps of the decomposition over a 4x4 processor grid (P11 ... P44), on submatrices of size n/2 and n/4.]

48 2.5D APSP
The 2.5D recursive parallelization is straightforward:
- Perform cyclic steps using a 2.5D process grid.
- Decompose blocked steps using an octant of the grid.
- Switch to the 2D algorithm when the grid is 2D.
This minimizes latency and bandwidth!

49 2.5D APSP strong scaling performance
[Figure: strong scaling of DC-APSP on Hopper; GFlops vs number of compute nodes (p), for 2.5D and 2D with n=32,768 and n=8,192.]

50 Block size gives a latency-bandwidth tradeoff
[Figure: performance (GFlops) of 2D DC-APSP on 256 nodes of Hopper (n=8,192) as a function of block size b.]

51 Blocked vs block-cyclic vs cyclic decompositions
[Figure: blocked, block-cyclic, and cyclic layouts of a symmetric matrix; green denotes fill (unique values), red denotes padding / load imbalance.]

52 Cyclops Tensor Framework (CTF)
Big idea: decompose tensors cyclically among processors, and pick the cyclic phase to preserve partial or full symmetric structure.
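Why the cyclic phase preserves symmetry is visible in a toy example; a minimal sketch for a symmetric matrix with the same (assumed) phase along both dimensions:

```python
import numpy as np

n, phase = 8, 2                 # assumed matrix size and cyclic phase
A = np.random.rand(n, n)
A = np.minimum(A, A.T)          # make A symmetric
for pi in range(phase):
    for pj in range(phase):
        local = A[pi::phase, pj::phase]   # block owned by processor (pi, pj)
        mirror = A[pj::phase, pi::phase]  # block owned by processor (pj, pi)
        # With equal phases, the two blocks are transposes of each other,
        # so only the pi <= pj half of the blocks needs to be stored.
        assert np.array_equal(local, mirror.T)
```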

53 3D tensor contraction
[Figure: a 3D tensor contraction.]

54 3D tensor cyclic decomposition
[Figure: cyclic decomposition of a 3D tensor.]

55 3D tensor mapping
[Figure: mapping of a 3D tensor onto a processor grid.]

56 A cyclic layout is still challenging
In order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase. The contracted dimensions of A and B must also be mapped with the same phase. And yet the virtual mapping needs to be mapped onto a physical topology, which can be any shape.

57 Virtual processor grid dimensions
Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted. Virtual processor grid dimensions serve as a new level of indirection: if a tensor dimension must have a certain cyclic phase, we adjust the physical mapping by creating a virtual processor dimension. This allows the physical processor grid to be stretched.
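A toy illustration of this indirection: when two dimensions must share a cyclic phase but the physical grid is rectangular, a stretched virtual grid restores equal phases; a sketch assuming the 2x3 grid of the next slide (requires Python 3.9+ for math.lcm):

```python
from math import lcm

phys = (2, 3)              # physical processor grid (assumed)
phase = lcm(*phys)         # common cyclic phase for both dimensions
virt = (phase, phase)      # square virtual grid with equal phases
vblocks = (virt[0] // phys[0]) * (virt[1] // phys[1])
print(virt, vblocks)       # (6, 6): each processor owns 6 virtual blocks
```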

58 Virtual processor grid construction
[Figure: matrix multiply on a 2x3 processor grid; red lines represent the virtualized part of the processor grid, and elements are assigned to blocks by cyclic phase.]

59 2.5D algorithms for tensors
We incorporate data replication for communication minimization into CTF:
- Replicate only one tensor/matrix (minimizes bandwidth, but not latency).
- Autotune over mappings to all possible physical topologies.
- Select the mapping with the least amount of communication.
- Achieve minimal communication for tensors of widely different sizes.

60 Preliminary contraction results on Blue Gene/P
[Figure: Cyclops TF strong scaling; percentage of theoretical symmetric peak vs #nodes, for 4D symmetric (CCSD), 6D symmetric (CCSDT), and 8D symmetric (CCSDTQ) contractions.]

61 Preliminary Coupled Cluster results on Blue Gene/Q
A Coupled Cluster with Double excitations (CCD) implementation is up and running. It has already been scaled on up to 1024 nodes of BG/Q, with up to 400 virtual orbitals. Preliminary results already indicate performance matching NWChem. Several major optimizations are still in progress; we expect significantly better scalability than any existing software.

62 Our contributions:
- 2.5D mapping of matrix multiplication
- A new latency lower bound for LU
- Communication-optimal 2.5D LU, QR, and APSP
  - optimal according to the lower bounds of [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
  - both are bandwidth-optimal according to the general lower bound of [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to the new lower bound
- Cyclops Tensor Framework
  - runtime autotuning to minimize communication
  - topology-aware mapping in any dimension, with symmetry considerations

63 Backup slides: rectangular collectives

64 Performance of multicast (BG/P vs Cray)
[Figure: bandwidth (MB/sec) vs #nodes for a 1 MB multicast on BG/P, Cray XT5, and Cray XE6, as on slide 4.]

65 Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts: form a spanning tree from a list of nodes and route copies of the message down each branch; network contention degrades utilization on a 3D torus.
- BG/P uses rectangular multicasts: these require the network topology to be a k-ary n-cube, form 2n edge-disjoint spanning trees, route each in a different dimensional order, and use both directions of the bidirectional network.
