Level-2 BLAS
Matrix-vector operations with $O(n^2)$ operations (sequentially).
BLAS notation: S = single precision, GE = general matrix, MV = matrix-vector.
This defines SGEMV, the matrix-vector product: $y = \alpha A x + \beta y$.
Other Level-2 BLAS: solving a triangular system $Lx = b$ with a triangular matrix $L$.

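Level-3 BLAS
Matrix-matrix operations with $O(n^3)$ operations (sequentially).
BLAS notation: S = single precision, GE = general matrix, MM = matrix-matrix.
This defines SGEMM, the matrix-matrix product: $C = \alpha A B + \beta C$.
LAPACK: subroutines for solving linear equations, least squares problems, QR decomposition, eigenvalues, and singular values, based on BLAS.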

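2.2 Analysis of Matrix-Vector Product
$A = (a_{ij})_{i=1,\dots,n;\; j=1,\dots,m} \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^{m}$, $c \in \mathbb{R}^{n}$.

2.2.1 Vectorization
$$c = Ab = \begin{pmatrix} \sum_{j=1}^{m} a_{1j} b_j \\ \vdots \\ \sum_{j=1}^{m} a_{nj} b_j \end{pmatrix} = \begin{pmatrix} A_{1\cdot}\, b \\ \vdots \\ A_{n\cdot}\, b \end{pmatrix}$$
$n$ DOT products of length $m$.
$$c = Ab = \sum_{j=1}^{m} b_j A_{\cdot j}$$
$m$ SAXPYs of length $n$ (GAXPY).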

Pseudocode, ij-form:
    c = 0;
    for i = 1,...,n
        for j = 1,...,m
            c_i = c_i + a_ij * b_j
        end
    end
DOT product: $c_i = A_{i\cdot}\, b$, the dot product of the i-th row of $A$ with the vector $b$.

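Pseudocode, ji-form:
    c = 0;
    for j = 1,...,m
        for i = 1,...,n
            c_i = c_i + a_ij * b_j
        end
    end
SAXPY: $c = c + b_j A_{\cdot j}$, updating the vector $c$ with the j-th column of $A$.
GAXPY: a sequence of SAXPYs related to the same vector.
Advantage: the vector $c$ that is updated can be kept in fast memory. No additional data transfer.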
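The two loop orders can be compared side by side. The following is a minimal sketch in plain Python, assuming matrices stored as lists of rows; the function names `matvec_ij` and `matvec_ji` are illustrative and not taken from the slides.

```python
def matvec_ij(A, b):
    """ij-form: each c[i] is one DOT product of row i of A with b."""
    n, m = len(A), len(b)
    c = [0.0] * n
    for i in range(n):
        for j in range(m):
            c[i] += A[i][j] * b[j]
    return c


def matvec_ji(A, b):
    """ji-form: c is updated by one SAXPY per column of A (a GAXPY overall)."""
    n, m = len(A), len(b)
    c = [0.0] * n
    for j in range(m):
        for i in range(n):
            c[i] += b[j] * A[i][j]
    return c


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
b = [1.0, 0.0, -1.0]
print(matvec_ij(A, b), matvec_ji(A, b))
```

Both functions perform the same multiplications; only the order in which `c` is touched differs, which is what decides between DOT and SAXPY/GAXPY on a vector machine.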

GAXPY
SAXPY: $x := x + \alpha y$
GAXPY:
    x_0 := x;
    for i = 1:n
        x_i := x_{i-1} + alpha_i * y_i;
    end
    x := x_n
A series of SAXPYs regarding the same vector $x$. Advantage: less data transfer.

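2.2.2 Parallelization by building blocks
Reduce the matrix-vector product to smaller matrix-vector products.
$$\{1, 2, \dots, n\} = \bigcup_{r=1}^{R} I_r, \qquad \{1, 2, \dots, m\} = \bigcup_{s=1}^{S} J_s,$$
with disjoint index sets: $I_j \cap I_k = \emptyset$ and $J_j \cap J_k = \emptyset$ for $j \ne k$.
Use a 2-dimensional array of processors $P_{rs}$. Processor $P_{rs}$ gets the matrix block $A_{rs} := A(I_r, J_s)$, $b_s := b(J_s)$, and $c_r := c(I_r)$:
$$c_r = \sum_{s=1}^{S} A_{rs} b_s = \sum_{s=1}^{S} c_r^{(s)} \quad \text{for } r = 1, \dots, R.$$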

Pseudocode:
    for r = 1,...,R
        for s = 1,...,S
            c_r^(s) = A_rs * b_s;
        end
    end
Small, independent matrix-vector products. No communication necessary during the computations!
    for r = 1,...,R
        c_r = 0;
        for s = 1,...,S
            c_r = c_r + c_r^(s);
        end
    end
Blockwise collection and addition of vectors. Rowwise communication! Fan-in.
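The two phases above (independent block products, then a rowwise fan-in) can be simulated sequentially. This is a sketch under assumptions: contiguous index blocks, 0-based indices, and hypothetical helper names `split` and `block_matvec`; each `(r, s)` pair stands for the work of one processor $P_{rs}$.

```python
def split(n, parts):
    """Split range(n) into `parts` contiguous, disjoint index blocks."""
    bounds = [n * p // parts for p in range(parts + 1)]
    return [range(bounds[p], bounds[p + 1]) for p in range(parts)]


def block_matvec(A, b, R, S):
    n, m = len(A), len(b)
    I, J = split(n, R), split(m, S)

    # Phase 1: independent partial products c_r^(s) = A_rs * b_s.
    partial = {}
    for r in range(R):
        for s in range(S):
            partial[r, s] = [sum(A[i][j] * b[j] for j in J[s]) for i in I[r]]

    # Phase 2: fan-in along each processor row, c_r = sum over s of c_r^(s).
    c = []
    for r in range(R):
        c_r = [0.0] * len(I[r])
        for s in range(S):
            c_r = [x + y for x, y in zip(c_r, partial[r, s])]
        c.extend(c_r)
    return c


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
b = [1, 2, 3]
print(block_matvec(A, b, R=2, S=2))
```

With `S=1` the second phase degenerates to a copy (no communication), and with `R=1` all the work after phase 1 is the fan-in, matching the two special cases discussed next.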

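Special case $S = 1$:
$$c = Ab = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_R \end{pmatrix} b = \begin{pmatrix} A_1 b \\ A_2 b \\ \vdots \\ A_R b \end{pmatrix}$$
No communication is necessary between the processors $P_1, \dots, P_R$. The computation of $A_r b$ is vectorizable by GAXPYs.

Special case $R = 1$:
$$c = Ab = (A_1\; A_2\; \cdots\; A_S) \begin{pmatrix} b_1 \\ \vdots \\ b_S \end{pmatrix} = \sum_{s=1}^{S} A_s b_s$$
The products $A_s b_s$ are independent. The partial results are collected from the processors $P_1, \dots, P_S$ by fan-in. The final sum on one processor is vectorizable by GAXPYs.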

Rules
1. Inner loops of a program should be simple and vectorizable.
2. Outer loops of a program should be substantial, independent, and parallelizable.
    for ...   (substantial and parallelizable)
        for ...   (simple, vectorizable)
        end
    end
3. Reuse of data: cache, minimal data transfer, blocking.

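2.2.3 $c = Ab$ for Banded Matrix
$$A = \begin{pmatrix}
a_{11} & \cdots & a_{1,1+\beta} & & 0 \\
\vdots & \ddots & & \ddots & \\
a_{1+\beta,1} & & \ddots & & a_{n-\beta,n} \\
 & \ddots & & \ddots & \vdots \\
0 & & a_{n,n-\beta} & \cdots & a_{nn}
\end{pmatrix}$$
Bandwidth $\beta$ (symmetric): $2\beta + 1$ diagonals = 1 main diagonal + $\beta$ subdiagonals + $\beta$ superdiagonals.
$\beta = 1$: tridiagonal.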

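Storing the entries diagonalwise: an $n \times (2\beta + 1)$ matrix instead of $n^2$ entries.
$$\tilde a_{i,s} := a_{i,i+s} \quad \text{for row } i \text{ and diagonal } s, \qquad s = -\beta, \dots, 0, \dots, \beta.$$
Row $i$ of the stored matrix is $(\tilde a_{i,-\beta}, \dots, \tilde a_{i,0}, \dots, \tilde a_{i,\beta})$; positions that fall outside of $A$ are unused.
Index ranges:
For a fixed row $i$: $s \in [l_i, r_i] = [\max\{-\beta, 1 - i\},\; \min\{\beta, n - i\}]$.
For a fixed diagonal $s$: $i \in [l_s, r_s] = [\max\{1, 1 - s\},\; \min\{n, n - s\}]$.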

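Computation of the matrix-vector product based on this storage scheme on vector processors:
$$c_i = \sum_{s = l_i}^{r_i} \tilde a_{i,s}\, b_{i+s} \quad \text{for } i = 1, \dots, n.$$
Diagonalwise:
    For s = -β : β
        For i = max{1-s, 1} : min{n-s, n}
            c_i = c_i + ã_{i,s} * b_{i+s}
        end
    end
The inner loop is a general triad, not a SAXPY.
Or rowwise:
    For i = 1 : n
        For s = max{-β, 1-i} : min{β, n-i}
            c_i = c_i + ã_{i,s} * b_{i+s}
        end
    end
The inner loop is a partial DOT product.
Sparsity means fewer operations, but also a loss of efficiency.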
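Both loop orders for the diagonalwise storage scheme can be written down directly. The sketch below assumes 0-based indices, so the bounds become $-i \le s \le n-1-i$, and stores $\tilde a_{i,s}$ at column `s + beta`; the names `to_band`, `band_matvec_diag`, and `band_matvec_row` are hypothetical.

```python
def to_band(A, beta):
    """Store a[i][i+s] at At[i][s + beta] for s = -beta..beta (0-based i)."""
    n = len(A)
    At = [[0.0] * (2 * beta + 1) for _ in range(n)]
    for i in range(n):
        for s in range(max(-beta, -i), min(beta, n - 1 - i) + 1):
            At[i][s + beta] = A[i][i + s]
    return At


def band_matvec_diag(At, b, beta):
    """Diagonal s outer, row i inner: one elementwise triad per diagonal."""
    n = len(b)
    c = [0.0] * n
    for s in range(-beta, beta + 1):
        for i in range(max(-s, 0), min(n - s, n)):
            c[i] += At[i][s + beta] * b[i + s]
    return c


def band_matvec_row(At, b, beta):
    """Row i outer, diagonal s inner: one partial DOT product per row."""
    n = len(b)
    c = [0.0] * n
    for i in range(n):
        for s in range(max(-beta, -i), min(beta, n - 1 - i) + 1):
            c[i] += At[i][s + beta] * b[i + s]
    return c


A = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
b = [1, 2, 3, 4]
At = to_band(A, beta=1)
print(band_matvec_diag(At, b, 1), band_matvec_row(At, b, 1))
```

In the diagonal form the inner loop runs over vectors of length close to $n$, while in the row form the inner loop has length at most $2\beta + 1$, which illustrates why short bands vectorize poorly in the row form.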

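Band A, Parallel
Partitioning:
$$\{1, \dots, n\} = \bigcup_{r=1}^{R} I_r, \quad \text{disjoint}$$
    for i ∈ I_r
        c_i = sum_{s = l_i}^{r_i} ã_{i,s} * b_{i+s};
    end
Processor $P_r$ gets the rows of the index set $I_r := [m_r, M_r]$ in order to compute its part of the final vector $c$.
What part of the vector $b$ does processor $P_r$ need in order to compute its part of $c$?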

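Necessary for $I_r$: the entries $b_j$ with $j = i + s$, where
$$j \ge m_r + l_{m_r} = m_r + \max\{-\beta,\, 1 - m_r\} = \max\{m_r - \beta,\, 1\},$$
$$j \le M_r + r_{M_r} = M_r + \min\{\beta,\, n - M_r\} = \min\{M_r + \beta,\, n\}.$$
Processor $P_r$ with index set $I_r$ needs from $b$ the indices
$$j \in [\max\{1,\, m_r - \beta\},\; \min\{n,\, M_r + \beta\}].$$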
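The closed-form range can be checked against a brute-force enumeration of all indices $i + s$ touched by the rows of $I_r$. A small sketch with 1-based inclusive indices as on the slide; both function names are illustrative.

```python
def needed_b_range(m_r, M_r, beta, n):
    """Closed form: indices of b needed for rows m_r..M_r (1-based, inclusive)."""
    return max(1, m_r - beta), min(n, M_r + beta)


def needed_b_range_bruteforce(m_r, M_r, beta, n):
    """Enumerate every j = i + s with l_i <= s <= r_i and take min and max."""
    js = [i + s
          for i in range(m_r, M_r + 1)
          for s in range(max(-beta, 1 - i), min(beta, n - i) + 1)]
    return min(js), max(js)


print(needed_b_range(4, 6, beta=2, n=10))   # prints (2, 8)
```

A processor owning rows 4 to 6 with $\beta = 2$ thus needs $b_2, \dots, b_8$: its own slice plus $\beta$ entries from each neighbour.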

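2.3 Analysis of Matrix-Matrix Product
$A = (a_{ik})_{i=1,\dots,n;\; k=1,\dots,m} \in \mathbb{R}^{n \times m}$, $B = (b_{kj})_{k=1,\dots,m;\; j=1,\dots,q} \in \mathbb{R}^{m \times q}$, $C = AB = (c_{ij})_{i=1,\dots,n;\; j=1,\dots,q} \in \mathbb{R}^{n \times q}$.
    For i = 1 : n
        For j = 1 : q
            c_ij = sum_{k=1}^{m} a_ik * b_kj
        end
    end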

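2.3.1 Vectorization
(ijk)-form, Algorithm 1:
    For i = 1 : n
        For j = 1 : q
            For k = 1 : m
                c_ij = c_ij + a_ik * b_kj;
            end
        end
    end
DOT product of length $m$: $c_{ij} = A_{i\cdot}\, B_{\cdot j}$ for all $i, j$.
All entries $c_{ij}$ are fully computed, one after another.
Access to $A$ and $C$ is rowwise, to $B$ columnwise.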

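Another view of the matrix-matrix product:
$$A = \sum_{k=1}^{m} A_{\cdot k}\, e_k^T = \sum_{i=1}^{n} e_i\, A_{i\cdot}$$
The matrix $A$ is considered as a combination of columns or of rows.
$$AB = \Big( \sum_{k=1}^{m} A_{\cdot k}\, e_k^T \Big) \Big( \sum_{l=1}^{m} e_l\, B_{l\cdot} \Big) = \sum_{k=1}^{m} A_{\cdot k}\, B_{k\cdot}$$
$AB$ is a sum of full $n \times q$ matrices $A_{\cdot k} B_{k\cdot}$, each the outer product of the k-th column of $A$ and the k-th row of $B$.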

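(jki)-form, Algorithm 2:
    For j = 1 : q
        For k = 1 : m
            For i = 1 : n
                c_ij = c_ij + a_ik * b_kj;
            end
        end
    end
Vector update $c_{\cdot j} = c_{\cdot j} + b_{kj}\, a_{\cdot k}$: SAXPY.
GAXPY: a sequence of SAXPYs for the same vector $c_{\cdot j}$.
$C$ is computed columnwise; access to $A$ is columnwise.
Access to $B$ is columnwise, but delayed.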

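(kji)-form, Algorithm 3:
    For k = 1 : m
        For j = 1 : q
            For i = 1 : n
                c_ij = c_ij + a_ik * b_kj;
            end
        end
    end
Vector update $c_{\cdot j} = c_{\cdot j} + b_{kj}\, a_{\cdot k}$: SAXPY.
No GAXPY: a sequence of SAXPYs for different vectors $c_{\cdot j}$.
Access to $A$ is columnwise. Access to $B$ is delayed.
$C$ is computed with intermediate values $c_{ij}^{(k)}$, which are computed columnwise.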
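All loop orders compute the same product and differ only in the memory access pattern. The sketch below makes the order a parameter; it assumes the index convention $c_{ij} = \sum_k a_{ik} b_{kj}$ with $i \le n$, $k \le m$, $j \le q$, and the function name `matmul` with its `order` argument is purely illustrative.

```python
def matmul(A, B, order="ijk"):
    """Triple-loop product C = A*B; `order` is any permutation of 'ijk'."""
    n, m, q = len(A), len(B), len(B[0])
    C = [[0.0] * q for _ in range(n)]
    ranges = {"i": range(n), "j": range(q), "k": range(m)}
    for x in ranges[order[0]]:
        for y in ranges[order[1]]:
            for z in ranges[order[2]]:      # innermost index = vector operation
                idx = dict(zip(order, (x, y, z)))
                i, j, k = idx["i"], idx["j"], idx["k"]
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2], [3, 4], [5, 6]]          # n = 3, m = 2
B = [[1, 0, 2, 1], [0, 1, 1, 1]]      # m = 2, q = 4
for order in ("ijk", "jki", "kji"):
    print(order, matmul(A, B, order))
```

With `order="ijk"` the innermost loop is the DOT product of Algorithm 1; with `"jki"` and `"kji"` it is the column SAXPY of Algorithms 2 and 3.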

Overview of the different forms

    Form                 ijk      ikj      kij      jik      jki      kji
    Access to A by       row      -        -        row      column   column
    Access to B by       column   row      row      column   -        -
    Computation of C     row      row      row      column   column   column
    Computation of c_ij  direct   delayed  delayed  direct   delayed  delayed
    Vector operation     DOT      GAXPY    SAXPY    DOT      GAXPY    SAXPY
    Vector length        m        q        q        m        n        n

Better: GAXPY, and the longer vector length.
Access to the matrices should follow the storage scheme (rowwise or columnwise).

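2.3.2 Matrix-Matrix Parallel
$$\{1, \dots, n\} = \bigcup_{r=1}^{R} I_r, \qquad \{1, \dots, m\} = \bigcup_{s=1}^{S} K_s, \qquad \{1, \dots, q\} = \bigcup_{t=1}^{T} J_t$$
Distribute the blocks relative to the index sets $I_r$, $K_s$, and $J_t$ to the processor array $P_{rst}$:
1. Processor $P_{rst}$ computes a small matrix-matrix product, all processors in parallel:
$$c_{rt}^{(s)} = A_{rs} B_{st}$$
2. Compute the sum by fan-in in $s$:
$$c_{rt} = \sum_{s=1}^{S} c_{rt}^{(s)}$$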
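The two steps can be mimicked sequentially: every triple `(r, s, t)` stands for one processor $P_{rst}$, and the accumulation over `s` plays the role of the fan-in. This is a sketch assuming contiguous index blocks and 0-based indices; `split` and `block_matmul` are hypothetical names.

```python
def split(n, parts):
    """Split range(n) into `parts` contiguous, disjoint index blocks."""
    bounds = [n * p // parts for p in range(parts + 1)]
    return [range(bounds[p], bounds[p + 1]) for p in range(parts)]


def block_matmul(A, B, R, S, T):
    n, m, q = len(A), len(B), len(B[0])
    I, K, J = split(n, R), split(m, S), split(q, T)
    C = [[0.0] * q for _ in range(n)]
    for r in range(R):
        for t in range(T):
            for s in range(S):
                # Work of processor P_rst: the small product A_rs * B_st.
                # Adding it into C realizes the fan-in over s.
                for i in I[r]:
                    for j in J[t]:
                        C[i][j] += sum(A[i][k] * B[k][j] for k in K[s])
    return C


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2, 1], [0, 1, 1, 1]]
print(block_matmul(A, B, R=2, S=2, T=2))
```

Setting `S=1` removes the accumulation over `s` entirely, which is the communication-free special case.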

Special Case $S = 1$
$$c_{rt} = A_r B_t$$
In this case each processor $P_{rt}$ can compute its part of $C$, $c_{rt}$, independently without any communication.
Each processor needs a full block of rows of $A$, relative to the index set $I_r$, and a full block of columns of $B$, relative to the index set $J_t$, in order to compute $c_{rt}$ relative to the rows $I_r$ and the columns $J_t$.

Special Case $S = 1$
Especially with $n \cdot q$ processors, each processor has to compute one DOT product
$$c_{rt} = \sum_{k=1}^{m} a_{rk} b_{kt}$$
with $O(m)$ parallel time steps.
Fan-in by $m \cdot n \cdot q$ additional processors for all these DOT products reduces the number of parallel time steps to $O(\log m)$.

Granularity for BLAS

    BLAS          Operations   Formula       Memory       Granularity
    BLAS-1 AXPY   2n           αx + y        2n + 1       < 1
    BLAS-2 GEMV   2n^2         αAx + βy      n^2 + 3n     ≈ 2
    BLAS-3 GEMM   2n^3         αAB + βC      4n^2         n/2

BLAS-3 has the best ratio of operations to memory accesses!

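1D-Parallelization of A*B
1D: $p$ processors in a linear arrangement. Each processor gets the full $A$ and a column slice of $B$, and computes the related column slice of $C = AB$.
Communication: $N^2 p$ for $A$ and $N \cdot (N/p) \cdot p = N^2$ for $B$.
Granularity: $N^3 / (N^2 (1 + p)) = N / (1 + p)$.
    For i = 1 : N
        For j = 1 : N
            For k = 1 : N
                C_{i,j} = C_{i,j} + A_{i,k} * B_{k,j}
Blocking only in $j$, the columns of $B$!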

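2D-Parallelization of A*B
2D: $p$ processors in a square arrangement, $q := \sqrt{p}$. Each processor gets a row slice of $A$ and a column slice of $B$ (slices of width $N/q$), and computes a full subblock of $C = AB$. Compare: the case $S = 1$ before!
Communication: $N^2 p^{1/2}$ for $A$ and $N^2 p^{1/2}$ for $B$.
Granularity: $N^3 / (2 N^2 p^{0.5}) = N / (2 p^{0.5})$.
    For i = 1 : N
        For j = 1 : N
            For k = 1 : N
                C_{i,j} = C_{i,j} + A_{i,k} * B_{k,j}
Blocking in $i$ and $j$, the columns of $B$ and the rows of $A$!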

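3D-Parallelization of A*B
3D: $p$ processors in a cubic arrangement, $q = p^{1/3}$. Each processor gets a subblock of $A$ and a subblock of $B$, and computes a part of a subblock of $C = AB$. An additional fan-in collects the parts into the full subblock of $C$.
Communication: $N^2 p^{1/3}$ for $A$ and for $B$ ($= p \cdot N^2 / p^{2/3}$, i.e. $p$ times the block size); fan-in: $N^2 p^{1/3}$, the same.
Granularity: $N^3 / (3 N^2 p^{1/3}) = N / (3 p^{1/3})$.
    For i = 1 : N
        For j = 1 : N
            For k = 1 : N
                C_{i,j} = C_{i,j} + A_{i,k} * B_{k,j}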

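[Figure: 3D blocking. The product of a red block and a blue block gives one part of the grey block; these parts have to be added up to give the full grey block. Red and blue blocks are shown relative to the black outline.]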
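The three granularity formulas can be compared numerically. The sketch below simply evaluates the expressions from the slides (operation count $N^3$ with constant factors ignored, as on the slides); the function name `granularity` and the sample values of `N` and `p` are assumptions for illustration only.

```python
def granularity(N, p, scheme):
    """Operations per communicated matrix entry for 1D, 2D, or 3D blocking."""
    ops = N ** 3
    comm = {
        "1D": N ** 2 * (1 + p),            # full A to every processor, B once
        "2D": 2 * N ** 2 * p ** 0.5,       # row slices of A, column slices of B
        "3D": 3 * N ** 2 * p ** (1 / 3),   # blocks of A and B plus the fan-in
    }[scheme]
    return ops / comm


for scheme in ("1D", "2D", "3D"):
    print(scheme, round(granularity(1000, 64, scheme), 2))
```

For a fixed problem size the granularity shrinks like $1/p$, $1/\sqrt{p}$, and $1/p^{1/3}$ respectively, so the higher-dimensional blockings keep more computation per communicated entry as $p$ grows.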

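3. Gauss Elimination: Basic Properties
3.1 Linear Equations with Dense Matrices
System of linear equations:
$$\begin{aligned} a_{11} x_1 + \dots + a_{1n} x_n &= b_1 \\ &\;\;\vdots \\ a_{n1} x_1 + \dots + a_{nn} x_n &= b_n \end{aligned}$$
Solve $Ax = b$.
Generate simpler linear equations (matrices). Start with $A = A^{(1)}$ and transform $A$ into triangular form:
$$A = A^{(1)} \to A^{(2)} \to \dots \to A^{(n)} = U$$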

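Step $A^{(1)} \to A^{(2)}$: row $i$ := row $i$ $-\; (a_{i1}/a_{11}) \cdot$ row 1, for $i = 2, \dots, n$:
$$A^{(2)} = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
0 & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\
\vdots & \vdots & & \vdots \\
0 & a_{n2}^{(2)} & \cdots & a_{nn}^{(2)}
\end{pmatrix}$$
Step $A^{(2)} \to A^{(3)}$: row $i$ := row $i$ $-\; (a_{i2}^{(2)}/a_{22}^{(2)}) \cdot$ row 2, for $i = 3, \dots, n$:
$$A^{(3)} = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\
0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & a_{n3}^{(3)} & \cdots & a_{nn}^{(3)}
\end{pmatrix}$$
Step $A^{(3)} \to A^{(4)}$: row $i$ := row $i$ $-\; (a_{i3}^{(3)}/a_{33}^{(3)}) \cdot$ row 3, for $i = 4, \dots, n$, and so on, until
$$A^{(n)} = U = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
 & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\
 & & \ddots & \vdots \\
0 & & & a_{nn}^{(n)}
\end{pmatrix}$$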

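We assume that no pivoting is necessary (to simplify): $a_{kk}^{(k)} \ne 0$, or $|a_{kk}^{(k)}| \ge \rho > 0$, for $k = 1, 2, \dots, n$.
Algorithm:
    For k = 1 : n-1
        For i = k+1 : n
            l_{i,k} = a_{i,k} / a_{k,k};
        end
        For i = k+1 : n
            For j = k+1 : n
                a_{i,j} = a_{i,j} - l_{i,k} * a_{k,j};
            end
        end
    end
In practice: include pivoting and include the right-hand side $b$.
There is still a triangular system in $U$ to solve!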
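The algorithm above translates almost line by line into code. The following is a minimal sketch without pivoting, assuming nonzero pivots, 0-based indices, and separate output matrices; the function name `lu_kij` is illustrative and the example matrix is chosen so that all intermediate values are exact.

```python
def lu_kij(A):
    """Gaussian elimination without pivoting; returns (L, U) with A = L*U."""
    n = len(A)
    U = [row[:] for row in A]                       # work on a copy of A
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]             # weights of step k
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]        # row update (SAXPY)
            U[i][k] = 0.0                           # eliminated entry
    return L, U


A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
L, U = lu_kij(A)
print(L)
print(U)
```

A production code would add pivoting, as the slide notes; this sketch fails with a division by zero if a pivot vanishes.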

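Intermediate systems $A^{(k)}$, $k = 1, 2, \dots, n$, with $A = A^{(1)}$ and $U = A^{(n)}$:
$$A^{(k)} = \begin{pmatrix}
a_{11}^{(1)} & \cdots & \cdots & \cdots & a_{1n}^{(1)} \\
 & \ddots & & & \vdots \\
 & & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)} \\
 & & \vdots & & \vdots \\
0 & & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)}
\end{pmatrix}$$
The trailing block $a_{ij}^{(k)}$, $i, j \ge k$, is the part of $A^{(k)}$ that will be used and changed in the following computations.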

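Define auxiliary matrices:
$$L = \begin{pmatrix} 1 & & & 0 \\ l_{21} & 1 & & \\ \vdots & \ddots & \ddots & \\ l_{n1} & \cdots & l_{n,n-1} & 1 \end{pmatrix}, \qquad
L_k = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ l_{k+1,k} \\ \vdots \\ l_{n,k} \end{pmatrix} e_k^T \quad \text{(nonzero only in the k-th column)},$$
$$L = I + \sum_{k=1}^{n-1} L_k \quad \text{and} \quad U = A^{(n)}.$$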

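Each elimination step can be written in terms of the auxiliary matrices:
$$A^{(k+1)} = (I - L_k) A^{(k)}$$
$$U = A^{(n)} = (I - L_{n-1}) \cdots (I - L_1) A =: \tilde L A$$
with $U$ upper triangular and $\tilde L$ lower triangular.
Theorem 2: $\tilde L^{-1} = L$ and therefore $A = LU$.
Advantage: every further problem $Ax = b_j$ can be reduced to $(LU)x = b_j$. Solve two triangular problems, $Ly = b$ and $Ux = y$.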
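Once $L$ and $U$ are known, each right-hand side costs only two triangular solves. A minimal sketch, assuming $L$ has a unit diagonal and $U$ a nonzero diagonal; the function name `solve_lu` and the sample factors are illustrative.

```python
def solve_lu(L, U, b):
    """Solve (L*U) x = b by forward substitution L y = b, then back substitution U x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                       # L has ones on the diagonal
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x


# Factors of A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
L = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [4.0, 3.0, 1.0]]
U = [[2.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
print(solve_lu(L, U, [7.0, 19.0, 49.0]))   # prints [1.0, 2.0, 3.0]
```

The factorization costs $O(n^3)$ operations once, while each additional right-hand side only costs $O(n^2)$.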

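Theorem 2: $\tilde L^{-1} = L$ and therefore $A = LU$.
Proof: write $L_i = l_i e_i^T$, where the vector $l_i$ has nonzero entries only below position $i$. Then
$$L_i L_j = l_i\, (e_i^T l_j)\, e_j^T = 0 \quad \text{for } i \le j,$$
$$(I + L_j)(I - L_j) = I + L_j - L_j - L_j^2 = I \;\Rightarrow\; (I - L_j)^{-1} = I + L_j,$$
$$(I + L_1)(I + L_2) \cdots (I + L_{n-1}) = I + L_1 + L_2 + \dots + L_{n-1} = L,$$
$$\tilde L^{-1} = \big[ (I - L_{n-1}) \cdots (I - L_1) \big]^{-1} = (I - L_1)^{-1} \cdots (I - L_{n-1})^{-1} = (I + L_1) \cdots (I + L_{n-1}) = L.$$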

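3.2 Vectorization of GE
kij-form (standard form):
    For k = 1 : n-1
        For i = k+1 : n
            l_{i,k} = a_{i,k} / a_{k,k};
        end
        For i = k+1 : n
            For j = k+1 : n
                a_{i,j} = a_{i,j} - l_{i,k} * a_{k,j};
            end
        end
    end
Vector operation $\alpha x + y$: SAXPY in the rows $a_{i\cdot}$ and $a_{k\cdot}$. No GAXPY.
$U$ is computed rowwise, $L$ columnwise.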

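[Figure legend: the parts of L and U that are already computed remain unchanged and are not used anymore; the remaining part of A is newly computed (updated) in every step.]
The standard form is also called right-looking GE.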

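[Figure] First elimination step ($k = 1$): compute the first column of $L$, update $A$.

[Figure] Second step ($k = 2$): compute the second column of $L$, update $A^{(2)}$.

[Figure] Third step ($k = 3$): compute the third column of $L$, update $A^{(3)}$.

[Figure] $(n-1)$-st step: compute the $(n-1)$-th column of $L$; the update yields $U$.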

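Rules for the different (i,j,k) forms
In the following we again interchange the k, i, j loops.
Necessary conditions: $k < i$ and $k < j$.
Furthermore: the innermost index $i$, $j$, or $k$ determines whether the computation is done row-, column-, or blockwise.
The weights $l_{ik}$ have to be computed before they are used to eliminate the related entries.

ikj-form ($k < i$, $k < j$):
    For i = 2 : n
        For k = 1 : i-1
            l_{i,k} = a_{i,k} / a_{k,k};
            For j = k+1 : n
                a_{i,j} = a_{i,j} - l_{i,k} * a_{k,j};
            end
        end
    end
GAXPY in $a_{i\cdot}$.

[Figure legend: L already computed, not used any more; U already computed and partially used; row $i$ newly computed; remaining rows of A unchanged, not used.]
$L$ and $U$ are computed rowwise.
Compute $l_{i,1}$, then the SAXPY for the 1st and the i-th row; then $l_{i,2}$, and so on.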
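The ikj-form finishes one row of $L$ and $U$ at a time. The sketch below works in place on a copy of $A$, assuming nonzero pivots and 0-based indices, and stores the weights $l_{ik}$ below the diagonal and $U$ on and above it; this compact storage and the name `lu_ikj` are choices made for the example.

```python
def lu_ikj(A):
    """Row-oriented GE (ikj-form): L below the diagonal, U on and above it."""
    n = len(A)
    W = [row[:] for row in A]
    for i in range(1, n):
        for k in range(i):
            W[i][k] /= W[k][k]                  # weight l_ik
            for j in range(k + 1, n):
                W[i][j] -= W[i][k] * W[k][j]    # SAXPY on row i with row k
    return W


A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
print(lu_ikj(A))
```

All SAXPYs inside one pass of the outer loop update the same row `W[i]`, which is exactly the GAXPY structure named on the slide.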

[Figures: first step, second step, and $(n-1)$-st step of the ikj-form; after the last step $U$ is complete.]

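ijk-form ($k < i$, $k < j$):
    For i = 2 : n
        For j = 2 : i
            l_{i,j-1} = a_{i,j-1} / a_{j-1,j-1};
            For k = 1 : j-1
                a_{i,j} = a_{i,j} - l_{i,k} * a_{k,j};
            end
        end
        For j = i+1 : n
            For k = 1 : i-1
                a_{i,j} = a_{i,j} - l_{i,k} * a_{k,j};
            end
        end
    end
The first inner block is a DOT product for the left part of the new row, the second a DOT product for the right part.
Compute $l_{i,1}$ and update $a_{i,2}$; then compute $l_{i,2}$ and update $a_{i,3}$, accumulating the dot product, and so on.

jki-form ($k < i$, $k < j$):
    For j = 2 : n
        For s = j : n
            l_{s,j-1} = a_{s,j-1} / a_{j-1,j-1};
        end
        For k = 1 : j-1
            For i = k+1 : n
                a_{i,j} = a_{i,j} - l_{i,k} * a_{k,j};
            end
        end
    end
Vector operation $\alpha x + y$ on the new column of $A$: GAXPY in $a_{\cdot j}$.

Left-looking GE
[Figure legend: U already computed, not used; L already computed and used; remaining columns of A unchanged, not used; column $a_{\cdot j}$ newly computed.]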
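The left-looking (jki) variant touches only the current column and the columns of $L$ to its left. A minimal in-place sketch, assuming nonzero pivots, 0-based indices, and the same compact storage as a single working matrix; the name `lu_jki` is hypothetical.

```python
def lu_jki(A):
    """Column-oriented, left-looking GE (jki-form): L below, U on/above the diagonal."""
    n = len(A)
    W = [row[:] for row in A]
    for j in range(1, n):
        for s in range(j, n):
            W[s][j - 1] /= W[j - 1][j - 1]      # finish column j-1 of L
        for k in range(j):
            for i in range(k + 1, n):
                W[i][j] -= W[i][k] * W[k][j]    # SAXPY on column j with column k of L
    return W


A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
print(lu_jki(A))
```

Here every SAXPY of one outer pass updates the same column `j`, so the inner two loops form a GAXPY with columnwise access, which suits column-major storage.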

[Figures: first step, second step, and $(n-1)$-st step of left-looking GE; after the last step $U$ is complete.]

Overview

    Form                 kij      kji      ikj      ijk      jki      jik
    Access to A and U    row      column   row      column   column   column
    Access to L          -        column   -        row      column   row
    Computation of U     row      row      row      row      column   column
    Computation of L     column   column   row      row      column   column
    Vector operation     SAXPY    SAXPY    GAXPY    DOT      GAXPY    DOT
    Vector length        2n/3     2n/3     2n/3     n/3      2n/3     n/3

Vector length = average of the occurring vector lengths.
The optimal form depends on the storage of the matrices and on the vector length.