Restructuring the Symmetric QR Algorithm for Performance. Field Van Zee, Gregorio Quintana-Orti, Robert van de Geijn

1 Restructuring the Symmetric QR Algorithm for Performance
Field Van Zee, Gregorio Quintana-Orti, Robert van de Geijn

2 For details: Field Van Zee, Robert van de Geijn, and Gregorio Quintana-Orti. Restructuring the QR algorithm for Performance. ACM TOMS. Accepted (pending minor modifications).
This work was supported by
• UTAustin-Portugal Colab program
• Microsoft
• NSF under grants OCI , CCF , and OCI
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

3 Overview
• 50+ years of progress
• The hidden costs of MRRR and D&C
• QR algorithm basics
• Accumulating and applying rotations
• Performance
• Conclusion

5 Symmetric EVD/SVD: 50+ Years of Progress
• Recent progress focuses a lot on the mathematics side
  - Divide & Conquer (Cuppen's) algorithm (D&C)
  - Method of Relatively Robust Representations (MRRR)
• Occasional revisit of Jacobi's method
• Progress on QR has been for the non-symmetric problem
  - Aggressive Early Deflation
  - Multishift

6 Two Insights
• WHEN COMPUTING THE DENSE EVD (all eigenvalues and vectors), D&C and MRRR have a hidden O(n^3) cost
• QR becomes competitive if rotations are applied in batches
  - Classical QR: cast in terms of vector-vector operations
  - Batched application: cast in terms of computation that reuses data in cache, like matrix-matrix operations

8 The Hidden Cost of D&C and MRRR
• Start with symmetric, dense A
• Reduce to tridiagonal form
• Compute Spectral Decomposition of T
• Backtransform
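The three steps listed on this slide can be written out in the standard way. This is a sketch of the usual formulation, not a transcription of the slide: Q_A follows the notation used later in the deck, while Q_T, D, and Q are names assumed here for illustration.

```latex
\begin{align*}
A &= Q_A \, T \, Q_A^T
  && \text{(reduce to tridiagonal form)} \\
T &= Q_T \, D \, Q_T^T
  && \text{(spectral decomposition of } T\text{)} \\
A &= (Q_A Q_T) \, D \, (Q_A Q_T)^T = Q \, D \, Q^T,
  \quad Q = Q_A Q_T
  && \text{(backtransformation)}
\end{align*}
```

Under this reading, D&C and MRRR compute Q_T first and then apply Q_A to it, whereas the QR algorithm forms Q_A up front and updates it while T is being diagonalized.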

9-12 Reduction to Tridiagonal Form (figure sequence)

13-16 Backtransformation (figure sequence)

17 Cost of QR algorithm
• Start with symmetric, dense A
• Reduce to tridiagonal form
• Form Q_A
• Compute Spectral Decomposition of T while updating Q_A

18-22 Form Q_A (figure sequence)

23 Cost
• Backtransformation: 2 n^3 flops
• Form Q_A: 4/3 n^3 flops
• Hidden cost of MRRR and D&C: 2/3 n^3 flops
EVD OF A DENSE MATRIX!!! (All eigenvalues and eigenvectors)
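The 2/3 n^3 figure appears to be the difference between the two counts above it: backtransformation (needed by MRRR and D&C) costs 2 n^3, while forming Q_A (needed by QR) costs 4/3 n^3. A minimal sketch of that arithmetic follows; the function name `leading_flops` and the sample size n = 3000 are illustrative assumptions, and only leading-order terms are counted.

```python
from fractions import Fraction

def leading_flops(n):
    """Leading-order flop counts quoted on the slide, as exact fractions."""
    backtransform = Fraction(2) * n**3     # apply Q_A to the eigenvectors of T
    form_q_a = Fraction(4, 3) * n**3       # build Q_A explicitly
    hidden = backtransform - form_q_a      # extra work paid by MRRR and D&C
    return backtransform, form_q_a, hidden

n = 3000  # illustrative matrix size, not taken from the slides
back, form, hidden = leading_flops(n)
print(f"n={n}: backtransform={float(back):.3e}, "
      f"form Q_A={float(form):.3e}, difference={float(hidden):.3e}")
```

The difference is 2/3 n^3 for any n, which matches the third bullet.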

25-38 Classical QR algorithm (figure sequence on T)

40-43 Accumulating Rotations (LAPACK) (figure sequence on T)

44 Accumulating Rotations (LAPACK)
• Apply one sweep's worth of rotations at a time
• Makes the application like level-2 BLAS
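Applying one sweep of rotations to the accumulated matrix can be sketched as below. This is a plain-Python illustration, not LAPACK code: the names `apply_rotation` and `apply_sweep`, the sign convention of the rotation, and the sample angles are all assumptions. Each sweep streams over every column of Q once, so it performs O(n^2) work on O(n^2) data, which is why it behaves like a level-2 BLAS operation.

```python
import math

def apply_rotation(Q, j, c, s):
    """Right-multiply Q (a list of rows) by a Givens rotation on columns j, j+1."""
    for row in Q:
        a, b = row[j], row[j + 1]
        row[j] = c * a - s * b          # assumed sign convention
        row[j + 1] = s * a + c * b

def apply_sweep(Q, sweep):
    """Apply one sweep: rotation j acts on columns j and j+1, left to right."""
    for j, (c, s) in enumerate(sweep):
        apply_rotation(Q, j, c, s)

n = 4
Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
angles = [0.3, -1.1, 0.7]               # arbitrary illustrative angles
sweep = [(math.cos(t), math.sin(t)) for t in angles]
apply_sweep(Q, sweep)

# Q should stay orthogonal, so its Gram matrix Q^T Q is the identity.
gram = [[sum(Q[r][i] * Q[r][j] for r in range(n)) for j in range(n)]
        for i in range(n)]
```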

45-56 Accumulating Rotations (libflame) (figure sequence on T)

57-58 Applying Rotations (libflame) (figure sequence)

59 Optimization
• Applying a batch of Givens rotations: O(n^2 b) operations on O(n^2) data
• Can attain level-3 BLAS performance
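One way to see why a batch of b sweeps can be applied with better data reuse is that rotations acting on disjoint column pairs commute, so the batch can be reordered. The sketch below compares sweep-by-sweep application with a wavefront ordering. It is an assumption-laden illustration and not libflame's kernel: the function names, the indexing `G[k][j]` (sweep k, columns j and j+1), and the random test data are hypothetical. Pure Python shows no cache benefit; the sketch only checks that the reordering yields the same matrix.

```python
import math
import random

def apply_rotation(Q, j, c, s):
    """Right-multiply Q (a list of rows) by a Givens rotation on columns j, j+1."""
    for row in Q:
        a, b = row[j], row[j + 1]
        row[j] = c * a - s * b
        row[j + 1] = s * a + c * b

def apply_sweep_by_sweep(Q, G):
    """Reference order: finish sweep k completely before starting sweep k+1."""
    for sweep in G:
        for j, (c, s) in enumerate(sweep):
            apply_rotation(Q, j, c, s)

def apply_wavefront(Q, G):
    """Wavefront order: visit rotations with j + k = w together, increasing k.

    Consecutive rotations in a wave share a column, so recently touched
    columns are reused while they are still likely to be in cache.
    """
    b, m = len(G), len(G[0])
    for w in range(m + b - 1):
        for k in range(b):
            j = w - k
            if 0 <= j < m:
                c, s = G[k][j]
                apply_rotation(Q, j, c, s)

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def random_sweep(m):
    angles = [random.uniform(-3.0, 3.0) for _ in range(m)]
    return [(math.cos(t), math.sin(t)) for t in angles]

random.seed(0)
n, b = 6, 3                                  # illustrative sizes
G = [random_sweep(n - 1) for _ in range(b)]
Q1, Q2 = identity(n), identity(n)
apply_sweep_by_sweep(Q1, G)
apply_wavefront(Q2, G)
max_diff = max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))
```

Both orderings perform the same b(n-1) rotations, each costing O(n) work, which gives the O(n^2 b) operation count on the slide.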

61 Predicted Performance
Figures: Conventional QR / MRRR (real); Restructured QR / MRRR (real)

62 Predicted performance (EVD)
Figures: Conventional QR / MRRR (complex); Restructured QR / MRRR (complex)

63 Observed Performance
• Target architecture:
  - Single core of a Dell PowerEdge R900 server
  - 16 megabyte L2 cache/core
  - Single core peak of FLOPS

64 Application of Givens rotations Theoretical Peak for dgemm dgemm Theoretical peak for Givens Kernel Observed

65 EVD performance (relative to netlib MRRR)
Figure legend: MKL MRRR; Netlib MRRR; Restructured QR

67 libflame SVD Performance
Figure legend: libflame SVD; Netlib via DC

68 EVD Parallel Performance (24 cores)
Figure: Performance on clarksville (24 cores). Axes: Standardized FLOPS vs. Matrix size.
Legend: LAPACK QR; LAPACK DC; LAPACK MRRR; Ideal MRRR; libflame var1; libflame var2; var2a (vertical wspace in backtrans.); var2r (outside BLAS parallelism)
Annotations: Infinitely fast MRRR tridiag EVD; LAPACK MRRR; restructured QR

69 Is your favorite graph missing?
• The paper has an electronic appendix with tons of performance graphs.

71 Conclusion
• The QR algorithm lives!
• Future directions:
  - Parallelization
  - (multi)GPU
  - Aggressive early deflation
