16 Data Layout Assistant
Fortran program + partial data layout specifications
  -> Data Layout Assistant
       regular problems
       dynamic remapping allowed
       invoked only a few times
       not part of the compiler
       can use expensive techniques
  -> HPF program with total data layout specifications
  -> target HPF compiler
  -> target machine object code

42
      REAL c(N, N), a(N, N), b(N, N)
      ! READ (c, a, b)
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        DO j = 2, N
          DO i = 1, N
            c(i, j) = c(i, j) - c(i, j - 1) * a(i, j) / b(i, j - 1)
            b(i, j) = b(i, j) - a(i, j) * a(i, j) / b(i, j - 1)
          ENDDO
        ENDDO
        DO i = 1, N
          c(i, N) = c(i, N) / b(i, N)
        ENDDO
        DO j = N - 1, 1, -1
          DO i = 2, N
            c(i, j) = ( c(i, j) - a(i, j + 1) * c(i, j + 1) ) / b(i, j)
          ENDDO
        ENDDO
        ! Downward and upward sweeps along columns
        DO j = 1, N
          DO i = 2, N
            c(i, j) = c(i, j) - c(i - 1, j) * a(i, j) / b(i - 1, j)
            b(i, j) = b(i, j) - a(i, j) * a(i, j) / b(i - 1, j)
          ENDDO
        ENDDO
        DO j = 1, N
          c(N, j) = c(N, j) / b(N, j)
        ENDDO
        DO j = 1, N
          DO i = N - 1, 1, -1
            c(i, j) = ( c(i, j) - a(i + 1, j) * c(i + 1, j) ) / b(i, j)
          ENDDO
        ENDDO
      ENDDO
      ! WRITE (c, b)

43
Static column-wise layout:
      REAL c(N, N), a(N, N), b(N, N)
!HPF$ TEMPLATE X(N, N)
!HPF$ ALIGN c(i, j), a(i, j), b(i, j) WITH X(i, j)
!HPF$ DISTRIBUTE X(*, BLOCK)
      ...
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        ...
        ! Downward and upward sweeps along columns
        ...
      ENDDO

Dynamic row and column-wise layout:
      REAL c(N, N), a(N, N), b(N, N)
!HPF$ TEMPLATE X(N, N)
!HPF$ DYNAMIC X
!HPF$ ALIGN c(i, j), a(i, j), b(i, j) WITH X(i, j)
!HPF$ DISTRIBUTE X(*, BLOCK)
      ...
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        ...
!HPF$ REDISTRIBUTE X(BLOCK, *)
        ! Downward and upward sweeps along columns
        ...
!HPF$ REDISTRIBUTE X(*, BLOCK)
      ENDDO
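To make the two distributions concrete, here is a minimal Python sketch of which processor owns a template element under (*, BLOCK) and (BLOCK, *). The function names `block_owner`, `owner`, and `crosses_processors` are hypothetical, and the block size ceil(N / P) is assumed to follow the usual HPF BLOCK rule; none of this appears in the slides.

```python
from math import ceil


def block_owner(index, extent, nprocs):
    """Owner (0-based) of a 1-based index under BLOCK.

    Assumes the block size is ceil(extent / nprocs).
    """
    block = ceil(extent / nprocs)
    return (index - 1) // block


def owner(i, j, n, nprocs, layout):
    """Owner of template element (i, j) for "(*,BLOCK)" or "(BLOCK,*)"."""
    if layout == "(*,BLOCK)":   # second dimension is split: owner depends on j
        return block_owner(j, n, nprocs)
    if layout == "(BLOCK,*)":   # first dimension is split: owner depends on i
        return block_owner(i, n, nprocs)
    raise ValueError(f"unknown layout: {layout}")


def crosses_processors(i, j, di, dj, n, nprocs, layout):
    """True if element (i + di, j + dj) lives on a different processor than (i, j)."""
    here = owner(i, j, n, nprocs, layout)
    there = owner(i + di, j + dj, n, nprocs, layout)
    return here != there
```

With N = 8 and 4 processors, a reference such as c(i, j - 1) at j = 3 crosses a processor boundary under (*, BLOCK) but stays local under (BLOCK, *), since only the distributed dimension determines the owner.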

45 [Figure: the program from slide 42 annotated with its phase control flow graph (PCFG). Phase 1 is the READ of c, a, b. Phases 2 to 7 are the six loop nests inside DO iter = 1, max, referencing (c, a, b), (c, b), (c, a, b), (c, a, b), (c, b), and (c, a, b) in that order. Phase 8 is the WRITE of c, b.]

51
      DO i = 1, n
        y(i, 1) = x(i, 1) + x(1, i)
      ENDDO

CAG nodes: x1, x2, y1, y2 (the dimensions of x and y). A 0-1 variable such as $y_{dk}$ is 1 when dimension d of y is placed in partition k.

NODE constraints
  Each node is in exactly one partition:
    $y_{11} + y_{12} = 1$, $y_{21} + y_{22} = 1$, $x_{11} + x_{12} = 1$, $x_{21} + x_{22} = 1$
  Two dimensions of the same array must not be in the same partition:
    $y_{11} + y_{21} \le 1$, $y_{12} + y_{22} \le 1$, $x_{11} + x_{21} \le 1$, $x_{12} + x_{22} \le 1$

EDGE constraints
  An edge is switched on if and only if its source and sink are switched on. This is expressed by IN-constraints and OUT-constraints that bound each edge variable by the node variables of its sink and source.
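The node constraints can be checked by brute force on this small example. The following Python sketch enumerates all 0-1 assignments for the four CAG nodes and two partitions and keeps those that satisfy "each node is in exactly one partition" and "two dimensions of the same array must not be in the same partition". The names `NODES`, `PARTITIONS`, `feasible`, and `enumerate_feasible` are hypothetical, and edge constraints and costs are deliberately left out, so this is only an illustration of the feasible set, not the slides' full formulation.

```python
from itertools import product

NODES = ["x1", "x2", "y1", "y2"]   # array dimensions (CAG nodes)
PARTITIONS = [1, 2]


def feasible(assign):
    """assign maps (node, partition) to 0 or 1."""
    # Each node is in exactly one partition.
    for node in NODES:
        if sum(assign[node, k] for k in PARTITIONS) != 1:
            return False
    # Two dimensions of the same array must not share a partition.
    for array in ("x", "y"):
        for k in PARTITIONS:
            if assign[array + "1", k] + assign[array + "2", k] > 1:
                return False
    return True


def enumerate_feasible():
    """Return every feasible assignment as a dict node -> partition."""
    keys = [(node, k) for node in NODES for k in PARTITIONS]
    solutions = []
    for bits in product((0, 1), repeat=len(keys)):
        assign = dict(zip(keys, bits))
        if feasible(assign):
            solutions.append({n: k for (n, k), v in assign.items() if v})
    return solutions
```

Of the 256 possible assignments, only four survive: each array independently places its two dimensions in the two partitions in one of two orders.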

52 [Figure: alternative groupings of the array dimensions a1, a2, b1, b2, including {a1 b1, a2 b2}, {a1 b2, a2 b1}, and the single set {a1 a2 b1 b2}.]

66
      TEMPLATE PROG_TEMPLATE(N, N, N)
      ALIGN A(I, J, K) WITH PROG_TEMPLATE(I, J, K)

      do 10 k = 1, N
      do 10 j = 1, N
      do 10 i = j, N
 10   A(i, j, k) =...

[Figure: array A drawn with axes I, J, K.]

67 Candidate Layout Search Spaces
[Figure: the PCFG (phases 1 to 8 inside DO iter = 1, max), each phase i with a candidate layout search space P i.]

      TEMPLATE PROG(N, N)
      ALIGN c(I, J), a(I, J), b(I, J) WITH PROG(I, J)

Candidate layouts shown:
      PROCESSORS PROCS(8)
      DISTRIBUTE PROG(BLOCK, *) ONTO PROCS

      PROCESSORS PROCS(8)
      DISTRIBUTE PROG(*, BLOCK) ONTO PROCS

      PROCESSORS PROCS(2, 4)
      DISTRIBUTE PROG(BLOCK, BLOCK) ONTO PROCS

      PROCESSORS PROCS(4, 2)
      DISTRIBUTE PROG(BLOCK, BLOCK) ONTO PROCS

78 [Figure: the PCFG next to its data layout graph (DLG). Each phase P1 to P8 has a row layout and a column layout candidate node. Edges between candidates are weighted with remapping costs given as multiples of T, some scaled by the iteration count (for example 3T, 3T (max-1), 2T max, T max). Legend: row layout; column layout; remapping of c, a, and b; remapping of a; remapping of c and b.]
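As an illustration of what selecting one candidate per phase means, the Python sketch below picks the cheapest sequence of layouts for a simple chain of phases by dynamic programming. It is a simplification under stated assumptions: phases form a straight line with no loop back edge, all arrays are remapped together, and every cost is a hypothetical input. The function name `best_layouts` is invented for this example and is not the selection method described in the slides.

```python
def best_layouts(node_cost, remap_cost):
    """Choose one layout per phase along a chain of phases.

    node_cost[p][l]: hypothetical cost of running phase p under layout l.
    remap_cost(a, b): hypothetical cost of remapping from layout a to layout b.
    Returns (total_cost, list_of_chosen_layouts).
    """
    layouts = list(node_cost[0])
    # best[l] = cheapest (cost, path) that ends the current phase in layout l
    best = {l: (node_cost[0][l], [l]) for l in layouts}
    for costs in node_cost[1:]:
        nxt = {}
        for l in layouts:
            total, path = min(
                ((best[p][0] + remap_cost(p, l), best[p][1]) for p in layouts),
                key=lambda t: t[0],
            )
            nxt[l] = (total + costs[l], path + [l])
        best = nxt
    return min(best.values(), key=lambda t: t[0])
```

For instance, with made-up phase costs of (row 1, col 10), (row 1, col 10), (row 10, col 1) and a remapping cost of 3 whenever the layout changes, the cheapest choice keeps the row layout for two phases and remaps once to the column layout.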

83 [Figure: three phases, each of the form p = { v, v, v }, with the eight truth assignments FFF, FFT, FTF, FTT, TFF, TFT, TTF, TTT as candidates. Legend: cost of layout is 0 or 1; cost of remapping is 0 or 1.]

86 [Figure: a PCFG with a loop (DO i = 1, N) over phases 1 to 3, with entry and exit candidate layouts; the loop structure DLG, whose edges carry the frequencies N and N-1; and the resulting loop summary DLG.]

87 [Figure: a PCFG with a branch (IF ... FI) over phases 1 to 4 with candidate sets P 1 to P 4, with entry and exit candidate layouts; the branch structure DLG; and the resulting branch summary DLG.]

88 [Figure: a PCFG with an entry node, phases 1 to 3 with candidate sets P 1 to P 3, and an exit node; the outermost DLG.]

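90 Each phase has one 0-1 variable per candidate layout (row layout, column layout), and exactly one candidate is selected per phase:
  P 1 (c, a, b): $x_{11} + x_{12} = 1$
  P 2 (c, a, b): $x_{21} + x_{22} = 1$
  P 3 (c, b): $x_{31} + x_{32} = 1$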

92 [Figure: edge constraints around candidate $x_{41}$ of P 4, with incoming edges from the candidates of P 2 ($x_{21}$, $x_{22}$) and P 3 ($x_{31}$, $x_{32}$) and outgoing edges to the candidates of P 5 ($x_{51}$, $x_{52}$). IN constraints and OUT constraints tie the edge variables to $x_{41}$, shown in a compact formulation and in a disaggregated formulation. Legend: row layout; column layout; remapping of c, a, and b; remapping of a; remapping of c and b.]

93 [Figure: edge constraint for the remapping of a between candidate $x_{22}$ of P 2 and candidate $x_{41}$ of P 4. Legend: row layout; column layout; remapping of a.]

104 [Figure: Training Set for SHIFT Patterns (8 Processors). Execution time in microseconds versus message size in bytes. Curves: high latency, unit; high latency, non unit; low latency, non unit; low latency, unit.]

105 [Figure: Training Set for Broadcast Pattern with Unit Stride. Execution time in microseconds versus message size in bytes, one curve per processor count (2, 4, 8, and 16 procs among them).]

106 [Figure: measured time versus estimated time in seconds for row, column, and transpose; double, 16 processors, 512 x 512.]

108 [Figure: measured and estimated execution time in seconds versus number of processors, for static row, static column, and remapped layouts.]

109 [Figure: measured and estimated execution time in seconds versus number of processors, for static 1. dimension, static 2. dimension, static 3. dimension, and remapped layouts.]

110 [Figure: execution time in seconds versus number of processors for static row, static column, and remapped layouts. Panels: measured; estimated (pre-determined branch probabilities); estimated (default branch probabilities).]

112 [Figure: measured and estimated execution time in seconds versus number of processors, for static row and static column layouts.]


More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #19 3/28/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 From last class PRAM

More information

Static Analysis of Programs: A Heap-Centric View

Static Analysis of Programs: A Heap-Centric View Static Analysis of Programs: A Heap-Centric View (www.cse.iitb.ac.in/ uday) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay 5 April 2008 Part 1 Introduction ETAPS

More information

Analyses of Energy Consumption Changes by Loop Transformations in Log Blocks-based FTL

Analyses of Energy Consumption Changes by Loop Transformations in Log Blocks-based FTL Analyses of Energy Consumption Changes by Loop Transformations in Log Blocks-based FTL Memory Architecture and Organization Workshop 2013 (MeAOW 2013) 2013. 10. 3 Joon-Young Paik*, Tae-Sun Chung**, Eun-Sun

More information

CS 4407 Algorithms Lecture 3: Iterative and Divide and Conquer Algorithms

CS 4407 Algorithms Lecture 3: Iterative and Divide and Conquer Algorithms CS 4407 Algorithms Lecture 3: Iterative and Divide and Conquer Algorithms Prof. Gregory Provan Department of Computer Science University College Cork 1 Lecture Outline CS 4407, Algorithms Growth Functions

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

Matrix representation of a linear map

Matrix representation of a linear map Matrix representation of a linear map As before, let e i = (0,..., 0, 1, 0,..., 0) T, with 1 in the i th place and 0 elsewhere, be standard basis vectors. Given linear map f : R n R m we get n column vectors

More information

4th year Project demo presentation

4th year Project demo presentation 4th year Project demo presentation Colm Ó héigeartaigh CASE4-99387212 coheig-case4@computing.dcu.ie 4th year Project demo presentation p. 1/23 Table of Contents An Introduction to Quantum Computing The

More information

3 Matrix Algebra. 3.1 Operations on matrices

3 Matrix Algebra. 3.1 Operations on matrices 3 Matrix Algebra A matrix is a rectangular array of numbers; it is of size m n if it has m rows and n columns. A 1 n matrix is a row vector; an m 1 matrix is a column vector. For example: 1 5 3 5 3 5 8

More information

Lecture 19. Architectural Directions

Lecture 19. Architectural Directions Lecture 19 Architectural Directions Today s lecture Advanced Architectures NUMA Blue Gene 2010 Scott B. Baden / CSE 160 / Winter 2010 2 Final examination Announcements Thursday, March 17, in this room:

More information

Lecture 2: Divide and conquer and Dynamic programming

Lecture 2: Divide and conquer and Dynamic programming Chapter 2 Lecture 2: Divide and conquer and Dynamic programming 2.1 Divide and Conquer Idea: - divide the problem into subproblems in linear time - solve subproblems recursively - combine the results in

More information

Organization of a Modern Compiler. Middle1

Organization of a Modern Compiler. Middle1 Organization of a Modern Compiler Source Program Front-end syntax analysis + type-checking + symbol table High-level Intermediate Representation (loops,array references are preserved) Middle1 loop-level

More information

REDISTRIBUTION OF TENSORS FOR DISTRIBUTED CONTRACTIONS

REDISTRIBUTION OF TENSORS FOR DISTRIBUTED CONTRACTIONS REDISTRIBUTION OF TENSORS FOR DISTRIBUTED CONTRACTIONS THESIS Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of the Ohio State University By

More information

A Parallel Algorithm for Computing the Extremal Eigenvalues of Very Large Sparse Matrices*

A Parallel Algorithm for Computing the Extremal Eigenvalues of Very Large Sparse Matrices* A Parallel Algorithm for Computing the Extremal Eigenvalues of Very Large Sparse Matrices* Fredrik Manne Department of Informatics, University of Bergen, N-5020 Bergen, Norway Fredrik. Manne@ii. uib. no

More information

Cache-Oblivious Algorithms

Cache-Oblivious Algorithms Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program

More information

Dependence Analysis. Dependence Examples. Last Time: Brief introduction to interprocedural analysis. do I = 2, 100 A(I) = A(I-1) + 1 enddo

Dependence Analysis. Dependence Examples. Last Time: Brief introduction to interprocedural analysis. do I = 2, 100 A(I) = A(I-1) + 1 enddo Dependence Analysis Dependence Examples Last Time: Brief introduction to interprocedural analysis Today: Optimization for parallel machines and memory hierarchies Dependence analysis Loop transformations

More information

Communication avoiding parallel algorithms for dense matrix factorizations

Communication avoiding parallel algorithms for dense matrix factorizations Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013

More information

Finite Difference Methods (FDMs) 1

Finite Difference Methods (FDMs) 1 Finite Difference Methods (FDMs) 1 1 st - order Approxima9on Recall Taylor series expansion: Forward difference: Backward difference: Central difference: 2 nd - order Approxima9on Forward difference: Backward

More information

Cache-Oblivious Algorithms

Cache-Oblivious Algorithms Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program

More information

HPMPC - A new software package with efficient solvers for Model Predictive Control

HPMPC - A new software package with efficient solvers for Model Predictive Control - A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model

More information

Math 552 Scientific Computing II Spring SOLUTIONS: Homework Set 1

Math 552 Scientific Computing II Spring SOLUTIONS: Homework Set 1 Math 552 Scientific Computing II Spring 21 SOLUTIONS: Homework Set 1 ( ) a b 1 Let A be the 2 2 matrix A = By hand, use Gaussian elimination with back c d substitution to obtain A 1 by solving the two

More information

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul

More information

Dataflow Analysis - 2. Monotone Dataflow Frameworks

Dataflow Analysis - 2. Monotone Dataflow Frameworks Dataflow Analysis - 2 Monotone dataflow frameworks Definition Convergence Safety Relation of MOP to MFP Constant propagation Categorization of dataflow problems DataflowAnalysis 2, Sp06 BGRyder 1 Monotone

More information

A Review of Matrix Analysis

A Review of Matrix Analysis Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value

More information

A Matrix Method for Efficient Computation of Bernstein Coefficients

A Matrix Method for Efficient Computation of Bernstein Coefficients A Matrix Method for Efficient Computation of Bernstein Coefficients Shashwati Ray Systems and Control Engineering Group, Room 114A, ACRE Building, Indian Institute of Technology, Bombay, India 400 076

More information