- Rhoda Marshall
- 6 years ago
Transcription
16 Fortran program + partial data layout specifications
-> Data Layout Assistant
   - regular problems
   - dynamic remapping allowed
   - invoked only a few times
   - not part of the compiler
   - can use expensive techniques
-> HPF program with total data layout specifications
-> target HPF compiler
-> target machine object code
42
      REAL c(N, N), a(N, N), b(N, N)
      ! READ (c, a, b)
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        DO j = 2, N
          DO i = 1, N
            c(i, j) = c(i, j) - c(i, j - 1) * a(i, j) / b(i, j - 1)
            b(i, j) = b(i, j) - a(i, j) * a(i, j) / b(i, j - 1)
          ENDDO
        ENDDO
        DO i = 1, N
          c(i, N) = c(i, N) / b(i, N)
        ENDDO
        DO j = N - 1, 1, -1
          DO i = 2, N
            c(i, j) = ( c(i, j) - a(i, j + 1) * c(i, j + 1) ) / b(i, j)
          ENDDO
        ENDDO
        ! Downward and upward sweeps along columns
        DO j = 1, N
          DO i = 2, N
            c(i, j) = c(i, j) - c(i - 1, j) * a(i, j) / b(i - 1, j)
            b(i, j) = b(i, j) - a(i, j) * a(i, j) / b(i - 1, j)
          ENDDO
        ENDDO
        DO j = 1, N
          c(N, j) = c(N, j) / b(N, j)
        ENDDO
        DO j = 1, N
          DO i = N - 1, 1, -1
            c(i, j) = ( c(i, j) - a(i + 1, j) * c(i + 1, j) ) / b(i, j)
          ENDDO
        ENDDO
      ENDDO
      ! WRITE (c, b)
43 Static column-wise layout:
      REAL c(N, N), a(N, N), b(N, N)
!HPF$ TEMPLATE X(N, N)
!HPF$ ALIGN c(i, j), a(i, j), b(i, j) WITH X(i, j)
!HPF$ DISTRIBUTE X(*, BLOCK)
      ...
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        ...
        ! Downward and upward sweeps along columns
        ...
      ENDDO

Dynamic row and column-wise layout:
      REAL c(N, N), a(N, N), b(N, N)
!HPF$ TEMPLATE X(N, N)
!HPF$ DYNAMIC X
!HPF$ ALIGN c(i, j), a(i, j), b(i, j) WITH X(i, j)
!HPF$ DISTRIBUTE X(*, BLOCK)
      ...
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        ...
!HPF$   REDISTRIBUTE X(BLOCK, *)
        ! Downward and upward sweeps along columns
        ...
!HPF$   REDISTRIBUTE X(*, BLOCK)
      ENDDO
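The trade-off between the two distributions can be made concrete by counting, for each sweep direction, how many loop-carried reads cross a processor boundary. The sketch below is a minimal Python illustration and not part of the slides. It assumes HPF-style BLOCK distribution with block size ceil(N / P), and the helper names (`block_owner`, `owner`, `remote_reads`) are invented for this example.

```python
from math import ceil


def block_owner(index, extent, nprocs):
    """Processor owning a 1-based index under BLOCK distribution
    (assumed block size: ceil(extent / nprocs))."""
    block = ceil(extent / nprocs)
    return (index - 1) // block


def owner(i, j, n, nprocs, layout):
    """Owner of element (i, j) of an n x n array for a 1-D processor array."""
    if layout == "(BLOCK, *)":   # first dimension is distributed
        return block_owner(i, n, nprocs)
    if layout == "(*, BLOCK)":   # second dimension is distributed
        return block_owner(j, n, nprocs)
    raise ValueError(f"unknown layout: {layout}")


def remote_reads(n, nprocs, layout, sweep):
    """Count elements whose loop-carried source lives on another processor.

    sweep == "row":    element (i, j) reads (i, j - 1)
    sweep == "column": element (i, j) reads (i - 1, j)
    """
    count = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            si, sj = (i, j - 1) if sweep == "row" else (i - 1, j)
            if si < 1 or sj < 1:
                continue  # no source element at the array boundary
            if owner(i, j, n, nprocs, layout) != owner(si, sj, n, nprocs, layout):
                count += 1
    return count


if __name__ == "__main__":
    for layout in ("(BLOCK, *)", "(*, BLOCK)"):
        for sweep in ("row", "column"):
            print(layout, sweep, remote_reads(8, 4, layout, sweep))
```

Under these assumptions, each sweep direction is free of cross-processor reads under exactly one of the two distributions, which is why remapping between the sweeps is a candidate worth costing against a static layout.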
45 [Figure: the code of slide 42 annotated with its PCFG. Nodes 1 to 8 are labeled with the arrays they reference: nodes 1, 2, 4, 5, and 7 reference c, a, b; nodes 3, 6, and 8 reference c, b. Nodes 2 to 7 lie inside the DO iter = 1, max loop.]
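51 Example loop:
      DO i = 1, n
        y(i, 1) = x(i, 1) + x(1, i)
      ENDDO

CAG nodes: one per array dimension ($x_1$, $x_2$, $y_1$, $y_2$). The 0-1 variable $x_{dk}$ is 1 iff dimension $d$ of x is placed in partition $k$; likewise $y_{dk}$.

NODE constraints
- Each node is in exactly one partition:
  $y_{11} + y_{12} = 1$, $y_{21} + y_{22} = 1$, $x_{11} + x_{12} = 1$, $x_{21} + x_{22} = 1$
- Two dimensions of the same array must not be in the same partition:
  $y_{11} + y_{21} \le 1$, $y_{12} + y_{22} \le 1$, $x_{11} + x_{21} \le 1$, $x_{12} + x_{22} \le 1$

EDGE constraints
- An edge is switched on iff the source and sink are switched on.
- IN-constraints bound the edge variables entering a sink by that sink's node variable (for example $y_{12}$).
- OUT-constraints bound the edge variables leaving a source by that source's node variable (for example $x_{11}$).
[The individual edge inequalities are garbled in the transcription.]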
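The NODE constraints are small enough to check by brute force. The following Python sketch is an illustration added here, not code from the slides. It enumerates every 0-1 assignment for two arrays with two dimensions each and two partitions, and keeps those that satisfy "each node is in exactly one partition" and "two dimensions of the same array are not in the same partition". The function name `feasible_assignments` and the variable encoding `(array, dim, partition)` are assumptions made for the example.

```python
from itertools import product

ARRAYS = ("x", "y")
DIMS = (1, 2)
PARTS = (1, 2)


def feasible_assignments():
    """Return every assignment of array dimensions to partitions that
    satisfies the two NODE constraint families."""
    variables = [(a, d, k) for a in ARRAYS for d in DIMS for k in PARTS]
    solutions = []
    for bits in product((0, 1), repeat=len(variables)):
        v = dict(zip(variables, bits))
        # Each node (array dimension) is in exactly one partition.
        one_partition = all(
            sum(v[a, d, k] for k in PARTS) == 1 for a in ARRAYS for d in DIMS
        )
        # Two dimensions of the same array never share a partition.
        no_sharing = all(
            sum(v[a, d, k] for d in DIMS) <= 1 for a in ARRAYS for k in PARTS
        )
        if one_partition and no_sharing:
            solutions.append({(a, d): k for (a, d, k), bit in v.items() if bit})
    return solutions


if __name__ == "__main__":
    for solution in feasible_assignments():
        print(solution)
```

Only four of the 256 candidate assignments survive: each array independently maps its two dimensions to the two partitions in one of two orders. The edge constraints and the objective then choose among these.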
52 [Figure: alignment graphs over the dimensions a1, a2, b1, b2 and the resulting partitions, as printed: { a1 b1 a2 b2 }, { a1 b2 a2 b1 }, { a1 a2 b2 b1 }, { a1 a2 b1 b2 }.]
66
      TEMPLATE PROG_TEMPLATE(N, N, N)
      ALIGN A(I, J, K) WITH PROG_TEMPLATE(I, J, K)

      do 10 k = 1, N
        do 10 j = 1, N
          do 10 i = j, N
   10 A(i, j, k) = ...

[Figure: array A drawn with axes I, J, K.]
67 PCFG and candidate layout search spaces
[Figure: PCFG with nodes 1 to 8 (DO iter = 1, max), each labeled c, a, b or c, b; each node p has a candidate layout search space P1 ... P8.]

Alignment shared by the candidates:
      TEMPLATE PROG(N, N)
      ALIGN c(I, J), a(I, J), b(I, J) WITH PROG(I, J)

Candidate distributions:
      PROCESSORS PROCS(8)
      DISTRIBUTE PROG(BLOCK, *) ONTO PROCS

      PROCESSORS PROCS(8)
      DISTRIBUTE PROG(*, BLOCK) ONTO PROCS

      PROCESSORS PROCS(2, 4)
      DISTRIBUTE PROG(BLOCK, BLOCK) ONTO PROCS

      PROCESSORS PROCS(4, 2)
      DISTRIBUTE PROG(BLOCK, BLOCK) ONTO PROCS
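The four candidate distributions on this slide correspond to the ways of factoring 8 processors into a two-dimensional arrangement. The Python sketch below enumerates such candidates for a given processor count. It is an assumption-laden illustration: the function name `candidate_layouts` is invented, only BLOCK distributions of a two-dimensional template are considered, and the slides do not say that the search space is generated this way.

```python
def candidate_layouts(nprocs):
    """Enumerate BLOCK-style candidates for a 2-D template on nprocs processors.

    Each factorisation nprocs = p1 * p2 gives one candidate. A factor of 1
    means that dimension is not distributed ("*").
    Returns (PROCESSORS declaration, DISTRIBUTE format) pairs as strings.
    """
    layouts = []
    for p1 in range(1, nprocs + 1):
        if nprocs % p1 != 0:
            continue
        p2 = nprocs // p1
        if p2 == 1:
            layouts.append((f"PROCS({nprocs})", "(BLOCK, *)"))
        elif p1 == 1:
            layouts.append((f"PROCS({nprocs})", "(*, BLOCK)"))
        else:
            layouts.append((f"PROCS({p1}, {p2})", "(BLOCK, BLOCK)"))
    return layouts


if __name__ == "__main__":
    for procs, dist in candidate_layouts(8):
        print(f"PROCESSORS {procs}   DISTRIBUTE PROG {dist} ONTO PROCS")
```

For 8 processors this yields the same four candidates as the slide: two one-dimensional distributions and the 2 x 4 and 4 x 2 grids.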
78 [Figure: PCFG (nodes 1 to 8, DO iter = 1, max) next to its DLG. Each phase P1 to P8 has one row-layout node and one column-layout node for the arrays it references (c, a, b or c, b). DLG edges carry remapping costs in multiples of T, as printed: 3T, 3T (max-1), 2T max, T max, 2T. Legend: row layout; column layout; remapping of c, a, and b; remapping of a; remapping of c and b.]
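For a straight-line chain of phases in which one layout is chosen per phase, picking the cheapest combination of execution and remapping costs is a shortest-path problem over a graph shaped like the DLG. The Python sketch below solves that simplified case by dynamic programming. It is only an illustration: every cost in the usage example is invented, the names (`select_layouts`, `remap`) are hypothetical, and per-array remapping, loops, and branches from the slides are not modelled.

```python
def select_layouts(node_cost, remap_cost):
    """Pick one layout per phase minimising node costs plus remapping costs.

    node_cost:  list with one dict per phase, mapping layout name -> cost
    remap_cost: function (previous_layout, next_layout) -> cost
    Returns (total_cost, [layout chosen for each phase]).
    """
    best = dict(node_cost[0])   # cheapest cost of ending phase 0 in each layout
    back = []                   # back[p][layout] = best predecessor layout
    for costs in node_cost[1:]:
        new_best, choice = {}, {}
        for layout, cost in costs.items():
            prev = min(best, key=lambda p: best[p] + remap_cost(p, layout))
            new_best[layout] = best[prev] + remap_cost(prev, layout) + cost
            choice[layout] = prev
        back.append(choice)
        best = new_best

    last = min(best, key=best.get)
    total = best[last]
    path = [last]
    for choice in reversed(back):   # walk predecessors back to phase 0
        path.append(choice[path[-1]])
    path.reverse()
    return total, path


if __name__ == "__main__":
    # Invented costs: two phases that favour "row", then two that favour "column".
    phases = [
        {"row": 1, "column": 10},
        {"row": 1, "column": 10},
        {"row": 10, "column": 1},
        {"row": 10, "column": 1},
    ]

    def remap(src, dst):
        return 0 if src == dst else 3   # hypothetical flat remapping cost

    print(select_layouts(phases, remap))
```

With these made-up numbers, a single remapping between the second and third phase (total cost 7) beats either static layout (total cost 22).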
83 [Figure: three 8-row truth tables over three Boolean variables (F F F through T T T), each labeled p = { v, v, v }. Legend: cost of layout is 0; cost of remapping is 0; cost of layout is 1; cost of remapping is 1.]
84 [Figure: same transcribed content as slide 83.]
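86 [Figure: loop structure. PCFG of a DO i = 1, N loop with nodes 1 to 3; the loop structure DLG (edge weights N, N, N-1) between entry and exit candidate layouts; and the loop summary DLG.]
87 [Figure: branch structure. PCFG with nodes 1 to 4 and an IF ... FI branch; the branch structure DLG over P1 to P4 between entry and exit candidate layouts; and the branch summary DLG.]
88 [Figure: outermost DLG. PCFG nodes 1 to 3 with candidate layouts P1 to P3 between an entry node and an exit node.]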
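90 One 0-1 variable per candidate layout; each phase selects exactly one:
P1 (c, a, b): $x_{11} + x_{12} = 1$
P2 (c, a, b): $x_{21} + x_{22} = 1$
P3 (c, b): $x_{31} + x_{32} = 1$
The two candidates per phase are a row layout and a column layout.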
92 [Figure: IN and OUT constraints tying remapping-edge variables to the layout variables $x_{21}, x_{22}$ (P2), $x_{31}, x_{32}$ (P3), $x_{41}$ (P4), and $x_{51}, x_{52}$ (P5), shown as a compact formulation and as a disaggregated formulation. The inequalities are garbled in the transcription. Legend: row layout; column layout; remapping of c, a, and b; remapping of a; remapping of c and b.]
93 [Figure: remapping of a between layout $x_{22}$ of P2 and layout $x_{41}$ of P4 (P3 references only c and b). The edge variable is tied to $x_{22} + x_{41}$ by two inequalities that are garbled in the transcription.]
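A standard way to state "an edge is switched on iff both of its endpoint layouts are selected" with linear 0-1 constraints is sketched below. The edge variable $z$ is a name introduced here for illustration, and the slide's own inequalities may take a different form.

```latex
z \le x_{22}, \qquad
z \le x_{41}, \qquad
z \ge x_{22} + x_{41} - 1, \qquad
z,\; x_{22},\; x_{41} \in \{0, 1\}
```

If both layouts are selected, the third inequality forces $z = 1$; if either is not, one of the first two forces $z = 0$. Adding the first two inequalities gives the single aggregated constraint $2z \le x_{22} + x_{41}$, which admits the same 0-1 solutions but is weaker once the integrality requirement is relaxed.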
104 [Plot: Training Set for SHIFT Patterns (8 Processors). x-axis: message size in bytes (x 10^4); y-axis: execution time in microseconds (x 10^4). Curves: high latency, unit; high latency, non unit; low latency, non unit; low latency, unit.]
105 [Plot: Training Set for Broadcast Pattern with Unit Stride. x-axis: message size in bytes (x 10^4); y-axis: execution time in microseconds (x 10^4). One curve per processor count (2, 4, 8, and 16 procs are legible).]
106 [Plot: measured time vs. estimated time, in seconds, for double, 16 processors, 512 x 512. Categories: row, column, transpose.]
108 [Plots: measured and estimated execution time in seconds vs. number of processors. Curves: static row, static column, remapped.]
109 [Plots: measured and estimated execution time in seconds vs. number of processors. Curves: static 1. dimension, static 2. dimension, static 3. dimension, remapped.]
110 [Plots: measured execution time in seconds vs. number of processors, with estimates for pre-determined branch probabilities and for default branch probabilities. Curves: static row, static column, remapped.]
112 [Plots: measured and estimated execution time in seconds vs. number of processors. Curves: static row, static column.]