CEE 618 Scientific Parallel Computing (Lecture 7): OpenMP (cont'd) and Matrix Multiplication


1 / 26 CEE 618 Scientific Parallel Computing (Lecture 7): OpenMP (cont'd) and Matrix Multiplication Albert S. Kim Department of Civil and Environmental Engineering University of Hawai'i at Manoa 2540 Dole Street, Holmes 383, Honolulu, Hawaii 96822

2 / 26 Table of Contents 1 Parallel computing terms 2 OpenMP details 3 Matrix-Vector Multiplication 4 Matrix-Matrix Multiplication 5 Lab work

3 / 26 Outline Parallel computing terms 1 Parallel computing terms 2 OpenMP details 3 Matrix-Vector Multiplication 4 Matrix-Matrix Multiplication 5 Lab work

4 / 26 Latency Parallel computing terms The time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the empty message is received by the recipient. Programs that generate large numbers of small messages are sensitive to latency and are called latency-bound programs. 1 Network latency = the time to make a message transfer through the network. 2 Communication latency = the total time to send the message, including the software overhead and interface delays. 3 Message latency, or startup latency = the time to send a zero-length message, which is essentially the software and hardware overhead in sending a message (finding the route, packing, unpacking, etc.), onto which must be added the actual time to send the data along the interconnection path.

5 / 26 Bandwidth Parallel computing terms The capacity of a system, usually expressed as items per second. In parallel computing, the most common usage of the term bandwidth is in reference to the number of bytes per second that can be moved across a network link. A parallel program that generates relatively small numbers of huge messages may be limited by the bandwidth of the network, in which case it is called a bandwidth-limited program.
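Both quantities can be estimated with a ping-pong test between two MPI processes. The following is a minimal, hypothetical Fortran/MPI sketch (not one of the course codes; the message size and repetition count are arbitrary choices). Run with two ranks: a zero-length round trip approximates the message latency, and a large round trip approximates the attainable bandwidth.

program pingpong
  use mpi
  implicit none
  integer, parameter :: nbytes = 1000000          ! ~1 MB payload for the bandwidth test
  integer :: ierr, rank, nreps, i, status(MPI_STATUS_SIZE)
  character :: buf(nbytes)
  double precision :: t0, t1

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  nreps = 1000

  ! Message latency: time zero-length round trips and halve the average
  t0 = MPI_WTIME()
  do i = 1, nreps
     if (rank == 0) then
        call MPI_SEND(buf, 0, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_RECV(buf, 0, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        call MPI_RECV(buf, 0, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(buf, 0, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_WTIME()
  if (rank == 0) write(*,*) 'latency (s) ~', (t1 - t0) / (2.0d0 * nreps)

  ! Bandwidth: time large round trips and divide the bytes moved by the elapsed time
  t0 = MPI_WTIME()
  do i = 1, nreps
     if (rank == 0) then
        call MPI_SEND(buf, nbytes, MPI_CHARACTER, 1, 1, MPI_COMM_WORLD, ierr)
        call MPI_RECV(buf, nbytes, MPI_CHARACTER, 1, 1, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        call MPI_RECV(buf, nbytes, MPI_CHARACTER, 0, 1, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(buf, nbytes, MPI_CHARACTER, 0, 1, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_WTIME()
  if (rank == 0) write(*,*) 'bandwidth (bytes/s) ~', 2.0d0 * dble(nbytes) * nreps / (t1 - t0)

  call MPI_FINALIZE(ierr)
end program pingpong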

6 / 26 Outline OpenMP details 1 Parallel computing terms 2 OpenMP details 3 Matrix-Vector Multiplication 4 Matrix-Matrix Multiplication 5 Lab work

7 / 26 opi01.f90 OpenMP details

h = 1.0d0 / dble(n)
sum = 0.0d0
!$omp parallel private(i, x) shared(h, fx, n)
!$omp do
do i = 1, n
   x = h * (dble(i) - 0.5d0)
   fx(i) = f(x)
enddo
!$omp end do
!$omp end parallel
do i = 1, n
   sum = sum + fx(i)
end do
pi = h * sum

1. Surround the do loop with parallel and end parallel directives.
2. To prevent all threads from executing this loop redundantly, the loop is further enclosed by do and end do directives so that the loop iterations are distributed to all processors (worksharing).

8 / 26 opi01.f90 (cont'd) OpenMP details 3. Since by default all variables are accessible by all threads (shared), the exceptions have to be taken care of. A first candidate for privatization is the loop index i: if the loop iterations are to be distributed, the loop index has to be private. This is realized through a private clause of the parallel directive. 4. A second candidate for privatization is the variable x, which is used to temporarily store the node of the quadrature formula. This happens independently for each loop iteration, and the variable's contents are not needed after the loop. 5. If sum were accumulated inside the parallel loop while shared, all the threads would compete with each other to read its value and add the newly calculated value of f(x), and the value of sum could be corrupted. Instead, each f(x) is stored in the array fx; the indices of fx accessed by the threads are different, so fx can safely remain shared.

9 / 26 opi02.f90 OpenMP details

sum = 0.0d0
!$omp parallel
!$omp do private(i, x)
do i = 1, n
   x = h * (dble(i) - 0.5d0)
!$omp critical
   sum = sum + f(x)
!$omp end critical
enddo
!$omp end do
!$omp end parallel

1. Critical regions are segments of code that can be executed by only one thread at a time.
2. This version, however, involves considerable overhead, because it introduces a synchronization with every iteration of the loop.

10 / 26 opi03.f90 OpenMP details

!$omp parallel private(i, x, sum_local)
sum_local = 0.0d0
!$omp do
do i = 1, n
   x = h * (dble(i) - 0.5d0)
   sum_local = sum_local + f(x)
enddo
!$omp end do
!$omp critical
sum = sum + sum_local
!$omp end critical
!$omp end parallel

1. The next version introduces an additional private variable, in which the individual threads sum up their contributions.
2. The total sum is then computed in a critical region after the parallel loop: one thread at a time.

11 / 26 opi04.f90 OpenMP details

!$omp parallel private(i, x)
!$omp do reduction(+:sum)
do i = 1, n
   x = h * (dble(i) - 0.5d0)
   sum = sum + f(x)
enddo
!$omp end do
!$omp end parallel

1. Analogous to the reduction function in MPI: a reduction clause of the do directive.
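For comparison, a minimal MPI sketch of the same reduction idea is shown below (not one of the course codes; the integrand 4/(1+x^2) and the round-robin distribution of iterations are assumptions made for illustration). Each rank accumulates a partial sum, and MPI_REDUCE combines the partial sums on rank 0, playing the role of reduction(+:sum).

program mpi_pi
  use mpi
  implicit none
  integer, parameter :: n = 1000000
  integer :: ierr, myid, nprocs, i
  double precision :: h, x, sum_local, pi

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  h = 1.0d0 / dble(n)
  sum_local = 0.0d0
  ! Round-robin distribution of the loop iterations among the ranks
  do i = myid + 1, n, nprocs
     x = h * (dble(i) - 0.5d0)
     sum_local = sum_local + 4.0d0 / (1.0d0 + x*x)   ! assumed integrand f(x)
  end do

  ! Combine the partial sums on rank 0: the MPI analogue of reduction(+:sum)
  call MPI_REDUCE(sum_local, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (myid == 0) write(*,*) 'pi ~', h * pi

  call MPI_FINALIZE(ierr)
end program mpi_pi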

12 / 26 opi05.f90 OpenMP details

!$omp parallel do private(i, x) reduction(+:sum)
do i = 1, n
   x = h * (dble(i) - 0.5d0)
   sum = sum + f(x)
enddo

1. The shortest form: parallel and do are combined into a single parallel do directive.

13 / 26 opi06.f90 OpenMP details

!$OMP PARALLEL private(myid)
myid = OMP_GET_THREAD_NUM()
numthreads = omp_get_num_threads()
write(*,*) myid, numthreads
!$OMP DO SCHEDULE(STATIC, numthreads) reduction(+:sum)
do i = 1, n
   x = h * (dble(i) - 0.5d0)
   sum = sum + f(x)
enddo
!$OMP end do
!$OMP END PARALLEL

14 / 26 opi06.f90 OpenMP details SCHEDULE(type, chunk): the iterations in the work-sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of scheduling are: 1 static: all the iterations are allocated to threads before they execute the loop. By default, the iterations are divided equally among the threads. 2 dynamic: only some of the iterations are allocated to threads at first. Once a particular thread finishes its allocated iterations, it returns to get more from the iterations that are left. The parameter chunk defines the number of contiguous iterations allocated to a thread at a time. 3 guided: a large chunk of contiguous iterations is allocated to each thread dynamically (as above), and the chunk size decreases exponentially with each successive allocation, down to the minimum size specified by the parameter chunk.
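A minimal sketch of how these clauses appear in code (not one of the course codes; the loop, the chunk size of 4, and the variable names are chosen only for illustration): the program records which thread executed each iteration under dynamic scheduling, and the closing comments note how the static and guided variants differ.

program schedule_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 16
  integer :: i, owner(n)

  ! Record which thread executes each iteration when chunks of 4 contiguous
  ! iterations are handed out dynamically.
  !$omp parallel do schedule(dynamic, 4)
  do i = 1, n
     owner(i) = omp_get_thread_num()
  end do
  !$omp end parallel do

  do i = 1, n
     write(*,*) 'iteration', i, 'ran on thread', owner(i)
  end do

  ! schedule(static, 4) would instead assign the chunks in a fixed round-robin
  ! order before the loop starts; schedule(guided, 4) starts with large chunks
  ! that shrink toward the minimum size of 4.
end program schedule_demo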

15 / 26 Outline Matrix-Vector Multiplication 1 Parallel computing terms 2 OpenMP details 3 Matrix-Vector Multiplication 4 Matrix-Matrix Multiplication 5 Lab work

16 / 26 Matrix-Vector Multiplication: Basics

    A = | 1  3  1 |      x = | 2 |      b = A x                  (1)
        | 1  1  2 |          | 0 |
        | 2  3  4 |          | 1 |

This means that each component of b is the dot product of the corresponding row of A with x, b_i = sum_j A_ij x_j:

    b = | 1*2 + 3*0 + 1*1 |   | 3 |
        | 1*2 + 1*0 + 2*1 | = | 4 |
        | 2*2 + 3*0 + 4*1 |   | 8 |

17 / 26 Matrix-Vector Multiplication

program MatMul01
  implicit none
  integer(kind=8) :: i, j
  real(kind=16)   :: A(3,3), b(3), x(3), sum
  ! Column-wise initialization
  data x / 2., 0., 1. /
  data A / 1., 1., 2., 3., 1., 3., 1., 2., 4. /
  ! Matrix multiplication b = A x
  do i = 1, 3
     sum = 0.d0
     do j = 1, 3
        sum = sum + A(i,j) * x(j)
     enddo
     b(i) = sum
  enddo
  ! Display A, x, and b = A x
  open(10, file='matmul00.dat')
  do i = 1, 3
     write(10, "((5(2x,F12.6)))") (A(i,j), j=1,3), x(i), b(i)
  enddo
  close(10)
  stop
end

18 / 26 Matrix-Vector Multiplication

program MatMul01
  implicit none
  integer(kind=8) :: i, j
  real(kind=16)   :: A(3,3), b(3), x(3)
  ! Column-wise initialization
  data x / 2., 0., 1. /
  data A / 1., 1., 2., 3., 1., 3., 1., 2., 4. /
  ! Matrix multiplication b = A x using the intrinsic matmul
  b = matmul(A, x)
  ! Display A, x, and b = A x
  open(10, file='matmul01.dat')
  do i = 1, 3
     write(10, "((5(2x,F12.6)))") (A(i,j), j=1,3), x(i), b(i)
  enddo
  close(10)
  stop
end
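Since the lecture topic is OpenMP, a natural next step is to parallelize the outer loop of the hand-written product. The following is a minimal sketch (not one of the course codes), assuming the same 3x3 system; for such a small matrix the threading overhead dominates, so it is meant only to show the directive placement.

program MatVecOMP
  implicit none
  integer :: i, j
  real(kind=8) :: A(3,3), b(3), x(3), rowsum
  ! Same 3x3 system as above, column-wise initialization
  data x / 2.d0, 0.d0, 1.d0 /
  data A / 1.d0, 1.d0, 2.d0, 3.d0, 1.d0, 3.d0, 1.d0, 2.d0, 4.d0 /

  ! Each b(i) is independent, so the outer loop can be shared among threads;
  ! i, j, and rowsum are private, while A, x, and b stay shared.
  !$omp parallel do private(i, j, rowsum)
  do i = 1, 3
     rowsum = 0.d0
     do j = 1, 3
        rowsum = rowsum + A(i,j) * x(j)
     enddo
     b(i) = rowsum
  enddo
  !$omp end parallel do

  write(*,*) b   ! expected: 3.0  4.0  8.0
end program MatVecOMP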

19 / 26 Outline Matrix-Matrix Multiplication 1 Parallel computing terms 2 OpenMP details 3 Matrix-Vector Multiplication 4 Matrix-Matrix Multiplication 5 Lab work

20 / 26 Matrix-Matrix Multiplication: A B = C?

    A = | 1 2 3 4 5 6 7 8 |        B = | 1 1 1 1 1 1 1 1 |
        | 1 2 3 4 5 6 7 8 |            | 2 2 2 2 2 2 2 2 |
        |      ...        |            |       ...       |
        | 1 2 3 4 5 6 7 8 |            | 8 8 8 8 8 8 8 8 |        (5)

where A and B are n x n square matrices with n = 8 (every row of A is 1 2 ... 8, and row i of B is all i's, i.e. B = A^T). What is C?

    C_ij = sum_k A_ik B_kj = 1^2 + 2^2 + ... + 8^2 = 204

21 / 26 Matrix-Matrix Multiplication

program MatMul01
  implicit none
  integer(kind=8) :: i, j, k
  real(kind=16)   :: A(8,8), B(8,8), C(8,8), sum, Csum
  do i = 1, 8
     do j = 1, 8
        A(i,j) = dble(j)
     enddo
  enddo
  B = transpose(A)
  do i = 1, 8
     do j = 1, 8
        sum = 0.d0
        do k = 1, 8
           sum = sum + A(i,k) * B(k,j)
        enddo
        C(i,j) = sum
     enddo
  enddo
  Csum = 0.d0
  do i = 1, 8
     do j = 1, 8
        Csum = Csum + C(i,j)
     enddo
  enddo
  write(*,*) Csum
  ! Display C
  open(10, file='matmul02.dat')
  do i = 1, 8
     write(10, "((8(2x,F8.2)))") (C(i,j), j=1,8)
  enddo
  close(10)
  stop
end
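An OpenMP version of the triple loop is a small change: each C(i,j) is computed independently, so the (i,j) work can be shared among threads, and the final summation can reuse the reduction clause from opi04.f90. The program below is a minimal sketch under those observations (not one of the course codes; real(kind=8) is used here for portability). Since every entry of C equals 204, the printed Csum should be 64 * 204 = 13056.

program MatMulOMP
  implicit none
  integer :: i, j, k
  real(kind=8) :: A(8,8), B(8,8), C(8,8), sum, Csum

  do i = 1, 8
     do j = 1, 8
        A(i,j) = dble(j)
     enddo
  enddo
  B = transpose(A)

  ! Each C(i,j) is independent, so the outer loop is shared among threads;
  ! j, k, and the accumulator sum must be private.
  !$omp parallel do private(i, j, k, sum)
  do i = 1, 8
     do j = 1, 8
        sum = 0.d0
        do k = 1, 8
           sum = sum + A(i,k) * B(k,j)
        enddo
        C(i,j) = sum
     enddo
  enddo
  !$omp end parallel do

  ! Sum all entries of C with a reduction, as in opi04.f90
  Csum = 0.d0
  !$omp parallel do private(i, j) reduction(+:Csum)
  do i = 1, 8
     do j = 1, 8
        Csum = Csum + C(i,j)
     enddo
  enddo
  !$omp end parallel do

  write(*,*) Csum   ! expected: 64 * 204 = 13056
end program MatMulOMP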

22 / 26 Outline Lab work 1 Parallel computing terms 2 OpenMP details 3 Matrix-Vector Multiplication 4 Matrix-Matrix Multiplication 5 Lab work

23 / 26 Lab work: PBS practice, MPI

After you log in to fractal, do the following:

$ mkdir samples
$ cd samples
$ cp -r /opt/cee618s13/samples/* ./
$ cd pbs
$ cd mpintel
$ make
$ make que
$ watch qstat
$ ls -l
$ cat pbs-mypi*

24 / 26 Lab work: PBS practice, MPI (cont'd)

There are two PBS files, PRLsample16.pbs and PRLsample32.pbs.
1 PRLsample16.pbs is for runs using 16 or fewer processors.
2 PRLsample32.pbs is for runs using 17 to 32 processors.
A new customized script, mpirun.sh, will be used.
1 Usage for 8 processors: mpirun.sh -np 8 ./fpi.x

25 / 26 Lab work: PBS practice, OpenMP

After the MPI practice, do the following:

$ cd ../openmp
$ make
$ make run

A new customized script, omprun.sh, will be used. This script automatically connects to the compute-1-16 node and runs the OpenMP program. The number of threads is specified with the -n option; if -n is not given, the default number of threads is 1.
1 Usage: omprun -n <thread number> ./myprogram
2 Example: omprun -n 8 ./omp7.x

26 / 26 Lab work: PBS practice, Serial

After the OpenMP practice, do the following:

$ cd ../serial
$ qsub SRLsample1.pbs

SRLsample1.pbs can be used to submit serial jobs after the command on its last line is modified.