Lecture 19. Architectural Directions


Today's lecture
Advanced Architectures: NUMA, Blue Gene

Announcements
Final examination: Thursday, March 17, in this room, 3pm to 6pm. You may bring your textbook and one piece of notebook-sized paper.
Office hours during examination week: Wednesday 11am to 12 noon and 4pm to 5pm, or by appointment.

NUMA Architectures

NUMA Architectures
The address space is global to all processors: distributed shared memory. A directory keeps track of sharers, and point-to-point messages manage coherence. Examples: Stanford DASH, SGI UV, Altix, Origin 2000.

Inside a directory
Each processor has a 1-bit sharer (presence) entry in the directory. There is also a dirty bit and a PID identifying the owner in the case of a dirty block. Every block of memory has a home and an owner; initially home = owner, but this can change. [Figure: memory with its directory of presence bits and a dirty bit, after Parallel Computer Architecture, Culler, Singh & Gupta]
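To make the bookkeeping concrete, here is a minimal sketch in C of what one directory entry might hold for the 4-processor system used on the following slides. The field names are illustrative, not taken from any particular machine.

#include <stdbool.h>

#define NPROCS 4                /* 4-processor system, as in the examples below */

/* Hypothetical directory entry, one per block of memory, kept at the home node. */
typedef struct {
    bool presence[NPROCS];      /* 1-bit sharer entry per processor */
    bool dirty;                 /* set when exactly one cache owns the block */
    int  owner;                 /* PID of the owner when the block is dirty */
} dir_entry;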

Operation of a directory
Assume a 4-processor system (only P0 and P1 are shown). A is a location whose home is P1. The initial directory entry for the block containing A is empty. [Figure: P0 and P1, each with memory and cache; the entry for A on P1 shows presence bits 0 0 0 0]

Operation of a directory: P0 loads A
The directory entry for A (on P1) is set to indicate that P0 is a sharer. [Figure: presence bits 1 0 0 0]

Operation of a directory: P2 and P3 load A (not shown)
The directory entry for A (on P1) is updated to indicate that P2 and P3 are also sharers. [Figure: presence bits 1 0 1 1]

Acquiring ownership of a block: P0 writes A
P0 becomes the owner of A. [Figure: presence bits still 1 0 1 1 before the write completes]

Acquiring ownership of a block (continued)
P0 becomes the owner of A. P1's directory entry for A is set to Dirty with owner P0, the outstanding sharers are invalidated, and access to the line is blocked until all invalidations are acknowledged. [Figure: presence bits cleared to 0 0 0 0, entry marked D with owner P0]

Change of ownership
P0 stores into A (P0 is home and owner). P1 then stores into A and becomes the owner. P2 loads A. [Figure: the directory entry at the home (P0) is marked D with owner P1, A is dirty in P1's cache, and P2 issues Load A]

Forwarding
P0 stores into A (P0 is home and owner). P1 stores into A and becomes the owner. When P2 loads A, the home (P0) forwards the request to the owner (P1). [Figure: as on the previous slide, with the load request forwarded from P0 to P1]
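The sequence on the preceding slides can be replayed with a few lines of C. This is a minimal sketch of the directory state transitions only (no caches and no network messages); the function names and the invalidate-then-acknowledge handling are illustrative rather than the protocol of any specific machine.

#include <stdio.h>

#define NPROCS   4
#define NO_OWNER (-1)

typedef struct {              /* directory entry for one block, kept at its home node */
    int presence[NPROCS];     /* presence bit per processor */
    int dirty;                /* 1 => a single cache owns the block */
    int owner;                /* owning PID when dirty */
} dir_entry;

static void show(const dir_entry *d, const char *what) {
    printf("%-26s bits:", what);
    for (int p = 0; p < NPROCS; p++) printf(" %d", d->presence[p]);
    printf("  dirty:%d owner:%d\n", d->dirty, d->owner);
}

/* A read miss: the reader becomes a sharer.  If some cache holds the block
   dirty, the home forwards the request to the owner and the block returns
   to the shared state. */
static void dir_read(dir_entry *d, int pid) {
    if (d->dirty) {
        printf("home forwards read miss from P%d to owner P%d\n", pid, d->owner);
        d->presence[d->owner] = 1;     /* the old owner keeps a clean copy */
        d->dirty = 0;
        d->owner = NO_OWNER;
    }
    d->presence[pid] = 1;
}

/* A write miss: every other copy is invalidated and the writer becomes the
   dirty owner.  A real protocol blocks access to the line until all
   invalidations are acknowledged. */
static void dir_write(dir_entry *d, int pid) {
    if (d->dirty && d->owner != pid)
        printf("fetch and invalidate the dirty copy at old owner P%d\n", d->owner);
    for (int p = 0; p < NPROCS; p++) {
        if (p != pid && d->presence[p])
            printf("invalidate the copy in P%d (wait for ack)\n", p);
        d->presence[p] = 0;
    }
    d->dirty = 1;
    d->owner = pid;
}

int main(void) {
    dir_entry A = { {0, 0, 0, 0}, 0, NO_OWNER };   /* empty entry at the home node */
    dir_read(&A, 0);                    show(&A, "P0 loads A");
    dir_read(&A, 2); dir_read(&A, 3);   show(&A, "P2, P3 load A");
    dir_write(&A, 0);                   show(&A, "P0 writes A");
    dir_write(&A, 1);                   show(&A, "P1 writes A");
    dir_read(&A, 2);                    show(&A, "P2 loads A (forwarded)");
    return 0;
}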

Performance issues
Locality, locality, locality. False sharing.

Case Study: SGI Origin 2000

Origin 2000 Interconnect

Locality

Poor Locality

Quick primer on paging
We group the physical and virtual address spaces into units called pages. Pages are backed up on disk. Virtual-to-physical translation is performed through the Translation Lookaside Buffer (TLB), which caches entries from the page tables set up by the OS. When we allocate a block of memory, we don't need to allocate physical storage for its pages right away; we do it on demand.
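As a small illustration of allocation on demand (a sketch assuming Linux, where anonymous mmap memory is backed by physical page frames only when first touched):

#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;     /* reserve 1 GB of virtual address space */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    /* No physical pages are allocated yet; the kernel assigns a frame to each
       page the first time it is touched (here, only the first 16 MB). */
    memset(buf, 1, 16UL << 20);

    munmap(buf, len);
    return 0;
}

On a NUMA machine with a first-touch policy, the processor that performs this first touch also determines which node's memory backs the page, which is why the initialization strategies discussed next matter.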

Remote access latency
When we allocate a block of memory, which processor(s) end up owning its pages? Page allocation policies: first touch, round robin. Related issues: page placement and page migration, copying vs. redistribution, and data layout.

Example
Consider the following loop:
for r = 0 to nreps
    for i = 0 to n-1
        a[i] = b[i] + q*c[i]

Page Migration
[Figure: performance of a[i] = b[i] + q*c[i] under different policies: round-robin initialization with migration, parallel initialization, serial initialization (no migration), and parallel first-touch initialization (no migration). Source: techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/sgi_developer/books/oron2_pftune/sgi_html/ch08.html#id5224855]
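Here is a minimal OpenMP sketch of the "parallel initialization, first touch" case: when each thread initializes the same elements it will later update, the pages of a, b, and c end up in the memory of the node that uses them, assuming a first-touch placement policy. The array size, repetition count, and static schedule are illustrative.

#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double q = 3.0;

    /* Parallel first-touch initialization: the thread that first touches a
       page causes it to be placed in that thread's local memory. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        a[i] = 0.0; b[i] = (double)i; c[i] = 1.0;
    }

    /* The compute loop uses the same static schedule, so each thread mostly
       touches pages that are local to its node. */
    for (int r = 0; r < 100; r++) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];
    }

    free(a); free(b); free(c);
    return 0;
}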

Cumulative effect of Page Migration

Eliminating false sharing

False sharing
Successive writes by P0 and P1 to different words of the same cache line cause the processors to uselessly invalidate one another's copies. [Figure: P0 and P1 writing to the same line]

An example of false sharing
float a[m,n], s[m]
// Outer loop is in parallel
// Consider m=4, 128 byte cache line size
// Thread i updates element s[i]
#pragma omp parallel for private(i,j), shared(s,a)
for i = 0, m-1
    s[i] = 0.0
    for j = 0, n-1
        s[i] += a[i,j]
    end for
end for

Avoiding false sharing
Pad s so that each row occupies its own 128-byte cache line; each thread updates only the first element of its row.
float a[m,n], s[m,32]
#pragma omp parallel for private(i,j), shared(s,a)
for i = 0, m-1
    s[i,1] = 0.0
    for j = 0, n-1
        s[i,1] += a[i,j]
    end for
end for
[Figure: four rows of s, each padded to elements 0..31 so that every row fills one cache line]
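The same idea written as compilable C with OpenMP, as a sketch: it assumes the 128-byte line size from the example, so each row of s is padded to 32 floats, and thread i accumulates into s[i][0] only.

#include <stdio.h>

#define M   4
#define N   1000
#define PAD 32              /* 32 floats = 128 bytes: one full cache line per row */

float a[M][N];
float s[M][PAD];            /* only s[i][0] is used; the rest of the row is padding */

int main(void) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0f;

    /* Each thread now writes a location in its own cache line, so its updates
       no longer invalidate the lines holding the other threads' sums. */
    #pragma omp parallel for
    for (int i = 0; i < M; i++) {
        s[i][0] = 0.0f;
        for (int j = 0; j < N; j++)
            s[i][0] += a[i][j];
    }

    for (int i = 0; i < M; i++)
        printf("s[%d] = %f\n", i, s[i][0]);
    return 0;
}

In practice one would also align s to the cache line size, or simply accumulate into a thread-private scalar (or an OpenMP reduction) and write s[i] once at the end.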

Blue Gene
An IBM / US Dept. of Energy collaboration. First generation: Blue Gene/L, with 64K dual-processor nodes and a peak of 180 (360) TeraFlops (1 TeraFlop = 1,000 GigaFlops). Low power; relatively slow processors (PowerPC 440); small memory (256 MB per node); high-performance interconnect.

Current Generation: Blue Gene/P
Largest installation at Argonne National Lab: 294,912 cores. 4-way SMP nodes with PowerPC 450 processors (850 MHz) and 2 GB of memory per node. Peak performance: 13.6 GFlops/node = 557 TeraFlops total = 0.56 Petaflops. http://www.redbooks.ibm.com/redbooks/sg247287

Blue Gene/P Interconnect
3D toroidal mesh (end-around): 5.1 GB/sec bidirectional bandwidth per node (6 bidirectional links @ 425 MB/sec); 5 µs worst-case latency, 0.5 µs best case (nearest neighbor); 3 µs to 10 µs over MPI.
Collective network: broadcast, and reduction for integers and doubles; one-way tree latency 1.3 µs (5 µs in MPI).
Low-latency barrier and interrupt: one way 0.65 µs (1.6 µs in MPI).

Compute nodes
Six connections to the torus network @ 425 MB/sec/link (duplex). Three connections to the global collective network @ 850 MB/sec/link. Network routers are embedded within the processor. http://www.redbooks.ibm.com/redbooks/sg247287

Die photograph (Argonne National Lab)

Programming modes
Virtual node: each node runs 4 MPI processes, one per core; memory and the torus network are shared by all processes, and shared memory is available between processes.
Dual node: each node runs 2 MPI processes, with 1 or 2 threads per process.
Symmetrical Multiprocessing: each node runs 1 MPI process with up to 4 threads.
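In the Symmetrical Multiprocessing mode, the usual pattern is a hybrid program: one MPI process per node with OpenMP threads inside it. A minimal sketch, not specific to Blue Gene; the requested threading level and the 4-thread count are illustrative and would normally come from the job configuration (e.g. OMP_NUM_THREADS=4).

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;

    /* Ask MPI for a thread-support level that allows OpenMP regions in which
       only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One MPI process per node; the OpenMP runtime supplies the threads. */
    #pragma omp parallel
    {
        printf("rank %d: thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}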

Next Generation: Blue Gene/Q
Sequoia, at Lawrence Livermore National Lab: 1.6M cores in 98,304 compute nodes. 20 Petaflops = 2 × 10^16 flops = 20,000 TeraFlops = 20M GFlops. 96 racks, 3,000 square feet, 6 MW (about 7× more power efficient than BG/P).

What is the world's fastest supercomputer? Go to top500.org.
#1: Tianhe-1A (China), 2.57 Petaflops (≈ 2.6M GFlops), Nvidia processors.
#2: Jaguar (US), 1.75 Petaflops, Cray XT5-HE with 6-core Opterons.
#3: Nebulae (China), 1.27 PF.
#4: Tsubame (Japan), 1.19 PF.

Fin