Vector Lane Threading

S. Rivoire, R. Schultz, T. Okuda, C. Kozyrakis
Computer Systems Laboratory, Stanford University

Motivation

- Vector processors excel at data-level parallelism (DLP)
- What happens to program phases with little or no DLP?
- Vector Lane Threading (VLT)
  - Leverages idle DLP resources to exploit thread-level parallelism (TLP)
  - 1.4-2.3x speedup on already optimized code
  - Small increase in system cost
- VLT increases the applicability of vector processors
  - Efficient for both regular & irregular parallelism

Outline

- Motivation
- Vector processor background
  - Generic vector microarchitecture
  - Vector processors and DLP
- Vector lane threading (VLT)
- VLT evaluation
- Conclusions

Vector Microarchitecture Overview

[Block diagram, built up over three slides: a Scalar Unit and a Vector Unit share an L2 Cache / Memory Interface; the scalar processor contains Fetch, D$, and ROB; the vector unit contains Vector Control Logic plus per-lane slices, each with a vector register file (VRF) partition, load/store unit (LSU), and functional units (FUs)]

Vector Efficiency with High DLP

Example kernel (mxm):

    for i = 1 to N
      for j = 1 to N
        for k = 1 to N
          C[i,j] = C[i,j] + A[i,k]*B[k,j]

[Chart: speedup vs. number of vector lanes (1, 2, 4, 8) for mxm and sage]

- Best case for vector processors: long vectors & regular memory patterns
- Lanes execute data-parallel operations very efficiently
  - Low instruction issue rate, simple control, compact code, power efficiency
- Simple model for scalable performance
- Current vector processors have 4 to 16 lanes
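For concreteness, here is a minimal C sketch of such a kernel (not the authors' code; the row-major layout and function signature are assumptions). The long, unit-stride inner loop is exactly the pattern a vectorizing compiler maps directly onto the lanes:

```c
#include <stddef.h>

/* Illustrative matrix multiply: the innermost j-loop is long and
 * unit-stride, so a vectorizing compiler can strip-mine it onto the
 * vector lanes with vector length up to the machine maximum (64 here). */
void mxm(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i*n + k];           /* scalar operand, reused across the row */
            for (size_t j = 0; j < n; j++)   /* unit-stride vector loop */
                C[i*n + j] += a * B[k*n + j];
        }
}
```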

Vector Efficiency with Low DLP

[Chart: speedup vs. number of vector lanes (1, 2, 4, 8) for mpenc, trfd, multprec, bt, radix, ocean, and barnes]

- Low DLP leads to underutilized lanes & memory ports
  - Short vectors
  - No vectors
  - Vector length vs. stride in nested loops (illustrated below)
- Can we improve efficiency for low-DLP cases?
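A hypothetical loop nest (names and sizes invented for illustration) showing the vector-length vs. stride dilemma referred to above: the compiler must vectorize either a short unit-stride loop or a long strided one, and neither uses the lanes well.

```c
/* Only two loops are available: vectorizing the i-loop gives long
 * vectors but stride-NCOLS memory accesses, while vectorizing the
 * unit-stride j-loop gives vectors of only 8 elements, far below a
 * maximum vector length of 64, leaving most lane capacity idle. */
#define NROWS 100000
#define NCOLS 8

void scale_rows(double a[NROWS][NCOLS], const double *s)
{
    for (long i = 0; i < NROWS; i++)      /* long loop, but strided if vectorized */
        for (long j = 0; j < NCOLS; j++)  /* unit-stride, but only 8 iterations */
            a[i][j] *= s[i];
}
```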

Outline

- Motivation
- Vector processor background
- Vector lane threading (VLT)
  - Overview
  - Possible configurations
  - Implementation
- VLT evaluation
- Conclusions

VLT Basics

- Idea: use idle lanes to exploit thread-level parallelism (TLP)
- Sources of TLP (see the sketch after this list)
  - Outer loops in loop nests
  - Non-vectorizable loops
  - Task-level parallelism
- VLT benefits
  - Does not harm high-DLP performance
  - Higher utilization of DLP resources
  - Vector unit can be shared between multiple threads
    - From SMT or CMP scalar processors
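A minimal sketch of the first TLP source, outer-loop parallelism over a vectorized inner loop. OpenMP is used here for brevity (the paper's experiments used ANL macros), and the function and array names are illustrative, not from the paper.

```c
#include <omp.h>

/* TLP + DLP in one loop nest: the outer i-loop is divided among
 * threads, while the unit-stride inner j-loop is vectorized by the
 * compiler. Under 2-thread VLT, each thread's vector instructions
 * would execute on its own half of the lanes. */
void smooth(int n, int m, const double *x, double *y)
{
    #pragma omp parallel for                /* TLP: iterations split across threads */
    for (int i = 1; i < n - 1; i++)
        for (int j = 0; j < m; j++)         /* DLP: vectorized inner loop */
            y[i*m + j] = 0.5 * (x[(i-1)*m + j] + x[(i+1)*m + j]);
}
```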

VLT Configurations: Single Vector Thread

- Original configuration for long vector lengths (high DLP)
- 1 thread running vector instructions on all 8 lanes

Lane assignment: L0 L1 L2 L3 L4 L5 L6 L7 (all serving one thread)

VLT Configurations: 2 Vector Threads

- Medium vector length configuration
- Two threads, each running vector instructions on 4 lanes
- Threads may be controlled by an SMT or CMP scalar processor (more later)

Lane assignment: T0,L0 T0,L1 T0,L2 T0,L3 | T1,L0 T1,L1 T1,L2 T1,L3

VLT Configurations: 4 Vector Threads

- Short vector length configuration
- Four threads, each running vector instructions on 2 lanes
- Threads may be controlled by an SMT or CMP scalar processor (more later)

Lane assignment: T0,L0 T0,L1 | T1,L0 T1,L1 | T2,L0 T2,L1 | T3,L0 T3,L1

VLT Configurations: Pure Scalar Threads

- Scalar (no-vector) configuration
- Eight threads, each running scalar instructions on 1 lane
- Each lane operates as a simple processor

Lane assignment: T0,L0 | T1,L0 | T2,L0 | T3,L0 | T4,L0 | T5,L0 | T6,L0 | T7,L0

VLT vs. SMT

- VLT: exploits TLP on idle DLP resources
  - Different threads on different vector lanes
  - Each thread uses all functional units within a lane
- SMT: exploits TLP on idle ILP resources
  - Different threads on different functional units
  - In a vector processor, each thread uses all lanes
- Notes
  - VLT and SMT are orthogonal & can be combined
  - VLT is more important for a vector processor
    - Several apps run efficiently on a 16-lane vector processor
    - How many apps run efficiently on a 16-way issue processor?

VLT Implementation

[Block diagram, as in the microarchitecture overview: Fetch, Scalar Unit, and ROB alongside Vector Control Logic and per-lane VRF, LSU, and FU slices, over the L2 Cache / Memory Interface]

VLT Implementation: Execution Resources

- Functional units: already available
- Vector registers for additional threads: already available
  - The VRF slice in each lane can store the register elements for each thread
  - Note that each thread uses shorter vectors
- Memory ports: already available
  - Necessary to support indexed and strided patterns even with long vectors

[Diagram: Vector Control Logic with per-lane VRF, LSU, and FU slices]

VLT Implementation: Vector Instruction Bandwidth

- Vector instruction issue bandwidth
  - Must issue vector instructions separately for each thread
  - Either multiplex the single vector control block or replicate it
  - Multiplexing is sufficient, since each vector instruction takes multiple cycles
- Instruction set
  - Minor addition to specify the configuration (number of threads)

[Diagram: Vector Control Logic with per-lane VRF, LSU, and FU slices]

VLT Implementation: Scalar Instruction Bandwidth

[Diagram: scalar processor with Fetch, D$, and ROB feeding the vector unit]

- Must process scalar instructions for multiple vector threads
- Design alternatives
  - Attach the vector unit to an SMT processor
  - Attach the vector unit to a CMP processor: multiple cores share one vector unit
  - Combination of the above (CMT)
  - Heterogeneous CMP: one complex core & multiple simpler cores share a vector unit
- Trade-off: cost vs. utilization of the vector unit

VLT Implementation: Scalar Threads on the Vector Unit

- Challenge: each lane requires 1-2 instructions per clock cycle
  - Much higher instruction bandwidth than with vector threads
  - No point in using multiple scalar cores just to feed the lanes
- Design approach
  - Introduce a small instruction cache in each lane
    - Single-ported; 1 to 4 Kbytes is sufficient
    - Feeds the functional units with instructions in tight loops
  - Cache misses in lanes are handled by the scalar core
    - The scalar core's instruction cache acts as an L2 instruction cache
    - Low miss penalty

Outline

- Motivation
- Vector processor background
- Vector lane threading (VLT)
- VLT evaluation
  - Methodology
  - Vector thread results
  - Scalar thread results
- Conclusions

Methodology

- Simulated processor
  - Based on the Cray X1 ISA
  - Scalar unit: 4-way OOO superscalar
  - Vector registers: 32 vector registers, maximum vector length of 64
  - Vector execution resources: 8 lanes, 3 functional units per lane
- Software environment
  - All code compiled with production Cray C and Fortran compilers
  - Highest level of vector optimizations enabled
  - All speedups reported are in addition to vectorization
  - Used ANL macros for multithreading
    - Could also use OpenMP or other multithreading approaches
- Further details in the paper

Benchmarks

- Nine benchmarks, three categories
  - High DLP (long vectors), medium DLP (short vectors), no DLP
- Examples
  - High DLP: matrix multiply
    - 96% vectorized, average vector length of 64 (the maximum)
  - Medium DLP: trfd (two-electron integral transformation)
    - 76% vectorized, average vector length of 11.2
  - No DLP: ocean (eddy currents in an ocean basin)
    - 0% vectorized
- Note: benchmarks include some sequential portions with no DLP and no TLP

Vector Thread Evaluation

[Chart: speedup of mpenc, trfd, multprec, and bt for the base configuration vs. VLT with 2 threads and VLT with 4 threads]

- Results for the medium-DLP benchmarks on the best scalar-core configuration
- Up to 2.3x performance improvement
  - On top of vectorization for these applications
- Limitations
  - Saturation of vector resources
  - Cache effects, thread overhead, purely sequential portions

VLT Cost Evaluation

Default configuration: scalar unit (4-way OOO) + vector unit

- SMT scalar unit
  - 0.8% area increase for 2 VLT threads
  - 1.3% area increase for 4 VLT threads
- CMP scalar unit (4-way OOO cores)
  - 12.3% area increase for 2 VLT threads
  - 26.9% area increase for 4 VLT threads
- CMT scalar unit (4-way OOO cores, 2 threads/core)
  - 13.8% area increase for 4 VLT threads
- Heterogeneous CMP scalar unit (CMP-h)
  - 3.4% area increase for 2 VLT threads
  - 10.1% area increase for 4 VLT threads

VLT Design Space Evaluation

[Chart: speedup of mpenc, trfd, multprec, and bt under the V2-SMT, V2-CMP, V4-SMT, V4-CMT, V4-CMP, and V4-CMP-h configurations]

- V4-CMT matches V4-CMP in performance
  - At a lower area increase (13.8% vs. 26.9%)
  - Two 4-way OOO processors can saturate an 8-lane vector unit
- V4-SMT outperforms V4-CMP-h
  - The V4-CMP-h configuration suffers from thread imbalance

Scalar Thread Evaluation

[Chart: speedup of radix, ocean, and barnes under a scalar CMP vs. VLT]

- An 8-lane VLT machine operates like an 8-way CMP with very simple cores
  - No out-of-order execution, wide issue, branch prediction, etc.
- Up to 2x speedup over the CMP (4-way cores with 2 threads/core)
  - But performance may vary with sequential code performance

Conclusions

- Vector Lane Threading (VLT)
  - Turns underutilized DLP resources (lanes) into TLP resources
  - Multiple scalar processors share a vector unit
  - The vector unit can be used as a CMP with very simple cores
- Results
  - Up to 2.3x performance for applications with short vectors
  - Up to 2.0x performance for applications with no vectors
  - Cost-effective implementation alternatives
- Overall, VLT improves the applicability of vector processors
  - Good efficiency with both high DLP and low DLP

Acknowledgments

- Cray, for X1 compiler access
- Research funding through DARPA HPCS
- Stanford Graduate Fellowship
- National Science Foundation Fellowships
- Stanford Computer Forum