Cache Contention and Application Performance Prediction for Multi-Core Systems

1 Cache Contention and Application Performance Prediction for Multi-Core Systems
Chi Xu, Xi Chen, Robert P. Dick, Zhuoqing Morley Mao
University of Minnesota, University of Michigan
IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), March

2 Motivation
Multiprocessor architectures (CMP) with shared last-level caches:
+ Inter-process communication
+ Heterogeneous cache allocation
- Contention

3 Motivation
[Figure 1. Impact of stressmark on performance of processes sharing cache with it. Axes: normalized execution time versus cache misses per L2 access. Benchmarks: art, mcf, bzip2, swim, equake, mesa, vpr, ammp, mgrid, applu.]
Paper excerpt: The rest of this paper is organized as follows. Section II presents related work. Sections III and IV motivate and describe CAMP. Section V introduces an automated way to characterize process memory access behavior to permit prediction of cache contention. Section VI presents and discusses the experimental validation process and results. Finally, Section VII summarizes our work.
II. RELATED WORK. Past work [6], [7], [8], [9] has considered the problem of adjusting cache partitioning during run time after process assignment decisions have already been made. In contrast, the goal of our work is to predict the performance implications of process assignment decisions before execution. Other researchers have developed performance prediction models to guide process assignment. However, most [10], ... addressed cache contention only for uniprocessors on which only a single process may run at a time. The move to CMPs will aggravate the cache contention problem since multiple processes can run on different cores simultaneously. Resource contention models for simultaneous multithread... models use the reuse distances and/or circular sequence profiles for each thread to predict inter-thread cache contention. These models require knowledge of the steady-state L2 cache access frequency of a process when concurrently running ...

4 Goal
Model cache contention.
Easy and automatic.
No modifications to existing hardware or operating system.
No exhaustive offline simulation.
Complementary to existing work.

5 Analytical Model: System
N-core processor.
On-chip last-level L2 cache: set-associative (ways = lines per set), LRU replacement policy, shared among cores, no prefetching.
Applications in steady state.
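The system assumptions above can be illustrated with a minimal sketch of one set of an A-way LRU cache. The class and function names (`LRUSet`, `miss_count`) are hypothetical and are not taken from the slides; the sketch only shows how LRU replacement decides hits and misses for line tags that map to the same set.

```python
from collections import OrderedDict


class LRUSet:
    """One set of an A-way set-associative cache with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # least recently used line first

    def access(self, tag):
        """Return True on a hit and False on a miss, updating the LRU order."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # now the most recently used line
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = None
        return False


def miss_count(trace, ways):
    """Count misses for a sequence of line tags that map to the same set."""
    cache = LRUSet(ways)
    return sum(0 if cache.access(tag) else 1 for tag in trace)
```

A cyclic trace over three lines misses on every access with two ways but only on the first three accesses with three ways, which is the kind of sensitivity to the number of occupied ways that the model captures.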

8 Analytical Model: Applications I
$N$: total number of processes.
$S_i$: effective cache size of process $i$ (ways occupied by $i$).
$A$: associativity of the cache.
$\sum_{i=1}^{N} S_i = A$
$\mathrm{MPA}_i(S_i) = \int_{S_i}^{\infty} \mathrm{hist}_i(x)\,dx$
$\mathrm{MPA}_i$: probability of a cache miss for process $i$.
$\mathrm{hist}_i$: linear interpolation of the reuse distance histogram.
[Figure 2. Cache line reuse distance histogram for mcf application. Axes: reuse distance versus probability (%).]
Paper excerpt: ... execution time of art increased by 120% while that of ... Reuse Distance: We define the reuse distance of cache line j to be the number of distinct cache lines in the same set accessed between two consecutive accesses to line j. A reuse distance histogram represents the distribution of cache line reuse distances for an entire ... Given an A-way set-associative cache, Figure 2 shows the reuse distance histogram for the mcf application (see Section ...). The x-axis shows the reuse distance and the y-axis the normalized frequencies of the associated reuse distances. The first bar in the histogram, i.e., hist 1, gives the probability that a most-recently-used line will be accessed again; the last bar, i.e., hist 13+, gives the probability that the line for the next cache access does not exist in the ...
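The miss probability for a given effective cache size can be sketched as a tail sum over the reuse distance histogram. This is a simplified illustration rather than the authors' implementation: it assumes the bars are indexed 1..A plus one overflow bar (as with hist 1 through hist 13+ for a 12-way cache), and it interpolates linearly between integer cache sizes instead of integrating an interpolated histogram. The function name `mpa` and the sample histogram are hypothetical.

```python
def mpa(hist, s):
    """Miss probability for an effective cache size of s ways.

    hist[k] is the probability of bar k + 1 (bars 1..A); the last entry is
    the overflow bar (A + 1 or more), which misses even with all A ways.
    """
    ways = len(hist) - 1
    if not 0 <= s <= ways:
        raise ValueError("s must lie in [0, A]")

    def tail(k):
        # Probability that the accessed line sits beyond the first k bars.
        return sum(hist[k:])

    lo = int(s)
    hi = min(lo + 1, ways)
    frac = s - lo
    # Linear interpolation between the two neighbouring integer sizes.
    return (1 - frac) * tail(lo) + frac * tail(hi)


# Hypothetical 3-way example: bars 1..3 followed by the overflow bar.
example_hist = [0.5, 0.2, 0.1, 0.2]
```

With the sample histogram, `mpa(example_hist, 3)` is the overflow bar alone, and `mpa(example_hist, 1.5)` lies halfway between the values for one and two ways.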

12 Analytical Model: Applications II (cache accesses)
$\mathrm{APS} = \mathrm{API} / \mathrm{SPI}$
$\mathrm{SPI} = \alpha \cdot \mathrm{MPA} + \beta$
APS: accesses per second. API: accesses per instruction (fixed for each application). SPI: seconds per instruction. $\alpha$: off-chip latency (memory, disk). $\beta$: on-chip latency (computation).
$G_i(n) = \sum_{s=1}^{n} P_{s,n} \cdot s$
$G_i(n)$: effective cache size of process $i$ after $n$ accesses. $P_{s,n}$: probability of having $s$ cache lines after $n$ consecutive accesses.
Steady state: $n = G_i^{-1}(S_i)$
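The expectation $G_i(n)$ can be sketched numerically once a rule for $P_{s,n}$ is chosen. The slides do not state how $P_{s,n}$ is obtained, so the recurrence below is an assumption made for illustration: with $s$ lines held, the next access adds a new line with probability `grow_prob[s]` (for example the miss probability at size $s$) and otherwise leaves the occupancy unchanged. All function names are hypothetical.

```python
def occupancy_distribution(grow_prob, n):
    """Return P[s], the probability of holding s lines after n accesses.

    grow_prob[s] is the assumed probability that an access adds a new line
    when s lines are held. The largest index is the capacity; its entry is
    ignored because the occupancy cannot grow any further.
    """
    max_s = len(grow_prob) - 1
    p = [1.0] + [0.0] * max_s  # start from an empty cache share
    for _ in range(n):
        nxt = [0.0] * (max_s + 1)
        for s, prob in enumerate(p):
            if s < max_s:
                nxt[s] += prob * (1.0 - grow_prob[s])
                nxt[s + 1] += prob * grow_prob[s]
            else:
                nxt[s] += prob  # share is full, occupancy stays put
        p = nxt
    return p


def effective_size(grow_prob, n):
    """G(n): the sum over s of s * P[s] after n accesses."""
    dist = occupancy_distribution(grow_prob, n)
    return sum(s * prob for s, prob in enumerate(dist))


def accesses_to_reach(grow_prob, target, max_n=10000):
    """Smallest n with G(n) >= target, a discrete stand-in for G^{-1}."""
    for n in range(max_n + 1):
        if effective_size(grow_prob, n) >= target:
            return n
    raise ValueError("target size not reached within max_n accesses")
```

The inverse is found by a plain linear search, which is adequate for a sketch; a real profile would likely tabulate $G_i$ once and interpolate.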

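18 Analytical Model: Applications III
At time $t$ there is a duration $T$ such that data accessed before $t - T$ are evicted from the cache, and data accessed during $[t - T, t]$ are present in the cache.
Assuming all processes are in steady state:
$S_i = G_i(\mathrm{APS}_i \cdot T)$
$\mathrm{APS}_i = G_i^{-1}(S_i) / T$
Reminder: $\mathrm{APS} = \mathrm{API} / \mathrm{SPI}$ and $\mathrm{SPI} = \alpha \cdot \mathrm{MPA} + \beta$, so
$\mathrm{APS}_i = \frac{G_i^{-1}(S_i)}{T} = \frac{\mathrm{API}_i}{\alpha_i \mathrm{MPA}_i(S_i) + \beta_i}$
$T = \frac{G_i^{-1}(S_i)\,(\alpha_i \mathrm{MPA}_i(S_i) + \beta_i)}{\mathrm{API}_i}$
The duration $T$ is the same for every process and the effective cache sizes sum to $A$, which gives the system to solve:
$\forall j = 1, \dots, N:\ \frac{G_1^{-1}(S_1)}{G_j^{-1}(S_j)} = \frac{\mathrm{API}_1\,(\alpha_j \mathrm{MPA}_j(S_j) + \beta_j)}{\mathrm{API}_j\,(\alpha_1 \mathrm{MPA}_1(S_1) + \beta_1)}$ and $\sum_{i=1}^{N} S_i - A = 0$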
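For two processes, the equilibrium condition reduces to finding the split $S_1 + S_2 = A$ at which both processes see the same duration $T$. The sketch below uses bisection, which is an assumption about the solution method and relies on $T$ growing with the effective cache size. The process parameters and the functions standing in for $G^{-1}$ and MPA in `toy_process` are invented for illustration and do not come from measured profiles.

```python
def eviction_window(g_inv, mpa, api, alpha, beta, s):
    """T = G^{-1}(S) * (alpha * MPA(S) + beta) / API for one process."""
    return g_inv(s) * (alpha * mpa(s) + beta) / api


def split_cache(proc1, proc2, ways, tol=1e-9):
    """Bisection for S1 in [0, ways] such that T1(S1) = T2(ways - S1)."""

    def gap(s1):
        return (eviction_window(s=s1, **proc1)
                - eviction_window(s=ways - s1, **proc2))

    lo, hi = 0.0, float(ways)
    if gap(lo) > 0 or gap(hi) < 0:
        raise ValueError("no balance point inside [0, ways]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid  # process 1 still has the shorter window: give it more
        else:
            hi = mid
    s1 = (lo + hi) / 2
    return s1, ways - s1


def toy_process(api):
    """Hypothetical process description; every value here is made up."""
    return dict(g_inv=lambda s: s ** 2,          # accesses needed to hold s ways
                mpa=lambda s: 1.0 / (1.0 + s),   # miss probability at s ways
                api=api, alpha=1e-7, beta=1e-9)
```

Two identical toy processes split a 12-way cache evenly, while a process with a higher API (more accesses per instruction) ends up holding the larger share. For more than two processes the same condition becomes a root-finding problem in several unknowns.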

19 Automated Profiling
Two processes run on separate cores and share an A-way last-level cache: one process uses $l$ ways, the other uses $A - l$ ways.
Stressmark: a synthetic application with configurable cache occupation.
Gather API, MPA, and SPI via hardware performance counters.
Derive the reuse distance histogram, effective cache size (S), $\alpha$, and $\beta$, which together form an application-dependent feature vector.
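The derivation step can be sketched under two assumptions that the slides do not spell out: that $\alpha$ and $\beta$ are obtained by an ordinary least-squares fit of $\mathrm{SPI} = \alpha \cdot \mathrm{MPA} + \beta$ over the profiled points, and that the histogram bars are differences between miss probabilities measured at consecutive way counts. Function names and inputs are hypothetical.

```python
def fit_alpha_beta(mpa_samples, spi_samples):
    """Least-squares fit of SPI = alpha * MPA + beta."""
    n = len(mpa_samples)
    if n < 2 or n != len(spi_samples):
        raise ValueError("need at least two matching samples")
    mean_x = sum(mpa_samples) / n
    mean_y = sum(spi_samples) / n
    sxx = sum((x - mean_x) ** 2 for x in mpa_samples)
    if sxx == 0:
        raise ValueError("MPA samples must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(mpa_samples, spi_samples))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


def histogram_from_miss_curve(mpa_by_ways):
    """Turn miss probabilities at 1..A ways into histogram bars.

    mpa_by_ways[l - 1] is the miss probability with l ways. Bar l is the
    drop in miss probability when going from l - 1 to l ways (with a miss
    probability of 1 at zero ways); the final bar is what still misses
    with all A ways.
    """
    curve = [1.0] + list(mpa_by_ways)
    bars = [curve[l - 1] - curve[l] for l in range(1, len(curve))]
    return bars + [curve[-1]]
```

Each profiling run with the stressmark holding $A - l$ ways would contribute one (MPA, SPI) pair for the fit and one point on the miss curve.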

20 Evaluation
Intel Core 2 Duo P8600 (2 × 2.4 GHz, 3 MB 12-way associative L2 cache).
Mac OS X 10.5.
Profiling via Shark with a period of 2 ms.
Subset of SPEC CPU2000: 5 CPU-intensive + 5 memory-intensive applications.
Each application is run 12 times for 10 s to determine its characteristics.
All 55 pairwise combinations are examined.
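The count of 55 is what ten applications yield when each is also paired with itself (45 distinct pairs plus 10 self-pairs). A short check, using the benchmark names that appear in the deck's tables and figures:

```python
from itertools import combinations_with_replacement

BENCHMARKS = ["art", "mcf", "bzip2", "swim", "equake",
              "mesa", "vpr", "ammp", "mgrid", "applu"]

# Unordered pairs, including a benchmark co-running with a copy of itself.
pairs = list(combinations_with_replacement(BENCHMARKS, 2))
```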

21 Application profiles
[Table II. API, α, and β for different benchmarks: art, mcf, bzip2, swim, equake, mesa, vpr, ammp, mgrid, applu. The numeric values are not preserved.]
[Figure 3. Profiled cache miss rate corresponding to effective cache size. Panels: art, mcf, vpr, mesa, mgrid, swim, ammp, applu. Y-axis: miss rate.]
Paper excerpt: ... proposed by Chandra et al. [5] requires the steady-state cache access frequency of a process to be known a priori. We see no practical way to accurately predetermine this value for ... by AB, MB, and CAMP. AB and MB are not past work. They are in fact alternative prediction models we considered. Table III presents the average prediction error in cache ...

22 Prediction Accuracy
[Table III. Prediction accuracy for cache misses and performance degradation. Column groups: CAMP, AB, MB; each reports MPA and SPI, each with Error (%) and >5% (%). Rows: art, vpr, mcf, ammp, bzip, mesa, swim, equake, applu, mgrid, top 5 average, average. The numeric values are not preserved.]
Paper excerpt: ... also explains why memory-intensive benchmarks have ... estimation error than CPU-intensive benchmarks. In Table III, the bottom 5 benchmarks are either CPU-intensive applications or streaming applications with constant high miss rates, e.g., swim. Their performance estimation errors are ... because it uses monotonic non-linear functions. This may significantly reduce computational cost when the number of cores is large. In addition, since the three models are based on estimating the effective cache sizes of two processes, they give the same results when two instances of art are running ...
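The two column types in Table III can be illustrated with a small helper. The definitions used here are assumptions, since the slide does not define them: "Error (%)" is taken as the mean relative error between predicted and measured values, and ">5% (%)" as the share of co-run cases whose relative error exceeds 5%. The function name is hypothetical.

```python
def accuracy_summary(predicted, measured, threshold=0.05):
    """Return (mean relative error in %, share of cases above threshold in %)."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need matching, non-empty sequences")
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    mean_error = 100.0 * sum(errors) / len(errors)
    share_above = 100.0 * sum(e > threshold for e in errors) / len(errors)
    return mean_error, share_above
```

The same helper would apply to both MPA and SPI predictions, once per benchmark over all of its co-run pairings.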

23 Generality
[Figure: miss rate for art running with art on cache configurations 12-way 3M, 16-way 4M, and 24-way 6M.]

25 Conclusion
Summary:
Predictive model of contention on shared last-level cache.
Automated profiling and extraction of feature vector.
No modification of hardware or operating system.
Average error of <1.6%.
Discussion:
Varying input data.
Benchmarking crimes.
Generalisation.
Practical application.


More information

Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses

Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses Krishnan Sundaresan and Nihar R. Mahapatra Department of Electrical & Computer Engineering Michigan State University, East Lansing,

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Toward Precise PLRU Cache Analysis

Toward Precise PLRU Cache Analysis Toward Precise PLRU Cache Analysis Daniel Grund Jan Reineke 2 Saarland University, Saarbrücken, Germany 2 University of California, Berkeley, USA Workshop on Worst-Case Execution-Time Analysis 2 Outline

More information

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain

More information

UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement

UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement Wuxi Li, Meng Li, Jiajun Wang, and David Z. Pan University of Texas at Austin wuxili@utexas.edu November 14, 2017 UT DA Wuxi Li

More information

Cache-Aware Compositional Analysis of Real- Time Multicore Virtualization Platforms

Cache-Aware Compositional Analysis of Real- Time Multicore Virtualization Platforms University of Pennsylvania ScholarlyCommons Departmental Papers (CIS) Department of Computer & Information Science 12-2013 Cache-Aware Compositional Analysis of Real- Time Multicore Virtualization Platforms

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Enhancing Reuse of Constraint Solutions to Improve Symbolic Execution

Enhancing Reuse of Constraint Solutions to Improve Symbolic Execution Enhancing Reuse of Constraint Solutions to Improve Symbolic Execution Xiangyang Jia (Wuhan University) Carlo Ghezzi (Politecnico di Milano) Shi Ying (Wuhan University) Outline Motivation Logical Basis

More information

Branch History Matching: Branch Predictor Warmup for Sampled Simulation

Branch History Matching: Branch Predictor Warmup for Sampled Simulation Branch History Matching: Branch Predictor Warmup for Sampled Simulation Simon Kluyskens Lieven Eeckhout ELIS Department, Ghent University Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium Email: leeckhou@elis.ugent.be

More information

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach Process Scheduling for RTS Dr. Hugh Melvin, Dept. of IT, NUI,G RTS Scheduling Approach RTS typically control multiple parameters concurrently Eg. Flight Control System Speed, altitude, inclination etc..

More information

High-Performance Computing, Planet Formation & Searching for Extrasolar Planets

High-Performance Computing, Planet Formation & Searching for Extrasolar Planets High-Performance Computing, Planet Formation & Searching for Extrasolar Planets Eric B. Ford (UF Astronomy) Research Computing Day September 29, 2011 Postdocs: A. Boley, S. Chatterjee, A. Moorhead, M.

More information

Complex Dynamics of Microprocessor Performances During Program Execution

Complex Dynamics of Microprocessor Performances During Program Execution Complex Dynamics of Microprocessor Performances During Program Execution Regularity, Chaos, and Others Hugues BERRY, Daniel GRACIA PÉREZ, Olivier TEMAM Alchemy, INRIA, Orsay, France www-rocq.inria.fr/

More information

A Component Model of Spatial Locality

A Component Model of Spatial Locality A Component Model of Spatial Locality Xiaoming Gu Intel China Research Center xiaoming@cs.rochester.edu Ian Christoper Tongxin Bai Department of Computer Science University of Rochester {ichrist2,bai}@cs.rochester.edu

More information

Probabilistic Preemption Control using Frequency Scaling for Sporadic Real-time Tasks

Probabilistic Preemption Control using Frequency Scaling for Sporadic Real-time Tasks Probabilistic Preemption Control using Frequency Scaling for Sporadic Real-time Tasks Abhilash Thekkilakattil, Radu Dobrin and Sasikumar Punnekkat Mälardalen Real-Time Research Center, Mälardalen University,

More information

Throughput Maximization for Intel Desktop Platform under the Maximum Temperature Constraint

Throughput Maximization for Intel Desktop Platform under the Maximum Temperature Constraint 2011 IEEE/ACM International Conference on Green Computing and Communications Throughput Maximization for Intel Desktop Platform under the Maximum Temperature Constraint Guanglei Liu 1, Gang Quan 1, Meikang

More information

Parallel Real-Time Task Scheduling on Multicore Platforms

Parallel Real-Time Task Scheduling on Multicore Platforms Parallel Real-Time Task Scheduling on Multicore Platforms James H. Anderson and John M. Calandrino Department of Computer Science, The University of North Carolina at Chapel Hill Abstract We propose a

More information

arxiv: v1 [cs.os] 6 Jun 2013

arxiv: v1 [cs.os] 6 Jun 2013 Partitioned scheduling of multimode multiprocessor real-time systems with temporal isolation Joël Goossens Pascal Richard arxiv:1306.1316v1 [cs.os] 6 Jun 2013 Abstract We consider the partitioned scheduling

More information

CPU Consolidation versus Dynamic Voltage and Frequency Scaling in a Virtualized Multi-Core Server: Which is More Effective and When

CPU Consolidation versus Dynamic Voltage and Frequency Scaling in a Virtualized Multi-Core Server: Which is More Effective and When 1 CPU Consolidation versus Dynamic Voltage and Frequency Scaling in a Virtualized Multi-Core Server: Which is More Effective and When Inkwon Hwang, Student Member and Massoud Pedram, Fellow, IEEE Abstract

More information

Timing analysis and predictability of architectures

Timing analysis and predictability of architectures Timing analysis and predictability of architectures Cache analysis Claire Maiza Verimag/INP 01/12/2010 Claire Maiza Synchron 2010 01/12/2010 1 / 18 Timing Analysis Frequency Analysis-guaranteed timing

More information

Module 5: CPU Scheduling

Module 5: CPU Scheduling Module 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 5.1 Basic Concepts Maximum CPU utilization obtained

More information

VMware VMmark V1.1 Results

VMware VMmark V1.1 Results Vendor and Hardware Platform: IBM System x3950 M2 Virtualization Platform: VMware ESX 3.5.0 U2 Build 110181 Performance VMware VMmark V1.1 Results Tested By: IBM Inc., RTP, NC Test Date: 2008-09-20 Performance

More information

Chapter 6: CPU Scheduling

Chapter 6: CPU Scheduling Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 6.1 Basic Concepts Maximum CPU utilization obtained

More information

ONLINE SCHEDULING OF MALLEABLE PARALLEL JOBS

ONLINE SCHEDULING OF MALLEABLE PARALLEL JOBS ONLINE SCHEDULING OF MALLEABLE PARALLEL JOBS Richard A. Dutton and Weizhen Mao Department of Computer Science The College of William and Mary P.O. Box 795 Williamsburg, VA 2317-795, USA email: {radutt,wm}@cs.wm.edu

More information

High Performance Computing

High Performance Computing Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 2 nd appello February 11, 2015 Write your name, surname, student identification number (numero di matricola),

More information

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I. Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm

More information

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

A Data Communication Reliability and Trustability Study for Cluster Computing

A Data Communication Reliability and Trustability Study for Cluster Computing A Data Communication Reliability and Trustability Study for Cluster Computing Speaker: Eduardo Colmenares Midwestern State University Wichita Falls, TX HPC Introduction Relevant to a variety of sciences,

More information

A Novel Software Solution for Localized Thermal Problems

A Novel Software Solution for Localized Thermal Problems A Novel Software Solution for Localized Thermal Problems Sung Woo Chung 1,* and Kevin Skadron 2 1 Division of Computer and Communication Engineering, Korea University, Seoul 136-713, Korea swchung@korea.ac.kr

More information

Summarizing Measured Data

Summarizing Measured Data Summarizing Measured Data 12-1 Overview Basic Probability and Statistics Concepts: CDF, PDF, PMF, Mean, Variance, CoV, Normal Distribution Summarizing Data by a Single Number: Mean, Median, and Mode, Arithmetic,

More information

Lecture 2: Metrics to Evaluate Systems

Lecture 2: Metrics to Evaluate Systems Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video

More information

Summarizing Measured Data

Summarizing Measured Data Performance Evaluation: Summarizing Measured Data Hongwei Zhang http://www.cs.wayne.edu/~hzhang The object of statistics is to discover methods of condensing information concerning large groups of allied

More information

Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs

Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Christopher P. Stone, Ph.D. Computational Science and Engineering, LLC Kyle Niemeyer, Ph.D. Oregon State University 2 Outline

More information

Formal Fault Analysis of Branch Predictors: Attacking countermeasures of Asymmetric key ciphers

Formal Fault Analysis of Branch Predictors: Attacking countermeasures of Asymmetric key ciphers Formal Fault Analysis of Branch Predictors: Attacking countermeasures of Asymmetric key ciphers Sarani Bhattacharya and Debdeep Mukhopadhyay Indian Institute of Technology Kharagpur PROOFS 2016 August

More information

Runtime Model Predictive Verification on Embedded Platforms 1

Runtime Model Predictive Verification on Embedded Platforms 1 Runtime Model Predictive Verification on Embedded Platforms 1 Pei Zhang, Jianwen Li, Joseph Zambreno, Phillip H. Jones, Kristin Yvonne Rozier Presenter: Pei Zhang Iowa State University peizhang@iastate.edu

More information

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC) Gilles Pokam and Cristiano Pereira (Intel) iacoma.cs.uiuc.edu

More information

Non-preemptive Fixed Priority Scheduling of Hard Real-Time Periodic Tasks

Non-preemptive Fixed Priority Scheduling of Hard Real-Time Periodic Tasks Non-preemptive Fixed Priority Scheduling of Hard Real-Time Periodic Tasks Moonju Park Ubiquitous Computing Lab., IBM Korea, Seoul, Korea mjupark@kr.ibm.com Abstract. This paper addresses the problem of

More information

Predictability of Least Laxity First Scheduling Algorithm on Multiprocessor Real-Time Systems

Predictability of Least Laxity First Scheduling Algorithm on Multiprocessor Real-Time Systems Predictability of Least Laxity First Scheduling Algorithm on Multiprocessor Real-Time Systems Sangchul Han and Minkyu Park School of Computer Science and Engineering, Seoul National University, Seoul,

More information