Cache Contention and Application Performance Prediction for Multi-Core Systems
- Marcia Baker
Transcription
1 Cache Contention and Application Performance Prediction for Multi-Core Systems. Chi Xu, Xi Chen, Robert P. Dick, Zhuoqing Morley Mao. University of Minnesota, University of Michigan. IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), March.
2 Motivation. Multiprocessor architectures (CMP) with shared last-level caches: + Inter-process communication; + Heterogeneous cache allocation; Contention.
3 Motivation. Multiprocessor architectures (CMP) with shared last-level caches: + Inter-process communication; + Heterogeneous cache allocation; Contention. Performance implications of core assignment.
[Figure 1: Impact of stressmark on performance of processes sharing cache with it. Axes: normalized execution time, cache misses per L2 access. Benchmarks: art, mcf, bzip2, swim, equake, mesa, vpr, ammp, mgrid, applu.]
Paper excerpt shown on the slide: "The rest of this paper is organized as follows. Section II presents related work. Sections III and IV motivate and describe CAMP. Section V introduces an automated way to characterize process memory access behavior to permit prediction of cache contention. Section VI presents and discusses the experimental validation process and results. Finally, Section VII summarizes our work. II. RELATED WORK. Past work [6], [7], [8], [9] has considered the problem of adjusting cache partitioning during run time after process assignment decisions have already been made. In contrast, the goal of our work is to predict the performance implications of process assignment decisions before execution. Other researchers have developed performance prediction models to guide process assignment. However, most [10], [...] addressed cache contention only for uniprocessors on which only a single process may run at a time. The move to CMPs will aggravate the cache contention problem since multiple processes can run on different cores simultaneously. Resource contention models for simultaneous multithread[...] models use the reuse distances and/or circular sequence profiles for each thread to predict inter-thread cache contention. These models require knowledge of the steady-state L2 cache access frequency of a process when concurrently running [...]"
4 Goal. Model cache contention. Easy and automatic. No modifications to existing hardware or operating system. No exhaustive offline simulation. Complementary to existing work.
5 Analytical Model: System. N-core processor. On-chip last-level L2 cache: set-associative (ways = lines per set), LRU replacement policy, shared among cores, no prefetching. Applications in steady state.
6 Analytical Model: Applications I. $\sum_{i=1}^{N} S_i = A$, where $N$ is the total number of processes, $S_i$ is the effective cache size of process $i$ (ways occupied by $i$), and $A$ is the associativity of the cache.
7 Analytical Model: Applications I. $\sum_{i=1}^{N} S_i = A$, where $N$ is the total number of processes, $S_i$ is the effective cache size of process $i$ (ways occupied by $i$), and $A$ is the associativity of the cache.
[Figure 2: Cache line reuse distance histogram for the mcf application. x-axis: reuse distance; y-axis: probability (%).]
Paper excerpt shown on the slide: "[...] execution time of art increased by 120% while that of [...] Reuse Distance: We define the reuse distance of a cache line j to be the number of distinct cache lines in the same set accessed between two consecutive accesses to line j. A reuse distance histogram represents the distribution of cache line reuse distances for an entire [...]. Given an A-way set-associative cache, Figure 2 shows the reuse distance histogram for the mcf application (see [...]). The x-axis shows the reuse distance and the y-axis shows the normalized frequencies of the associated reuse distances. The first bar in the histogram, i.e., $hist_1$, gives the probability that a most-recently-used line will be accessed again, [...] the last bar, i.e., $hist_{13+}$, gives the probability that [...] for the next cache access does not exist in the [...] most [...]"
8 Analytical Model: Applications I. $\sum_{i=1}^{N} S_i = A$, where $N$ is the total number of processes, $S_i$ is the effective cache size of process $i$ (ways occupied by $i$), and $A$ is the associativity of the cache. $MPA_i(S_i) = \int_{S_i}^{\infty} hist_i(x)\,dx$, where $MPA_i$ is the probability of a cache miss for process $i$ and $hist_i$ is the linear interpolation of the reuse distance histogram.
[Figure 2: Cache line reuse distance histogram for the mcf application. x-axis: reuse distance; y-axis: probability (%).]
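The miss-probability integral on this slide can be illustrated with a short Python sketch. It is not taken from the slides: the function name `miss_probability`, the example histogram, and the choice to spread each bar uniformly over one way (so that the integral becomes piecewise linear in S) are assumptions made for illustration.

```python
import math


def miss_probability(hist, s):
    """Approximate MPA(S): the probability mass of reuse distances beyond S.

    hist[d - 1] is the probability of reuse distance d; the final entry is the
    overflow bar (the "13+" style bar) and always counts as a miss.
    s is the effective cache size in ways and may be fractional.
    Assumption: each bar is spread uniformly over one way, so the
    cumulative hit mass is piecewise linear in s.
    """
    total = sum(hist)
    # Clamp s so the overflow bar can never be counted as a hit.
    s = max(0.0, min(float(s), len(hist) - 1))
    whole = int(math.floor(s))
    frac = s - whole
    hit_mass = sum(hist[:whole])
    if frac > 0:
        hit_mass += frac * hist[whole]
    return 1.0 - hit_mass / total


# Hypothetical histogram for a 3-way cache: distances 1, 2, 3 and a "4+" bar.
example_hist = [0.4, 0.2, 0.1, 0.3]
example_mpa = miss_probability(example_hist, 1.5)
```

With zero ways every access misses, and with all ways the miss probability falls to the mass of the overflow bar, which matches the role the slide gives to the last histogram bar.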
9 Analytical Model: Applications II. Cache accesses: $APS = API / SPI$, where APS is accesses per second, API is accesses per instruction (fixed for each application), and SPI is seconds per instruction.
10 Analytical Model: Applications II. Cache accesses: $APS = API / SPI$ and $SPI = \alpha \cdot MPA + \beta$, where APS is accesses per second, API is accesses per instruction (fixed for each application), SPI is seconds per instruction, $\alpha$ is the off-chip latency (memory, disk), and $\beta$ is the on-chip latency (computation).
11 Analytical Model: Applications II. Cache accesses: $APS = API / SPI$ and $SPI = \alpha \cdot MPA + \beta$, where APS is accesses per second, API is accesses per instruction (fixed for each application), SPI is seconds per instruction, $\alpha$ is the off-chip latency (memory, disk), and $\beta$ is the on-chip latency (computation). $G_i(n) = \sum_{s=1}^{n} (P_{s,n} \cdot s)$, where $G_i(n)$ is the effective cache size of process $i$ after $n$ accesses and $P_{s,n}$ is the probability of having $s$ cache lines after $n$ consecutive accesses.
12 Analytical Model: Applications II. Cache accesses: $APS = API / SPI$ and $SPI = \alpha \cdot MPA + \beta$, where APS is accesses per second, API is accesses per instruction (fixed for each application), SPI is seconds per instruction, $\alpha$ is the off-chip latency (memory, disk), and $\beta$ is the on-chip latency (computation). $G_i(n) = \sum_{s=1}^{n} (P_{s,n} \cdot s)$, where $G_i(n)$ is the effective cache size of process $i$ after $n$ accesses and $P_{s,n}$ is the probability of having $s$ cache lines after $n$ consecutive accesses. In steady state, $n = G_i^{-1}(S_i)$.
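The relations on this slide can be sketched in Python as follows. The slide does not say how $G_i$ is evaluated or inverted, so the sketch assumes only that $G_i$ is non-decreasing in $n$ and inverts it by bisection. The saturating curve `example_growth` and all numeric parameters are hypothetical stand-ins, not profiled values.

```python
import math


def seconds_per_instruction(alpha, beta, mpa):
    """SPI = alpha * MPA + beta, as stated on the slide."""
    return alpha * mpa + beta


def accesses_per_second(api, alpha, beta, mpa):
    """APS = API / SPI."""
    return api / seconds_per_instruction(alpha, beta, mpa)


def invert_monotonic(g, target, lo=0.0, hi=1e9, iters=200):
    """Bisection for n with g(n) close to target.

    Assumes g is non-decreasing on [lo, hi] and that target lies in its range.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def example_growth(n, ways=12.0, scale=100.0):
    """Hypothetical stand-in for G_i(n): occupancy that saturates at `ways`."""
    return ways * (1.0 - math.exp(-n / scale))


# Accesses needed to reach an effective cache size of 6 ways (n = G^-1(S)).
accesses_needed = invert_monotonic(example_growth, 6.0)
# Access rate for illustrative, made-up parameter values.
example_aps = accesses_per_second(api=0.02, alpha=1e-7, beta=5e-10, mpa=0.1)
```

Bisection is used here only because it needs nothing beyond monotonicity; a tabulated $G_i$ could equally be inverted by interpolation.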
13 Analytical Model: Applications III. At time $t$ there is a duration $T$ such that data accessed before $t - T$ are evicted from the cache, and data accessed during $[t - T, t]$ are present in the cache.
14 Analytical Model: Applications III. At time $t$ there is a duration $T$ such that data accessed before $t - T$ are evicted from the cache, and data accessed during $[t - T, t]$ are present in the cache. Assuming all processes are in steady state: $S_i = G_i(APS_i \cdot T)$.
15 Analytical Model: Applications III. At time $t$ there is a duration $T$ such that data accessed before $t - T$ are evicted from the cache, and data accessed during $[t - T, t]$ are present in the cache. Assuming all processes are in steady state: $S_i = G_i(APS_i \cdot T)$, so $APS_i = G_i^{-1}(S_i) / T$.
16 Analytical Model: Applications III. At time $t$ there is a duration $T$ such that data accessed before $t - T$ are evicted from the cache, and data accessed during $[t - T, t]$ are present in the cache. Assuming all processes are in steady state: $S_i = G_i(APS_i \cdot T)$, so $APS_i = G_i^{-1}(S_i) / T$. Combining with the reminder below: $APS_i = \frac{G_i^{-1}(S_i)}{T} = \frac{API_i}{\alpha_i \, MPA_i(S_i) + \beta_i}$. Reminder: $APS = API / SPI$ and $SPI = \alpha \cdot MPA + \beta$.
17 Analytical Model: Applications III. At time $t$ there is a duration $T$ such that data accessed before $t - T$ are evicted from the cache, and data accessed during $[t - T, t]$ are present in the cache. Assuming all processes are in steady state: $S_i = G_i(APS_i \cdot T)$, so $APS_i = G_i^{-1}(S_i) / T$. Combining with the reminder below: $APS_i = \frac{G_i^{-1}(S_i)}{T} = \frac{API_i}{\alpha_i \, MPA_i(S_i) + \beta_i}$, hence $T = \frac{G_i^{-1}(S_i) \, (\alpha_i \, MPA_i(S_i) + \beta_i)}{API_i}$. Reminder: $APS = API / SPI$ and $SPI = \alpha \cdot MPA + \beta$.
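18 Analytical Model: Applications III. At time $t$ there is a duration $T$ such that data accessed before $t - T$ are evicted from the cache, and data accessed during $[t - T, t]$ are present in the cache. Assuming all processes are in steady state: $S_i = G_i(APS_i \cdot T)$, so $APS_i = G_i^{-1}(S_i) / T$. Combining with the reminder below: $APS_i = \frac{G_i^{-1}(S_i)}{T} = \frac{API_i}{\alpha_i \, MPA_i(S_i) + \beta_i}$, hence $T = \frac{G_i^{-1}(S_i) \, (\alpha_i \, MPA_i(S_i) + \beta_i)}{API_i}$. Reminder: $APS = API / SPI$ and $SPI = \alpha \cdot MPA + \beta$. Because $T$ is the same for every process and $\sum_{i=1}^{N} S_i = A$, the effective cache sizes satisfy, for all $j = 1, \dots, N$: $\frac{G_1^{-1}(S_1)}{G_j^{-1}(S_j)} = \frac{API_1 \, (\alpha_j \, MPA_j(S_j) + \beta_j)}{API_j \, (\alpha_1 \, MPA_1(S_1) + \beta_1)}$ and $\sum_{i=1}^{N} S_i - A = 0$.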
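For two processes, the system on this slide reduces to finding $S_1$ such that both processes imply the same $T$, with $S_2 = A - S_1$. The Python sketch below solves this by bisection. It is an illustration under stated assumptions, not the authors' solver: the `Profile` class, the quadratic `g_inv` and the `1 / (1 + s)` miss curve are hypothetical stand-ins, and the method assumes the implied $T$ grows with the number of ways held.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Profile:
    """Hypothetical per-application feature vector used by this sketch."""
    api: float                        # accesses per instruction
    alpha: float                      # off-chip latency coefficient
    beta: float                       # on-chip latency term
    mpa: Callable[[float], float]     # miss probability as a function of S
    g_inv: Callable[[float], float]   # accesses needed to occupy S ways

    def window(self, s: float) -> float:
        """T implied by holding s ways: G^-1(S) * (alpha * MPA(S) + beta) / API."""
        return self.g_inv(s) * (self.alpha * self.mpa(s) + self.beta) / self.api


def split_cache(p1: Profile, p2: Profile, ways: float, iters: int = 100):
    """Find (S1, S2) with S1 + S2 = ways and equal implied T.

    Assumes window(s) is increasing in s for both profiles, so the
    difference T1(S1) - T2(ways - S1) changes sign exactly once.
    """
    lo, hi = 0.0, float(ways)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if p1.window(mid) < p2.window(ways - mid):
            lo = mid   # process 1 holds too few ways
        else:
            hi = mid
    s1 = (lo + hi) / 2.0
    return s1, ways - s1


def make_profile(api: float) -> Profile:
    """Build a profile from made-up curves; only api varies between examples."""
    return Profile(api=api, alpha=1e-7, beta=5e-10,
                   mpa=lambda s: 1.0 / (1.0 + s),
                   g_inv=lambda s: 50.0 * s * s)


# Two identical processes on a 12-way cache should split it evenly.
even_split = split_cache(make_profile(0.02), make_profile(0.02), 12)
# A process with a higher access rate ends up holding more ways.
skewed_split = split_cache(make_profile(0.02), make_profile(0.04), 12)
```

For more than two processes the same equal-$T$ condition applies pairwise, but a one-dimensional bisection no longer suffices and a multivariate root finder would be needed.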
19 Automated Profiling. Two processes run on separate cores and share an A-way last-level cache: one process uses $l$ ways and the other uses $A - l$ ways. Stressmark: a synthetic application with configurable cache occupation. Gather information on API, MPA, and SPI via hardware performance counters. Derive the reuse distance histogram, effective cache size (S), $\alpha$, and $\beta$, which together form an application-dependent feature vector.
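One way the derivation step could look in Python is sketched below. The slides do not state how the histogram, $\alpha$, and $\beta$ are computed from the counter data, so both functions are assumptions: `histogram_from_mpa` differences miss rates measured at successive way counts (which follows from defining MPA as the histogram mass beyond S), and `fit_alpha_beta` applies ordinary least squares to the relation SPI = α · MPA + β. All sample values are invented.

```python
def histogram_from_mpa(mpa_by_ways):
    """Recover reuse-distance bars from miss rates measured at 0..A ways.

    mpa_by_ways[l] is the miss probability when the process holds l ways,
    with mpa_by_ways[0] expected to be 1.0. Bar d equals MPA(d - 1) - MPA(d);
    the final overflow bar equals MPA(A).
    """
    bars = [mpa_by_ways[d - 1] - mpa_by_ways[d]
            for d in range(1, len(mpa_by_ways))]
    bars.append(mpa_by_ways[-1])
    return bars


def fit_alpha_beta(mpa_samples, spi_samples):
    """Least-squares fit of SPI = alpha * MPA + beta (an assumed method)."""
    n = len(mpa_samples)
    if n < 2 or n != len(spi_samples):
        raise ValueError("need at least two paired samples")
    mean_x = sum(mpa_samples) / n
    mean_y = sum(spi_samples) / n
    sxx = sum((x - mean_x) ** 2 for x in mpa_samples)
    if sxx == 0:
        raise ValueError("MPA samples must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(mpa_samples, spi_samples))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


# Invented measurements for a 3-way cache and a noiseless linear SPI model.
example_bars = histogram_from_mpa([1.0, 0.6, 0.4, 0.3])
example_mpa_samples = [0.1, 0.2, 0.4, 0.8]
example_spi_samples = [2.0 * m + 0.5 for m in example_mpa_samples]
example_alpha, example_beta = fit_alpha_beta(example_mpa_samples,
                                             example_spi_samples)
```

Each stressmark configuration contributes one (MPA, SPI) pair, so sweeping the stressmark across way counts yields the points used for both functions.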
20 Evaluation. Intel Core 2 Duo P8600 (2 × 2.4 GHz, 3 MB 12-way associative L2 cache). Mac OS X 10.5. Profiling via Shark at a period of 2 ms. Subset of SPEC CPU2000: 5 CPU-intensive + 5 memory-intensive applications. Each application is run 12 times for 10 s to determine its characteristics. Examine all 55 pairwise combinations.
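The count of 55 is consistent with pairing 10 benchmarks when a benchmark may also run against a second instance of itself. That reading is an assumption, as is the use of the benchmark names listed for the application profiles. A short Python check:

```python
from itertools import combinations_with_replacement

# Benchmark names as listed for the application profiles.
benchmarks = ["art", "mcf", "bzip2", "swim", "equake",
              "mesa", "vpr", "ammp", "mgrid", "applu"]

# Unordered pairs, including a benchmark paired with itself.
pairs = list(combinations_with_replacement(benchmarks, 2))
pair_count = len(pairs)
```

Here `pair_count` is 10 · 11 / 2: 45 pairs of distinct benchmarks plus 10 self-pairs.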
21 Application profiles.
[Table II: API, α, and β for different benchmarks (art, mcf, bzip2, swim, equake, mesa, vpr, ammp, mgrid, applu). Values not recovered.]
[Figure 3: Profiled cache miss rate corresponding to effective cache size. Panels: art, mcf, vpr, mesa, mgrid, swim, ammp, applu. y-axis: miss rate.]
Paper excerpt shown on the slide: "[...] proposed by Chandra et al. [5] requires the steady-state cache access frequency of a process to be known a priori. We see no practical way to accurately predetermine this value for [...] by AB, MB, and CAMP. AB and MB are not past work. They are in fact alternative prediction models we considered. Table III presents the average prediction error in cache [...]"
22 Prediction Accuracy.
[Table III: Prediction accuracy for cache misses and performance degradation. Models: CAMP, AB, MB. For each model, MPA and SPI columns report Error (%) and >5% (%). Rows: art, vpr, mcf, ammp, bzip, mesa, swim, equake, applu, mgrid, top 5 average, average. Values not recovered.]
Paper excerpt shown on the slide: "[...] also explains why memory-intensive benchmarks have [...] estimation error than CPU-intensive benchmarks. In [Table] III, the bottom 5 benchmarks are either CPU-intensive applications or streaming applications with constant high [...] rates, e.g., swim. Their performance estimation errors are [...] because it uses monotonic non-linear functions. This [...] significantly reduce computational cost when the number of cores is large. In addition, since the three models are based on estimating the effective cache sizes of two processes, [...] give the same results when two instances of art are running [...]"
23 Generality.
[Figure: miss rate for art under three cache configurations: 12-way 3M, 16-way 4M, and 24-way 6M.]
24 Conclusion. Summary: Predictive model of contention on shared last-level cache. Automated profiling and extraction of feature vector. No modification of hardware or operating system. Average error of <1.6%.
25 Conclusion. Summary: Predictive model of contention on shared last-level cache. Automated profiling and extraction of feature vector. No modification of hardware or operating system. Average error of <1.6%. Discussion: Varying input data. Benchmarking crimes. Generalisation. Practical application.
More informationarxiv: v1 [cs.os] 6 Jun 2013
Partitioned scheduling of multimode multiprocessor real-time systems with temporal isolation Joël Goossens Pascal Richard arxiv:1306.1316v1 [cs.os] 6 Jun 2013 Abstract We consider the partitioned scheduling
More informationCPU Consolidation versus Dynamic Voltage and Frequency Scaling in a Virtualized Multi-Core Server: Which is More Effective and When
1 CPU Consolidation versus Dynamic Voltage and Frequency Scaling in a Virtualized Multi-Core Server: Which is More Effective and When Inkwon Hwang, Student Member and Massoud Pedram, Fellow, IEEE Abstract
More informationTiming analysis and predictability of architectures
Timing analysis and predictability of architectures Cache analysis Claire Maiza Verimag/INP 01/12/2010 Claire Maiza Synchron 2010 01/12/2010 1 / 18 Timing Analysis Frequency Analysis-guaranteed timing
More informationModule 5: CPU Scheduling
Module 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 5.1 Basic Concepts Maximum CPU utilization obtained
More informationVMware VMmark V1.1 Results
Vendor and Hardware Platform: IBM System x3950 M2 Virtualization Platform: VMware ESX 3.5.0 U2 Build 110181 Performance VMware VMmark V1.1 Results Tested By: IBM Inc., RTP, NC Test Date: 2008-09-20 Performance
More informationChapter 6: CPU Scheduling
Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 6.1 Basic Concepts Maximum CPU utilization obtained
More informationONLINE SCHEDULING OF MALLEABLE PARALLEL JOBS
ONLINE SCHEDULING OF MALLEABLE PARALLEL JOBS Richard A. Dutton and Weizhen Mao Department of Computer Science The College of William and Mary P.O. Box 795 Williamsburg, VA 2317-795, USA email: {radutt,wm}@cs.wm.edu
More informationHigh Performance Computing
Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 2 nd appello February 11, 2015 Write your name, surname, student identification number (numero di matricola),
More informationDepartment of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.
Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm
More informationA Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!
More informationPERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.
More informationA Data Communication Reliability and Trustability Study for Cluster Computing
A Data Communication Reliability and Trustability Study for Cluster Computing Speaker: Eduardo Colmenares Midwestern State University Wichita Falls, TX HPC Introduction Relevant to a variety of sciences,
More informationA Novel Software Solution for Localized Thermal Problems
A Novel Software Solution for Localized Thermal Problems Sung Woo Chung 1,* and Kevin Skadron 2 1 Division of Computer and Communication Engineering, Korea University, Seoul 136-713, Korea swchung@korea.ac.kr
More informationSummarizing Measured Data
Summarizing Measured Data 12-1 Overview Basic Probability and Statistics Concepts: CDF, PDF, PMF, Mean, Variance, CoV, Normal Distribution Summarizing Data by a Single Number: Mean, Median, and Mode, Arithmetic,
More informationLecture 2: Metrics to Evaluate Systems
Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video
More informationSummarizing Measured Data
Performance Evaluation: Summarizing Measured Data Hongwei Zhang http://www.cs.wayne.edu/~hzhang The object of statistics is to discover methods of condensing information concerning large groups of allied
More informationFaster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs
Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Christopher P. Stone, Ph.D. Computational Science and Engineering, LLC Kyle Niemeyer, Ph.D. Oregon State University 2 Outline
More informationFormal Fault Analysis of Branch Predictors: Attacking countermeasures of Asymmetric key ciphers
Formal Fault Analysis of Branch Predictors: Attacking countermeasures of Asymmetric key ciphers Sarani Bhattacharya and Debdeep Mukhopadhyay Indian Institute of Technology Kharagpur PROOFS 2016 August
More informationRuntime Model Predictive Verification on Embedded Platforms 1
Runtime Model Predictive Verification on Embedded Platforms 1 Pei Zhang, Jianwen Li, Joseph Zambreno, Phillip H. Jones, Kristin Yvonne Rozier Presenter: Pei Zhang Iowa State University peizhang@iastate.edu
More informationCyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism
Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC) Gilles Pokam and Cristiano Pereira (Intel) iacoma.cs.uiuc.edu
More informationNon-preemptive Fixed Priority Scheduling of Hard Real-Time Periodic Tasks
Non-preemptive Fixed Priority Scheduling of Hard Real-Time Periodic Tasks Moonju Park Ubiquitous Computing Lab., IBM Korea, Seoul, Korea mjupark@kr.ibm.com Abstract. This paper addresses the problem of
More informationPredictability of Least Laxity First Scheduling Algorithm on Multiprocessor Real-Time Systems
Predictability of Least Laxity First Scheduling Algorithm on Multiprocessor Real-Time Systems Sangchul Han and Minkyu Park School of Computer Science and Engineering, Seoul National University, Seoul,
More information