Cache Contention and Application Performance Prediction for Multi-Core Systems

1 Cache Contention and Application Performance Prediction for Multi-Core Systems
Chi Xu, Xi Chen, Robert P. Dick, Zhuoqing Morley Mao
University of Minnesota, University of Michigan
IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), March

3 Motivation
Multiprocessor architectures (CMP) with shared last-level caches
+ Inter-process communication
+ Heterogeneous cache allocation
Contention
[Figure 1. Impact of stressmark on performance of processes sharing cache with it. Axes: Normalized Execution Time, Cache Misses per L2 Access. Benchmarks: art, mcf, bzip2, swim, equake, mesa, vpr, ammp, mgrid, applu.]
Paper excerpt: The rest of this paper is organized as follows. Section II presents related work. Sections III and IV motivate and describe CAMP. Section V introduces an automated way to characterize process memory access behavior to permit prediction of cache contention. Section VI presents and discusses the experimental validation process and results. Finally, Section VII summarizes our work.
II. RELATED WORK
Past work [6], [7], [8], [9] has considered the problem of adjusting cache partitioning during run time after process assignment decisions have already been made. In contrast, the goal of our work is to predict the performance implications of process assignment decisions before execution. Other researchers have developed performance prediction models to guide process assignment. However, most [10] addressed cache contention only for uniprocessors on which only a single process may run at a time. The move to CMPs will aggravate the cache contention problem since multiple processes can run on different cores simultaneously. ... models use the reuse distances and/or circular sequence profiles for each thread to predict inter-thread cache contention. These models require knowledge of the steady-state L2 cache access frequency of a process when concurrently running ...

4 Goal
Model cache contention
Easy and automatic
No modifications to existing hardware or operating system
No exhaustive offline simulation
Complementary to existing work

5 Analytical Model: System
N-core processor
On-chip last-level L2 cache
Set-associative (ways = lines per set)
LRU replacement policy
Shared among cores
No prefetching
Applications in steady state

8 Analytical Model: Applications I
$N$: total number of processes
$\sum_{i=1}^{N} S_i = A$
$S_i$: effective cache size of process i (ways occupied by i)
$A$: associativity of cache
$MPA_i(S_i) = \int_{S_i}^{\infty} hist_i(x)\,dx$
$MPA_i$: probability of a cache miss for process i
$hist_i$: linear interpolation of the reuse distance histogram
[Figure 2. Cache line reuse distance histogram for mcf application. x-axis: reuse distance; y-axis: probability (%).]
Paper excerpt: Reuse distance: we define the reuse distance of cache line j to be the number of distinct cache lines in the same set accessed between two consecutive accesses to line j. A reuse distance histogram represents the distribution of cache line reuse distances. Given an A-way set-associative cache, Figure 2 shows the reuse distance histogram for the mcf application. The x-axis shows the reuse distance and the y-axis shows the normalized frequencies of the associated reuse distances. The first bar in the histogram, i.e., $hist_1$, gives the probability that a most-recently-used line will be accessed again; the last bar, i.e., $hist_{13+}$, gives the probability that the line for the next cache access does not exist in the cache.
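
The miss-probability integral on this slide can be illustrated with a short Python sketch. It assumes the histogram is stored as a list of probabilities whose last entry is the overflow bucket (the "13+" bar in Figure 2), and it treats each bar as uniformly spread over one way so that fractional cache sizes interpolate linearly. The function name and the storage layout are illustrative choices, not taken from the slides or the paper.

```python
def miss_probability(hist, s):
    """Approximate MPA(S): the histogram mass at reuse distances beyond S.

    hist[k] is the probability of reuse distance k + 1; the final entry is
    the overflow bucket (accesses that miss even with the whole cache).
    Assumption: each bar is spread uniformly over one way, so a fractional
    effective cache size is handled by linear interpolation.
    """
    ways = len(hist) - 1              # associativity A
    if not 0 <= s <= ways:
        raise ValueError("effective cache size must lie in [0, A]")
    whole = int(s)                    # ways that are fully occupied
    frac = s - whole                  # fraction of the next way
    hits = sum(hist[:whole])
    if frac > 0:
        hits += frac * hist[whole]    # partial credit for the partial way
    return 1.0 - hits


# Toy 3-way histogram with an overflow bucket (made-up numbers).
toy_hist = [0.5, 0.2, 0.1, 0.2]
print(miss_probability(toy_hist, 1.5))
```

With no cache (S = 0) every access misses, and with the full cache (S = A) only the overflow bucket remains, which matches the two end bars described in the paper excerpt.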

12 Analytical Model: Applications II
Cache accesses
$APS = \frac{API}{SPI}$
$SPI = \alpha \cdot MPA + \beta$
$APS$: accesses per second
$API$: accesses per instruction (fixed for each application)
$SPI$: seconds per instruction
$\alpha$: off-chip latency (memory, disk)
$\beta$: on-chip latency (computation)
$G_i(n) = \sum_{s=1}^{n} (P_{s,n} \cdot s)$
Steady state: $n = G_i^{-1}(S_i)$
$G_i(n)$: effective cache size of process i after n accesses
$P_{s,n}$: probability of having s cache lines after n consecutive accesses
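
The slide defines G(n) through the probabilities P(s, n) but does not say how those probabilities are obtained. The Python sketch below fills that gap with an explicit assumption: a process holding s lines misses with probability MPA(s), and each miss adds exactly one line until the associativity is reached. That recurrence, and the helper names, are illustrative and are not stated in the slides.

```python
def effective_sizes(mpa, ways, max_n):
    """Return [G(0), G(1), ..., G(max_n)] under the assumed recurrence.

    mpa(s) is the miss probability when the process holds s lines.
    p[s] tracks P(s, n): the probability of holding s lines after n accesses.
    """
    p = [1.0] + [0.0] * ways            # before any access: zero lines held
    sizes = [0.0]
    for _ in range(max_n):
        nxt = [0.0] * (ways + 1)
        for s, prob in enumerate(p):
            miss = mpa(s) if s < ways else 0.0   # footprint is capped at A
            nxt[s] += prob * (1.0 - miss)        # hit: footprint unchanged
            if s < ways:
                nxt[s + 1] += prob * miss        # miss: one more line held
        p = nxt
        sizes.append(sum(s * prob for s, prob in enumerate(p)))
    return sizes


def inverse_effective_size(sizes, target):
    """Smallest n with G(n) >= target: a discrete stand-in for G^-1."""
    for n, g in enumerate(sizes):
        if g >= target:
            return n
    raise ValueError("target footprint not reached in the tabulated range")


# A process that always misses gains one line per access until the cache is full.
print(effective_sizes(lambda s: 1.0, 4, 6))
```

Tabulating G once and searching it is a simple way to evaluate the inverse needed in the steady-state condition; a real implementation would presumably interpolate between integer access counts.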

18 Analytical Model: Applications III
At time t there is a duration T such that data accessed
before $t - T$ are evicted from cache
during $[t - T, t]$ are present in cache
Assuming all processes are in steady state:
$S_i = G_i(APS_i \cdot T)$
$APS_i = \frac{G_i^{-1}(S_i)}{T} = \frac{API_i}{\alpha_i MPA_i(S_i) + \beta_i}$
$T = \frac{G_i^{-1}(S_i)\,(\alpha_i MPA_i(S_i) + \beta_i)}{API_i}$
Reminder: $APS = \frac{API}{SPI}$, $SPI = \alpha \cdot MPA + \beta$
Because T is the same for every process sharing the cache, the effective cache sizes satisfy
$\sum_{i=1}^{N} S_i = A$
$\forall j = 1, \dots, N: \frac{G_1^{-1}(S_1)}{G_j^{-1}(S_j)} = \frac{API_1\,(\alpha_j MPA_j(S_j) + \beta_j)}{API_j\,(\alpha_1 MPA_1(S_1) + \beta_1)}$ and $\sum_{i=1}^{N} S_i - A = 0$
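
For two processes, the equal-T condition reduces to a one-dimensional root search over the split S_1 + S_2 = A. The Python sketch below uses bisection. The closed forms chosen for MPA(S) and the inverse of G, and the parameters `tau` and `decay`, are hypothetical stand-ins for profiled curves; bisection is an illustrative solver, not necessarily the one used by the authors.

```python
import math
from dataclasses import dataclass


@dataclass
class Process:
    api: float     # accesses per instruction
    alpha: float   # SPI contribution per unit of miss probability
    beta: float    # SPI with no cache misses
    tau: float     # hypothetical footprint growth constant (in accesses)
    decay: float   # hypothetical decay rate of MPA with cache size

    def mpa(self, s):
        # Hypothetical miss curve: decays exponentially with cache size.
        return math.exp(-self.decay * s)

    def g_inv(self, s, ways):
        # Hypothetical inverse of G(n) = A * (1 - exp(-n / tau)).
        return -self.tau * math.log(1.0 - s / ways)

    def window(self, s, ways):
        # T = G^-1(S) * (alpha * MPA(S) + beta) / API
        spi = self.alpha * self.mpa(s) + self.beta
        return self.g_inv(s, ways) * spi / self.api


def split_cache(p1, p2, ways, tol=1e-9):
    """Find (S1, S2) with S1 + S2 = A and equal T for both processes."""
    lo, hi = 0.0, float(ways)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # Process 1 needs more cache while its implied T is still shorter.
        if p1.window(mid, ways) < p2.window(ways - mid, ways):
            lo = mid
        else:
            hi = mid
    s1 = 0.5 * (lo + hi)
    return s1, ways - s1


heavy = Process(api=0.05, alpha=2e-9, beta=5e-10, tau=1000.0, decay=0.2)
light = Process(api=0.01, alpha=2e-9, beta=5e-10, tau=1000.0, decay=0.2)
print(split_cache(heavy, light, ways=12))
```

The parameters are chosen so that T grows monotonically with S for each process, which bisection relies on. With more than two processes the same conditions form a system of N equations that would need a multi-dimensional solver.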

19 Automated Profiling
Two processes running on separate cores sharing an A-way last-level cache
One process uses $l$ ways, the other process uses $A - l$ ways
stressmark: synthetic application with configurable cache occupation
Gather information on API, MPA and SPI via hardware performance counters
Derive reuse distance histogram, effective cache size (S), α and β as an application-dependent feature vector
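
Since SPI = α · MPA + β is linear in MPA, one plausible way to derive α and β from the profiled samples is an ordinary least-squares line fit over the (MPA, SPI) pairs measured at different stressmark settings. The slides do not state which fitting procedure was used, so the Python sketch below is an assumption, and the sample values are made up.

```python
def fit_alpha_beta(mpa_samples, spi_samples):
    """Least-squares fit of SPI = alpha * MPA + beta.

    Each sample pair is assumed to come from one profiling run against the
    stressmark at a different cache occupation.
    """
    n = len(mpa_samples)
    if n != len(spi_samples) or n < 2:
        raise ValueError("need at least two matching (MPA, SPI) samples")
    mean_x = sum(mpa_samples) / n
    mean_y = sum(spi_samples) / n
    sxx = sum((x - mean_x) ** 2 for x in mpa_samples)
    if sxx == 0.0:
        raise ValueError("MPA samples must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(mpa_samples, spi_samples))
    alpha = sxy / sxx                  # slope: cost per unit miss probability
    beta = mean_y - alpha * mean_x     # intercept: SPI with no misses
    return alpha, beta


# Made-up samples lying exactly on SPI = 2e-7 * MPA + 5e-10.
mpa = [0.1, 0.2, 0.4, 0.8]
spi = [2e-7 * m + 5e-10 for m in mpa]
print(fit_alpha_beta(mpa, spi))
```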

20 Evaluation
Intel Core 2 Duo P8600 (2 × 2.4 GHz, 3 MB 12-way associative L2 cache)
Mac OS X 10.5
Profiling via Shark at a period of 2 ms
Subset of SPEC CPU2000: 5 CPU-intensive + 5 memory-intensive
Each application run 12 times for 10 s to determine characteristics
Examine all 55 pairwise combinations
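
The count of 55 is consistent with pairing the 10 benchmarks without regard to order and allowing a benchmark to run against a second instance of itself: 45 distinct pairs plus 10 self-pairs. The short Python sketch below checks that arithmetic; the benchmark names are those listed in the deck, and the helper name is illustrative.

```python
from itertools import combinations_with_replacement

BENCHMARKS = ["art", "mcf", "bzip2", "swim", "equake",
              "mesa", "vpr", "ammp", "mgrid", "applu"]


def co_run_pairs(benchmarks):
    """Unordered pairs, including a benchmark paired with a copy of itself."""
    return list(combinations_with_replacement(benchmarks, 2))


pairs = co_run_pairs(BENCHMARKS)
print(len(pairs))  # 55 = 10 * 11 / 2
```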

21 Application profiles
[Table II. API, α, and β for different benchmarks: art, mcf, bzip2, swim, equake, mesa, vpr, ammp, mgrid, applu.]
[Figure 3. Profiled cache miss rate corresponding to effective cache size. Panels: art, mcf, vpr, mesa, mgrid, swim, ammp, applu; y-axis: miss rate.]
Paper excerpt: ... proposed by Chandra et al. [5] requires the steady-state cache access frequency of a process to be known a priori. We see no practical way to accurately predetermine this value for ...
... by AB, MB, and CAMP. AB and MB are not past work. They are in fact alternative prediction models we considered. Table III presents the average prediction error in cache ...

22 Prediction Accuracy
[Table III. Prediction accuracy for cache misses and performance degradation. Columns: for each model (CAMP, AB, MB), MPA and SPI, each reported as Error (%) and >5% (%). Rows: art, vpr, mcf, ammp, bzip2, mesa, swim, equake, applu, mgrid, top 5 average, average.]
Paper excerpt: ... also explains why memory-intensive benchmarks have higher estimation error than CPU-intensive benchmarks. In Table III, the bottom 5 benchmarks are either CPU-intensive applications or streaming applications with constant high miss rates, e.g., swim. Their performance estimation errors are ...
... because it uses monotonic non-linear functions. This may significantly reduce computational cost when the number of cores is large. In addition, since the three models are based on estimating the effective cache sizes of two processes, they give the same results when two instances of art are running ...

23 Generality
[Figure: miss rate of art for 12-way 3M, 16-way 4M, and 24-way 6M cache configurations.]

25 Conclusion
Summary
Predictive model of contention on shared last-level cache
Automated profiling and extraction of feature vector
No modification of hardware or operating system
Average error of <1.6%
Discussion
Varying input data
Benchmarking crimes
Generalisation
Practical application
