THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY. Daniel Sanchez and Christos Kozyrakis Stanford University


1 THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY
Daniel Sanchez and Christos Kozyrakis, Stanford University
MICRO-43, December 6th, 2010

2 Executive Summary
- Mitigating the memory wall requires large, highly associative caches
  - Last-level caches take ~5% chip area, have ways in latest CMPs
  - More ways -> large energy, latency and area overheads
- ZCache: A highly associative cache with a low number of ways
  - Improves associativity by increasing number of replacement candidates
  - Retains low energy/hit, latency and area of caches with few ways
  - Based on skew-associative caches and cuckoo hashing
- Analytical framework explains why zcache works
  - Associativity depends on number of replacement candidates, not ways or locations a block can be in

3 Outline
- Introduction
- ZCache
- Analytical Framework
- Evaluation

4 Introduction
- Uses of high associativity:
  - Improve performance by reducing conflict misses
  - Partitioning, pinning, storing speculative data (e.g. TM, TLS)
- Increasing number of ways affects area, delay, energy
[Figure: increase over 4-way (%) in area, hit latency, and hit energy; IPC improvement vs 4-way (%)]

5 Techniques for high associativity (1/2): Increase number of locations
- Allow multiple locations per way
  - Column-associative caches [Agarwal93], set-balancing cache [Rolan09]
  - Affects: hit latency, hit energy
- Use a victim cache
  - VC [Jouppi90], Scavenger [Basu07]
  - Affects: area, hit latency, hit energy
- Use indirection in the tag array
  - IIC [Hallnor00], V-Way cache [Qureshi05]
  - Affects: area, hit latency, hit energy

6 Techniques for high associativity (2/2): Better hashing
- Use a hash function to index the cache
  - Simple hashing significantly reduces conflicts [Kharbutli04]
- Skew-associative caches [Seznec93]
  - Index each way using a different hash function
  - A line conflicts with a different set of lines on each way, reducing conflict misses
  - No sets, cannot use replacement policy that relies on set ordering
[Figure: hashed set-associative cache (line address -> hash function H -> one set index shared by Way 0, Way 1, Way 2) next to a skew-associative cache (line address -> H0, H1, H2 -> one index per way)]
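The per-way indexing of a skew-associative cache can be sketched in a few lines of Python. This is a minimal illustration, not the hashing used in the talk: the names `way_index`, `positions`, and `SkewCache` are hypothetical, the sizes are arbitrary, and a keyed BLAKE2 digest stands in for the hash functions H0, H1, H2 drawn on the slide.

```python
import hashlib

NUM_WAYS = 3        # three ways, as drawn on the slide
LINES_PER_WAY = 8   # illustrative size, not taken from the slides


def way_index(line_addr: int, way: int) -> int:
    """Index of line_addr inside `way`; a stand-in for the per-way hashes H0, H1, H2."""
    data = bytes([way]) + line_addr.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % LINES_PER_WAY


def positions(line_addr: int):
    """The single (way, index) slot where line_addr may live in each way."""
    return [(way, way_index(line_addr, way)) for way in range(NUM_WAYS)]


class SkewCache:
    def __init__(self):
        self.tags = [[None] * LINES_PER_WAY for _ in range(NUM_WAYS)]

    def lookup(self, line_addr: int) -> bool:
        # One probe per way, each at a different index.
        return any(self.tags[w][i] == line_addr for w, i in positions(line_addr))

    def candidates(self, line_addr: int):
        # Blocks that conflict with line_addr: one per way, no common "set".
        return [self.tags[w][i] for w, i in positions(line_addr)]

    def fill(self, line_addr: int, way: int):
        # Install line_addr in the chosen way and return the block it displaces.
        idx = way_index(line_addr, way)
        evicted = self.tags[way][idx]
        self.tags[way][idx] = line_addr
        return evicted
```

Because two lines that collide in one way usually land on different indexes in the other ways, there is no fixed set of lines to order, which is why set-ordered replacement policies do not apply directly.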

7 Outline
- Introduction
- ZCache
- Analytical Framework
- Evaluation

8 The ZCache Design
- Lookups and hits happen as in a skew-associative cache
[Figure: line address hashed by H0, H1, H2 into one index per way (Way 0, Way 1, Way 2)]
- Misses exploit the multiple hash functions to obtain an arbitrarily large number of replacement candidates
  - Phase 1: Walk the tag array, get best candidate
  - Phase 2: Move a few lines to fit everything
- This happens infrequently (on misses) and off the critical path
- Draws on prior research in cuckoo hashing
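The two-phase miss handling on this slide can be sketched as follows. This is a simplified software model under stated assumptions, not the hardware described in the paper: the `ZCache` class and `Node` tuple are hypothetical names, a BLAKE2 digest stands in for the per-way hash functions, a logical clock stands in for the replacement policy (LRU), and slots already seen during the walk are skipped so that a relocation path never visits the same slot twice.

```python
import hashlib
from collections import namedtuple

# parent is an index into the walk list, or None for first-level candidates
Node = namedtuple("Node", "way idx parent")


class ZCache:
    def __init__(self, ways=3, lines_per_way=16, levels=2):
        self.ways, self.lines, self.levels = ways, lines_per_way, levels
        self.tags = [[None] * lines_per_way for _ in range(ways)]
        self.last_use = {}   # block -> logical time of last access (LRU stand-in)
        self.clock = 0

    def _index(self, addr, way):
        data = bytes([way]) + addr.to_bytes(8, "little")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.lines

    def lookup(self, addr):
        """Hit path: one probe per way, as in a skew-associative cache."""
        return any(self.tags[w][self._index(addr, w)] == addr
                   for w in range(self.ways))

    def _walk(self, addr):
        """Phase 1: breadth-first walk of the tag array, collecting candidates."""
        nodes = [Node(w, self._index(addr, w), None) for w in range(self.ways)]
        seen = {(n.way, n.idx) for n in nodes}
        start = 0
        for _ in range(self.levels - 1):
            end = len(nodes)
            for p in range(start, end):
                block = self.tags[nodes[p].way][nodes[p].idx]
                if block is None:
                    continue  # an empty slot has nothing to relocate
                for w in range(self.ways):
                    slot = (w, self._index(block, w))
                    if w != nodes[p].way and slot not in seen:
                        seen.add(slot)
                        nodes.append(Node(w, slot[1], p))
            start = end
        return nodes

    def access(self, addr):
        """Returns (hit, evicted_block)."""
        self.clock += 1
        if self.lookup(addr):
            self.last_use[addr] = self.clock
            return True, None
        nodes = self._walk(addr)

        def last_used(n):
            block = self.tags[n.way][n.idx]
            return -1 if block is None else self.last_use[block]

        victim = min(nodes, key=last_used)   # empty slot first, else LRU block
        evicted = self.tags[victim.way][victim.idx]
        # Phase 2: shift each ancestor's block one step toward the victim slot,
        # then place the incoming block in the freed first-level slot.
        node = victim
        while node.parent is not None:
            parent = nodes[node.parent]
            self.tags[node.way][node.idx] = self.tags[parent.way][parent.idx]
            node = parent
        self.tags[node.way][node.idx] = addr
        self.last_use[addr] = self.clock
        self.last_use.pop(evicted, None)
        return False, evicted
```

Each relocation is legal because a child slot in the walk is, by construction, the position of its parent's block in another way. Skipping repeated slots means the walk can return fewer candidates than the ideal count for a given number of ways and levels.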

9 ZCache Replacement
[Figure: 3-way zcache (Way 0, Way 1, Way 2) indexed by H0, H1, H2; a lookup for address Y misses]
- Start replacement process while fetching Y

10 ZCache Replacement
[Figure: the three locations that Y hashes to hold A, D, and M]

11 ZCache Replacement
[Figure: the other locations where A can be placed hold K and X]
- Instead of evicting A, can move it and evict K or X

12 ZCache Replacement
[Figure: tag array with the walk for address Y]
- 1st-level candidates: A, D, M

13 ZCache Replacement
[Figure: tag array with the walk for address Y]
- 1st-level candidates: A, D, M
- 2nd-level candidates: B, K, X, P, Z, S

14 ZCache Replacement
[Figure: tag array next to the walk tree: Y; A, D, M; B, K, X, P, Z, S]

15 ZCache Replacement
[Figure: walk tree with three levels of candidates: A, D, M; K, X, B, S, P, Z; L, M, N, E, T, X, F, K, E, Q, G, R]
- Victim chosen by replacement policy (e.g. LRU block)

16 ZCache Replacement
[Figure: walk tree and tag array during the relocations]

17 ZCache Replacement
[Figure: walk tree and tag array after the relocations, with Y installed]

18 ZCache Replacement
[Figure: a lookup probes one location per way through H0, H1, H2 and hits]
- Hits always take a single lookup

19 ZCache Implementation Overview
- Replacements take place:
  - Off the critical path
  - Concurrently with other operations
- Walk accesses are pipelined
  - Do not saturate tag bandwidth in practice
  - No effect on hit latency
- Energy per miss mostly determined by walk
  - Similar to set-associative cache of same associativity
- Cheap to implement
  - SRAM with 1s of bits to track candidates
  - Leverages existing MSHRs
- See paper for more details

20 Number of Candidates
- An L-level walk on a W-way zcache gets R candidates:
  $R = W \sum_{n=0}^{L-1} (W-1)^n$
- Few ways (W=4) give many candidates with shallow walks
- Ratio of tag bandwidth vs bandwidth of next level limits number of candidates
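As a quick check of the candidate count, the sketch below reads the slide's formula as R = W * sum over n from 0 to L-1 of (W-1)^n: each of the W first-level candidates can move to W-1 other ways, and so on for each further level. The function name `replacement_candidates` is hypothetical.

```python
def replacement_candidates(ways: int, levels: int) -> int:
    """Candidates reached by an L-level walk on a W-way zcache (no repeats assumed)."""
    return ways * sum((ways - 1) ** n for n in range(levels))


for levels in (1, 2, 3):
    print(levels, replacement_candidates(4, levels))
# prints:
# 1 4
# 2 16
# 3 52
```

With W=4, two and three levels give 16 and 52 candidates, which matches the "Z 4/16" and "Z 4-way/52-rc" configurations named in the evaluation slides.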

21 Outline
- Introduction
- ZCache
- Analytical Framework
- Evaluation

22 An Analytical Associativity Framework
- Comparing associativity across cache designs is hard
  - Ways do not mean much
  - Conflict misses are workload and architecture-specific
- Goals
  - Find a general way to characterize associativity
  - Analyze what determines the performance of a zcache

23 General Cache Model
- Cache array:
  - Holds tags and data
  - Implements associative lookup by address
  - On a replacement, gives list of replacement candidates
  - Model assumes nothing about array organization
- Replacement policy:
  - Maintains a global rank of which cache blocks to replace
  - All policies conceptually do (LRU, LFU, OPT, ...)
  - Implementation does not need to

24 Associativity Distribution
- Eviction priority: Rank of a block given by the replacement function, normalized to [0,1]
  - Higher is better to evict
- Associativity distribution: Probability distribution of the eviction priorities of evicted blocks
  - Higher associativity -> distribution more skewed towards 1
- Measures how well the array does, not the replacement policy
  - For good performance, replacement policy also needs to do a good job!

25 Uniformity Assumption
- If the cache array gives R replacement candidates with uniformly distributed priorities,
  $E_1, \ldots, E_R \sim \text{i.i.d. } U[0,1]$
  $A = \max\{E_1, \ldots, E_R\}$
  $F_A(x) = P(A \le x) = x^R, \quad x \in [0,1]$
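The closed form under the uniformity assumption, P(A <= x) = x^R for the maximum A of R independent U[0,1] priorities, can be checked numerically. The sketch below is a small Monte Carlo estimate; the function name `empirical_cdf_of_max`, the trial count, and the seed are illustrative choices, not part of the talk.

```python
import random


def empirical_cdf_of_max(r: int, x: float, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate P(max of r i.i.d. U[0,1] draws <= x) by simulation."""
    rng = random.Random(seed)
    below = sum(max(rng.random() for _ in range(r)) <= x for _ in range(trials))
    return below / trials


for r in (4, 16, 52):
    # Compare the simulated value with the analytical x**r at x = 0.9.
    print(r, empirical_cdf_of_max(r, 0.9), 0.9 ** r)
```

The more candidates the array returns, the less probability mass is left below any given x, so the evicted block is more likely to be one the replacement policy ranks near the top.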

26 Associativity Distributions in Practice
[Figure: measured associativity distributions compared with the uniformity assumption (UA)]
- Set-associative caches do significantly worse than UA
- Hashing (H3) improves associativity, but is still noticeably worse than UA

27 Associativity Distributions for ZCaches
- Skew-associative caches (1-level zcaches) are very close to UA
- Increasing candidates but not ways still yields distributions very close to UA

28 Analytical Framework: Conclusions
- In caches with good hashing, the number of replacement candidates R determines associativity
- ZCaches provide large number of candidates with few ways
  - Decouple ways and associativity

29 Outline
- Introduction
- ZCache
- Analytical Framework
- Evaluation

30 Methodology
- Infrastructure:
  - CACTI-based models for cache cost estimates
  - McPAT for full-CMP area, power estimations
  - Microarchitectural simulation with Pin-based simulator
- Target system:
  - 32 in-order x86-64 cores (single-issue, 2GHz, 32KB I/D L1s)
  - Fully shared L2, 8MB, 8 x 1MB banks (set-assoc/zcache)
  - All L2 caches use hashing (H3)
- 72 workloads:
  - Multithreaded: PARSEC, SPECOMP
  - Multiprogrammed: SPECCPU2006
- See paper for more details

31 Cache Costs
[Table: area (mm2), hit latency (ns), hit energy (nJ), and miss energy (nJ) for the SA 4-way, SA 16-way, SA 32-way, Z 4/16, and Z 4/ designs]
- Each design is optimized for area*latency*energy
- ZCaches:
  - Retain area, hit latency, hit energy of a 4-way SA cache
  - Energy per miss comparable to similarly-associative SA cache

32 Performance and Energy-Efficiency
[Figure: IPC improvement vs 4-way (%) and BIPS/W improvement vs 4-way (%) for SetAssoc 32-way and Z 4-way/52-rc; x-axis entries: ammp_m, rand, cactusadm, gmean (72), gmean (1)]

33 Conclusions
- ZCaches enable efficient highly-associative caches
  - Low number of ways
  - Associativity gained by increasing replacement candidates
  - Costs of high associativity (energy, tag bandwidth) paid only on misses
- Analytical framework shows that replacement candidates determine associativity

34 THANK YOU FOR YOUR ATTENTION QUESTIONS?

35 Backup: Replacement Timeline
[Figure: timeline of a miss. Rows: address for read/write, tag port out/in, and data port out/in for Way 0, Way 1, Way 2, plus the memory bus. Phases: miss, walk, relocations. Memory bus events: fetch on miss, writeback (if needed), miss response]

36 Backup: LRU with coarse-grain timestamps
- Tag each block with a global timestamp counter
  - 8-bit timestamp per tag
- Increment timestamp every k=5% accesses
- Wraparounds are rare
[Figure: distribution of tag timestamps relative to the current timestamp, on a timestamp axis up to 255]
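One way to read this slide is as the bookkeeping sketched below. It is an illustration under assumptions: the class name `CoarseLRU` and the parameter `accesses_per_tick` (the slide's k) are hypothetical, and computing a block's age by modular subtraction is one possible way to compare 8-bit timestamps across a counter wrap, not necessarily the paper's.

```python
TS_BITS = 8
TS_MOD = 1 << TS_BITS   # 8-bit timestamps take values 0..255


class CoarseLRU:
    def __init__(self, accesses_per_tick: int):
        self.k = accesses_per_tick   # hypothetical parameter: accesses per counter increment
        self.current_ts = 0          # global timestamp counter
        self.count = 0               # accesses since the last increment
        self.ts = {}                 # block -> timestamp stored with its tag

    def touch(self, block):
        """Stamp the block with the current timestamp and count the access."""
        self.ts[block] = self.current_ts
        self.count += 1
        if self.count == self.k:
            self.count = 0
            self.current_ts = (self.current_ts + 1) % TS_MOD

    def age(self, block) -> int:
        # Modular distance stays correct across one counter wrap.
        return (self.current_ts - self.ts[block]) % TS_MOD

    def victim(self, candidates):
        """Among replacement candidates, pick the block with the oldest timestamp."""
        return max(candidates, key=self.age)


lru = CoarseLRU(accesses_per_tick=2)
for block in ("a", "b", "c", "d"):
    lru.touch(block)
print(lru.victim(["a", "c"]))   # prints "a": it carries the older timestamp
```

Blocks stamped in the same interval share a timestamp, so the ordering is coarser than true LRU; that is the trade made for storing only 8 bits per tag.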
