Lecture 13: Cache Basics and Cache Performance
Computer Engineering 585, Fall 2002

What Is Memory Hierarchy?
A typical memory hierarchy today:
[Figure: memory hierarchy diagram]
Here we focus on L1/L2/L3 caches and main memory.

Why Memory Hierarchy?
[Figure: performance (log scale, 1 to 1000) versus year, 1980 to 2000]
  CPU (µproc): 60%/yr. ("Moore's Law")
  DRAM: 7%/yr.
  Processor-memory performance gap grows 50%/year
1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)

Generations of Microprocessors
Time of a full cache miss in instructions executed:
  1st Alpha: 340 ns / 5.0 ns = 68 clks x 2, or 136
  2nd Alpha: 266 ns / 3.3 ns = 80 clks x 4, or 320
  3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6, or 648
1/2X latency x 3X clock rate x 3X instr/clock => 4.5X

Area Costs of Caches
Processor: % area (approx. cost), % transistors (approx. power)
  Intel 80386: 0% area, 0% transistors
  Alpha 21164: 37% area, 77% transistors
  StrongArm SA110: 61% area, 94% transistors
  Pentium Pro: 64% area, 88% transistors (2 dies per package: Proc/I$/D$ + L2$)
  Itanium: 92%
Caches store redundant data only to close the performance gap.

What Exactly Is a Cache?
Small, fast storage used to improve average access time to slow memory; usually made of SRAM
Exploits locality: spatial and temporal
In computer architecture, almost everything is a cache!
Beyond architecture: file cache, browser cache, proxy cache
Here we focus on L1 and L2 caches (L3 optional) as buffers to main memory.

Example: 1 KB Direct Mapped Cache
Assume a cache of 2^N bytes with 2^K blocks and a block size of 2^M bytes, so N = M + K (number of blocks times block size).
The cache stores a tag, data, and a valid bit for each block.
[Figure: 32-bit block address split into fields]
  Bits 31-10: cache tag (example: 0x50), stored as part of the cache state
  Bits 9-5: cache index (example: 0x01)
  Bits 4-0: block offset (example: 0x00)
[Figure: cache array with entries 0 to 31, each holding a valid bit, a cache tag, and 32 bytes of cache data]
  Entry 0: Byte 0 ... Byte 31
  Entry 1: Byte 32 ... Byte 63 (tag 0x50)
  ...
  Entry 31: Byte 992 ... Byte 1023

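The field split in this example can be sketched in a few lines of Python. This is only an illustration under the slide's parameters (1 KB cache, 32-byte blocks, 32-bit addresses); the constant names and the function name split_address are hypothetical, not part of the lecture.

```python
# Assumed parameters taken from the slide: 1 KB cache, 32-byte blocks.
CACHE_BYTES = 1024
BLOCK_BYTES = 32
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                   # M = 5
INDEX_BITS = (CACHE_BYTES // BLOCK_BYTES).bit_length() - 1   # K = 5


def split_address(addr):
    """Split an address into (tag, index, offset) for a direct mapped cache."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Address built from the slide's example fields: tag 0x50, index 0x01, offset 0x00.
example = (0x50 << 10) | (0x01 << 5) | 0x00
print(split_address(example))  # (80, 1, 0), i.e. tag 0x50, index 1, offset 0
```
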
Four Questions About Cache Design
Block placement: Where can a block be placed?
Block identification: How to find a block in the cache?
Block replacement: If a new block is to be fetched, which of the existing blocks should be replaced (if there are multiple choices)?
Write policy: What happens on a write?

Where Can a Block Be Placed?
What is a block: divide the memory space into blocks, just as the cache is divided
  A memory block is the basic unit to be cached
Direct mapped cache: there is only one place in the cache to buffer a given memory block
N-way set associative cache: N places for a given memory block
  Like N direct mapped caches operating in parallel
  Reduces miss rates at the cost of increased complexity, cache access time, and power consumption
Fully associative cache: a memory block can be put anywhere in the cache

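The three placement schemes differ only in the number of ways, which a short Python sketch can make concrete. The function name candidate_frames, the modulo mapping from block address to set, and the set-major numbering of cache frames are assumptions chosen for illustration.

```python
def candidate_frames(block_addr, num_frames, ways):
    """Return the cache frame numbers where a memory block may be placed.

    ways == 1            -> direct mapped (exactly one candidate)
    ways == num_frames   -> fully associative (every frame is a candidate)
    anything in between  -> N-way set associative
    """
    num_sets = num_frames // ways
    set_index = block_addr % num_sets      # assumed mapping: block address mod number of sets
    return [set_index * ways + w for w in range(ways)]


# Memory block 12 in a cache with 8 block frames:
print(candidate_frames(12, 8, 1))  # direct mapped: [4]
print(candidate_frames(12, 8, 2))  # two-way set associative: [0, 1]
print(candidate_frames(12, 8, 8))  # fully associative: [0, 1, 2, 3, 4, 5, 6, 7]
```
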
Set Associative Cache
Example: two-way set associative cache
  The cache index selects a set of two blocks
  The two tags in the set are compared to the input tag in parallel
  Data is selected based on the tag comparison
Set associative or direct mapped? Discussed later.
[Figure: two-way set associative cache. Two banks, each with valid bit, cache tag, and cache data, are selected by the cache index; the address tag feeds two comparators whose outputs (Sel1, Sel0) drive a mux that picks the cache block and are ORed to produce Hit]

How to Find a Cached Block
Direct mapped cache: the stored tag for the cache block matches the input tag
Fully associative cache: any of the stored N tags matches the input tag
Set associative cache: any of the stored K tags for the cache set matches the input tag
Cache hit latency is decided by both tag comparison and data access

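A minimal Python sketch of the tag match follows. It assumes the cache state is a list of sets, each holding (valid, tag) pairs, and that the set index and tag come from a modulo and a division on the block address; the name lookup and this data layout are hypothetical. Hardware compares the tags of a set in parallel, whereas the loop here checks them one by one.

```python
def lookup(cache_sets, block_addr):
    """Return the way that hits, or None on a miss.

    cache_sets: list of sets; each set is a list of (valid, tag) entries.
    One entry per set models a direct mapped cache; a single set models
    a fully associative cache.
    """
    num_sets = len(cache_sets)
    set_index = block_addr % num_sets
    tag = block_addr // num_sets
    for way, (valid, stored_tag) in enumerate(cache_sets[set_index]):
        if valid and stored_tag == tag:
            return way
    return None


# Two-way set associative cache with 4 sets, initially empty.
sets = [[(False, 0), (False, 0)] for _ in range(4)]
sets[1][1] = (True, 5)            # block with tag 5 cached in set 1, way 1
print(lookup(sets, 5 * 4 + 1))    # 1 (hit in way 1)
print(lookup(sets, 1))            # None (same set, different tag)
```
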
Which Block to Replace?
Direct mapped cache: not an issue
For a set associative or fully associative cache:
  Random: select candidate blocks randomly from the cache set
  LRU (Least Recently Used): replace the block that has been unused for the longest time
  FIFO (First In, First Out): replace the oldest block
Usually LRU performs the best, but it is hard (and expensive) to implement

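LRU replacement within one cache set can be sketched in software with an ordered dictionary. The class name LRUSet is hypothetical, and this models behavior only; real hardware tracks recency with dedicated state bits rather than a dictionary.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (behavioral sketch only)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit; on a miss, install the tag, evicting the LRU block if the set is full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used block
        self.blocks[tag] = None
        return False


s = LRUSet(ways=2)
print([s.access(t) for t in "ABACB"])  # [False, False, True, False, False]
print(list(s.blocks))                  # ['C', 'B']
```

In the trace above, the access to C evicts B (A was used more recently), so the final access to B misses again and evicts A.
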
What Happens on Writes?
Where to write the data if the block is found in cache?
  Write through: new data is written to both the cache block and the lower-level memory
    Helps to maintain cache consistency
  Write back: new data is written only to the cache block
    Lower-level memory is updated when the block is replaced
    A dirty bit is used to indicate the necessity
    Helps to reduce memory traffic
What happens if the block is not found in cache?
  Write allocate: fetch the block into cache, then write the data (usually combined with write back)
  No-write allocate: do not fetch the block into cache (usually combined with write through)

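The traffic difference between the two policies can be illustrated with a toy model in Python. It assumes a cache holding a single block, write allocate for the write-back case, and a final flush of any dirty block; the function name lower_level_writes and the trace format (a list of block addresses being written) are hypothetical.

```python
def lower_level_writes(trace, policy):
    """Count writes that reach lower-level memory for a one-block cache.

    trace: sequence of block addresses that the processor writes to.
    policy: "write-through" or "write-back" (with write allocate).
    """
    if policy == "write-through":
        return len(trace)              # every store also updates lower-level memory
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")

    writes = 0
    cached, dirty = None, False
    for block in trace:
        if block != cached:            # write miss: allocate the new block
            if dirty:
                writes += 1            # the replaced dirty block is written back
            cached = block
        dirty = True                   # the store marks the cached block dirty
    return writes + (1 if dirty else 0)   # flush the last dirty block at the end


trace = [1, 1, 1, 2, 2]
print(lower_level_writes(trace, "write-through"))  # 5
print(lower_level_writes(trace, "write-back"))     # 2
```
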
Real Example: Alpha 21264 Caches
64 KB 2-way associative instruction cache
64 KB 2-way associative data cache
[Figure: I-cache and D-cache labeled]

Alpha 21264 Data Cache
D-cache: 64 KB, 2-way associative
[Figure: Alpha 21264 data cache organization]

Cache Performance
Calculate average memory access time (AMAT):
  AMAT = Hit time + Miss rate x Miss penalty
  Example: hit time 1 cycle, miss time 100 cycles, miss rate 4%; then AMAT = 1 + 100 x 4% = 5 cycles
Calculate cache impact on processor performance:
  CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time
  CPU time = IC x (CPI_execution + Memory stall cycles / Instruction) x Cycle time
Note: cycles spent on a cache hit are usually counted into execution cycles
If the clock cycle is identical, a better AMAT means better performance

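Both formulas translate directly into code. The sketch below is illustrative only; the function and parameter names are hypothetical, and all times are expressed in whatever unit the caller uses (cycles in the slide's example).

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


def cpu_time(ic, cpi_execution, stall_cycles_per_instr, cycle_time):
    """CPU time = IC x (CPI_execution + memory stall cycles per instruction) x cycle time."""
    return ic * (cpi_execution + stall_cycles_per_instr) * cycle_time


# Slide example: hit time 1 cycle, miss time 100 cycles, miss rate 4%.
print(amat(hit_time=1, miss_rate=0.04, miss_penalty=100))   # about 5 cycles
```
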
Example: Evaluating Split Inst/Data Cache
Unified vs. split inst/data cache (Harvard architecture)
[Figure: unified cache vs. split instruction/data cache organizations]
Example on page 406/407
Which design is better?

Disadvantage of Set Associative Cache
Compare an n-way set associative cache with a direct mapped cache:
  Has n comparators vs. 1 comparator
  Has extra MUX delay for the data
  Data comes after the hit/miss decision and set selection
In a direct mapped cache, the cache block is available before the hit/miss decision
  Use the data assuming the access is a hit; recover if found otherwise
[Figure: two-way set associative cache. Two banks, each with valid bit, cache tag, and cache data, are selected by the cache index; the address tag feeds two comparators whose outputs (Sel1, Sel0) drive a mux that picks the cache block and are ORed to produce Hit]

Example: Evaluating Set Associative Cache

Evaluating Cache Performance for Out-of-order Processors
Recall AMAT = hit time + miss rate x miss penalty
Very difficult to define miss penalty to fit in this simple model, in the context of OOO processors
We may assume a certain percentage of overlapping
Cache hit time can also be overlapped

Simple Example
Consider an OOO processor in the previous example (slide 18)
  Slow clock (1.25x base cycle time)
  Direct mapped cache
  Overlapping degree of 30%
Average miss penalty = 70% x 75 ns = 52.5 ns
AMAT = 1.0 x 1.25 + (0.014 x 52.5) = 1.99 ns
CPU time = IC x (2 x 1.0 x 1.25 + (1.5 x 0.014 x 52.5)) = 3.60 x IC
Compare: 3.58 for in-order + direct mapped, 3.63 for in-order + two-way associative
This is only a simplified example; ideal CPI could be improved by OOO execution
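The arithmetic of this example can be checked with a short Python sketch. The numbers come from the slide; the parameter names are hypothetical, and reading the factor 2 as the base CPI and 1.5 as memory accesses per instruction is an assumption about the earlier example rather than something stated here.

```python
def ooo_example(base_cycle_ns=1.0, clock_factor=1.25, miss_rate=0.014,
                full_miss_penalty_ns=75.0, overlap=0.30,
                base_cpi=2.0, mem_accesses_per_instr=1.5):
    """Return (average miss penalty, AMAT, CPU time per instruction), all in ns.

    base_cpi and mem_accesses_per_instr are assumed interpretations of the
    factors 2 and 1.5 that appear in the slide's CPU time formula.
    """
    cycle_ns = base_cycle_ns * clock_factor
    miss_penalty_ns = (1.0 - overlap) * full_miss_penalty_ns   # only the non-overlapped part stalls
    amat_ns = cycle_ns + miss_rate * miss_penalty_ns
    cpu_ns_per_instr = (base_cpi * cycle_ns
                        + mem_accesses_per_instr * miss_rate * miss_penalty_ns)
    return miss_penalty_ns, amat_ns, cpu_ns_per_instr


penalty, amat_ns, cpu_ns = ooo_example()
print(f"miss penalty {penalty:.1f} ns, AMAT {amat_ns:.3f} ns, CPU time {cpu_ns:.4f} ns x IC")
```

The unrounded values are 1.985 ns for AMAT and 3.6025 ns per instruction for CPU time, which the slide reports as 1.99 ns and 3.60 x IC.
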