Cache-Oblivious Algorithms
Cache-Oblivious Model
The Unknown Machine

Algorithm → C program → gcc → object code → linux → execution: can be executed on machines with a specific class of CPUs.

Algorithm → Java program → javac → Java bytecode → java → interpretation: can be executed on any machine with a Java interpreter.

Goal: develop algorithms that are optimized w.r.t. memory hierarchies without knowing the parameters.
Cache-Oblivious Model

[Figure: CPU, memory of size M with blocks of size B, I/O channel, disk]

The I/O model, except that algorithms do not know the parameters B and M; an optimal off-line cache replacement strategy is assumed.

Frigo et al. 1999
Justification of the Ideal-Cache Model

Optimal replacement: LRU with a cache of twice the size incurs at most twice the cache misses of the optimal off-line strategy (Sleator and Tarjan, 1985). Corollary: if T_{M,B}(N) = O(T_{2M,B}(N)), then the number of cache misses using LRU is O(T_{M,B}(N)).

Two memory levels: an optimal cache-oblivious algorithm satisfying T_{M,B}(N) = O(T_{2M,B}(N)) incurs an optimal number of cache misses on each level of a multilevel cache using LRU.

Fully associative cache: can be simulated on a direct-mapped cache by explicit memory management; a dictionary (2-universal hash functions) of the cache lines kept in memory gives expected O(1) access time to a cache line.
Matrix Multiplication
Matrix Multiplication

Problem: C = A·B, where c_ij = Σ_{k=1..N} a_ik·b_kj

Layouts of an N×N matrix: row major, column major, s×s-blocked, bit interleaved.

[Figure: element numbering of an 8×8 matrix under the row major, column major, 4×4-blocked, and bit-interleaved layouts]
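The three non-trivial layouts are determined by their index functions, mapping a position (i, j) to an offset in the linear array. A minimal sketch in Python (the function names are illustrative, not from the slides):

```python
def row_major(i, j, n):
    """Index of element (i, j) in a row-major n x n matrix."""
    return i * n + j

def blocked(i, j, n, s):
    """Index of (i, j) when the matrix is split into s x s blocks,
    blocks stored row-major, elements row-major inside each block."""
    block = (i // s) * (n // s) + (j // s)        # which block
    return block * s * s + (i % s) * s + (j % s)  # offset inside it

def bit_interleaved(i, j, n):
    """Z-order (Morton) index: interleave the bits of i and j,
    with i's bits in the odd positions."""
    idx = 0
    for b in range(n.bit_length()):
        idx |= ((i >> b) & 1) << (2 * b + 1)
        idx |= ((j >> b) & 1) << (2 * b)
    return idx
```

The outputs match the 8×8 numbering shown on the slide, e.g. `blocked(0, 4, 8, 4)` gives 16 and `bit_interleaved(0, 4, 8)` gives 16.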
Algorithm 1: Nested loops (row major)

for i = 1 to N
    for j = 1 to N
        c_ij = 0
        for k = 1 to N
            c_ij = c_ij + a_ik·b_kj

Reading a column of B uses N I/Os, so the total is O(N³) I/Os.

Algorithm 2: Blocked algorithm (cache-aware)

Partition A and B into blocks of size s×s, where s = Θ(√M).
Apply Algorithm 1 to the N/s × N/s matrices whose elements are s×s matrices.

If the layout is s×s-blocked, or row major and M = Ω(B²):

O((N/s)³ · s²/B) = O(N³/(s·B)) = O(N³/(B·√M)) I/Os

Optimal (Hong & Kung, 1981)
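The blocked algorithm can be sketched as follows; `blocked_matmul` is a hypothetical name, and on a real machine the block size s would be tuned to Θ(√M):

```python
def blocked_matmul(A, B, s):
    """Cache-aware blocked multiply: treat the n x n matrices as
    (n/s) x (n/s) matrices of s x s blocks, so the three blocks
    touched at a time fit in a cache of size M = Theta(s*s)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, s):           # block row of C
        for bj in range(0, n, s):       # block column of C
            for bk in range(0, n, s):   # C block += A block * B block
                for i in range(bi, min(bi + s, n)):
                    for k in range(bk, min(bk + s, n)):
                        a = A[i][k]
                        for j in range(bj, min(bj + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```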
Algorithm 3: Recursive algorithm (cache-oblivious)

[ A11 A12 ] [ B11 B12 ]   [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
[ A21 A22 ] [ B21 B22 ] = [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

8 recursive N/2 × N/2 matrix multiplications + 4 N/2 × N/2 matrix sums.

#I/Os, if bit interleaved, or row major and M = Ω(B²):

T(N) = O(N²/B)               if N ≤ ε·√M
T(N) = 8·T(N/2) + O(N²/B)    otherwise

T(N) = O(N³/(B·√M))

Optimal (Hong & Kung, 1981); non-square matrices: Frigo et al., 1999.
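The recursion can be sketched as below, assuming N is a power of two. Quadrants are copied for clarity; a real implementation would recurse on index ranges over a bit-interleaved layout instead:

```python
def rec_matmul(A, B):
    """Cache-oblivious multiply: split each N x N matrix into four
    N/2 x N/2 quadrants, do 8 recursive multiplications and 4 sums."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(rec_matmul(A11, B11), rec_matmul(A12, B21))
    C12 = add(rec_matmul(A11, B12), rec_matmul(A12, B22))
    C21 = add(rec_matmul(A21, B11), rec_matmul(A22, B21))
    C22 = add(rec_matmul(A21, B12), rec_matmul(A22, B22))
    # glue the four result quadrants back together
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```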
Algorithm 4: Strassen's algorithm (cache-oblivious)

7 recursive N/2 × N/2 matrix multiplications + O(1) matrix sums:

[ C11 C12 ]   [ A11 A12 ] [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] [ B21 B22 ]

m1 := (a21 + a22 − a11)(b22 − b12 + b11)        c11 := m2 + m3
m2 := a11·b11                                   c12 := m1 + m2 + m5 + m6
m3 := a12·b21                                   c21 := m1 + m2 + m4 − m7
m4 := (a11 − a21)(b22 − b12)                    c22 := m1 + m2 + m4 + m5
m5 := (a21 + a22)(b12 − b11)
m6 := (a12 − a21 + a11 − a22)·b22
m7 := a22·(b11 + b22 − b12 − b21)

#I/Os, if bit interleaved, or row major and M = Ω(B²):

T(N) = O(N²/B)               if N ≤ ε·√M
T(N) = 7·T(N/2) + O(N²/B)    otherwise

T(N) = O(N^(log₂ 7) / (B·M^(log₂ 7 / 2 − 1))),   log₂ 7 ≈ 2.81
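The seven-product scheme can be sketched as below, assuming N is a power of two and recursing down to 1×1 for simplicity (a practical version would switch to the classical algorithm at a cutoff):

```python
def strassen(A, B):
    """Strassen's algorithm with the seven products m1..m7 above."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    sub = lambda X, Y: [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    a11, a12, a21, a22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    b11, b12, b21, b22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    m1 = strassen(sub(add(a21, a22), a11), add(sub(b22, b12), b11))
    m2 = strassen(a11, b11)
    m3 = strassen(a12, b21)
    m4 = strassen(sub(a11, a21), sub(b22, b12))
    m5 = strassen(add(a21, a22), sub(b12, b11))
    m6 = strassen(sub(add(sub(a12, a21), a11), a22), b22)
    m7 = strassen(a22, sub(sub(add(b11, b22), b12), b21))
    c11 = add(m2, m3)
    c12 = add(add(add(m1, m2), m5), m6)
    c21 = sub(add(add(m1, m2), m4), m7)
    c22 = add(add(add(m1, m2), m4), m5)
    return [r1 + r2 for r1, r2 in zip(c11, c12)] + \
           [r1 + r2 for r1, r2 in zip(c21, c22)]
```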
Cache-Oblivious Search Trees
Static Cache-Oblivious Trees

Recursive memory layout (van Emde Boas layout): cut the tree of height h at half height; store the top tree A of height h/2, followed by the bottom trees B1, ..., Bk of height h/2, each laid out recursively.

Degree O(1).
Searches use O(log_B N) I/Os (Prokop 1999).
Range reportings use O(log_B N + k/B) I/Os.
Best possible: (log₂ e + o(1))·log_B N (Bender, Brodal, Fagerberg, Ge, He, Hu, Iacono, López-Ortiz 2003).
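A small sketch of the van Emde Boas layout: list the nodes of a complete binary tree of height h (in levels), identified by their BFS numbers (root = 1, children 2v and 2v+1), in vEB memory order. Splitting at h//2 is one possible convention; the function name is mine:

```python
def veb_order(h, root=1):
    """van Emde Boas order of a complete binary tree with h levels:
    lay out the top tree, then each bottom tree, recursively."""
    if h == 1:
        return [root]
    top, bot = h // 2, h - h // 2
    order = veb_order(top, root)
    # bottom trees hang off the children of the top tree's leaves
    first_leaf = root << (top - 1)
    for leaf in range(first_leaf, first_leaf + (1 << (top - 1))):
        order += veb_order(bot, 2 * leaf)
        order += veb_order(bot, 2 * leaf + 1)
    return order
```

For example, `veb_order(4)` places the 15 nodes as root, its two children, then the four bottom subtrees of three nodes each.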
Dynamic Cache-Oblivious Trees

Embed a dynamic tree of small height into a complete tree in the static van Emde Boas layout; rebuild the data structure whenever N doubles or halves.

[Figure: a tree with keys 1, 3, 4, 5, 6, 7, 8, 10, 11, 13 embedded in a complete tree]

Search: O(log_B N)
Range reporting: O(log_B N + k/B)
Updates: O(log_B N + log² N / B)

Brodal, Fagerberg, Jacob 2001
Example

[Figure: the tree with keys 1, 3, 4, 5, 6, 7, 8, 10, 11, 13 and its embedding into a complete tree with root 6]
Binary Trees of Small Height

[Figure: inserting the new key 2 into the tree with keys 1, 3, 4, 5, 6, 7, 8, 10, 11, 13]

If an insertion causes non-small height, rebuild the subtree at the nearest ancestor with sufficiently few descendants.

Insertions require amortized time O(log² N) (Andersson and Lai 1990).
Binary Trees of Small Height

For each level i there is a threshold τ_i, with 0 < τ_L = τ_0 < τ_1 < ⋯ < τ_H = τ_U < 1 (e.g. evenly spaced, τ_i = τ_L + i·(τ_U − τ_L)/H).

For a node v_i on level i, define the density

ρ(v_i) = (#nodes below v_i) / m_i,

where m_i = #possible nodes below v_i with depth at most H.

Insertion: insert the new element; if its depth exceeds H, locate the nearest ancestor v_i with ρ(v_i) ≤ τ_i and rebuild the subtree at v_i to have minimum height, with the elements distributed evenly between the left and right subtrees.

Andersson and Lai 1990
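The search for the rebuild ancestor can be sketched as follows, assuming evenly spaced thresholds; the function and its calling convention are illustrative only, not from the paper:

```python
def find_rebuild_ancestor(subtree_sizes, H, tau_L=0.5, tau_U=1.0):
    """After an insertion overflows depth H, walk up from the leaf and
    return the depth of the nearest ancestor v_i whose density rho(v_i)
    is within its threshold tau_i. subtree_sizes[i] = #elements in the
    subtree of the ancestor at depth i (0 = root)."""
    for i in range(len(subtree_sizes) - 1, -1, -1):  # deepest ancestor first
        m_i = 2 ** (H - i + 1) - 1                   # capacity within depth H
        tau_i = tau_L + i * (tau_U - tau_L) / H      # evenly spaced thresholds
        if subtree_sizes[i] / m_i <= tau_i:
            return i
    return 0                                         # fall back to the root
```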
Binary Trees of Small Height

Theorem: insertions require amortized time O(log² N).

Proof sketch: consider two consecutive redistributions of v_i. After the first redistribution, ρ(v_i) ≤ τ_i. Just before the second, some child v_{i+1} of v_i has ρ(v_{i+1}) > τ_{i+1}, so at least m(v_{i+1})·(τ_{i+1} − τ_i) insertions happened below v_{i+1} in between. Redistributing v_i costs O(m(v_i)), i.e. O(m(v_i)/(m(v_{i+1})·(τ_{i+1} − τ_i))) = O(H) per insertion below v_{i+1}, since m(v_i) ≤ 2·m(v_{i+1}) + 1 and τ_{i+1} − τ_i = Θ(1/H). An insertion is charged at most once per level, so the total cost per element is O(H²) = O(log² N).

Andersson and Lai 1990
Memory Layouts of Trees

[Figure: memory positions of the nodes of a complete binary tree with 15 nodes under each layout]

DFS, inorder, BFS, van Emde Boas (in theory best)
Searches in Pointer-Based Layouts

[Plot: search time per element versus n (10³ to 10⁶) for the pointer-based BFS, DFS, vEB, random-insert, and random layouts]

The van Emde Boas layout wins, followed by the BFS layout.
Searches with Implicit Layouts

[Plot: search time per element versus n (10³ to 10⁶) for the implicit BFS, DFS, vEB, in-order, and 9-ary BFS layouts]

The BFS layout wins due to its simplicity and the caching of the topmost levels; the van Emde Boas layout requires quite complex index computations.
Implicit vs Pointer-Based Layouts

[Plots: pointer-based versus implicit search times, for the BFS layout and for the van Emde Boas layout]

Implicit layouts become competitive as n grows.
Insertions in Implicit Layouts

[Plot: insertion time per element versus n for the implicit BFS, in-order, and vEB layouts under random inserts]

Insertions are rather slow (a factor 10-100 over searches).
Summary

Dynamic cache-oblivious search trees:
Search: O(log_B N)
Range reporting: O(log_B N + k/B)
Updates: O(log_B N + log² N / B)
Update time O(log_B N) is possible by one level of indirection (but implies sub-optimal range reporting).

Importance of memory layouts: the van Emde Boas layout gives good cache performance.
Computation time is important when considering caches.
Cache-Oblivious Sorting
Sorting

Problem
Input: array containing x_1, ..., x_N
Output: array with x_1, ..., x_N in sorted order
Elements can be compared and copied.

Example: 3 4 8 2 8 4 0 4 4 6  →  0 2 3 4 4 4 4 6 8 8
Binary Merge-Sort

[Figure: merge tree for the input 3 4 8 2 8 4 0 4 4 6, producing the output 0 2 3 4 4 4 4 6 8 8]

Recursive, merging two arrays at a time; subproblems of size O(M) are handled internally in the cache.
O(N log N) comparisons
O(N/B · log₂(N/M)) I/Os
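The algorithm itself is standard; a minimal sketch, annotated with the I/O argument above:

```python
def merge_sort(a):
    """Binary merge-sort: split in halves, sort recursively, merge.
    Each level of the merge tree scans all N elements once, i.e.
    O(N/B) I/Os per level; subproblems of size O(M) stay in cache,
    so only O(log2(N/M)) levels cost I/Os."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # the merge step
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```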
Merge-Sort

Degree              #I/Os
2                   O(N/B · log₂(N/M))
d (d ≤ M/B − 1)     O(N/B · log_d(N/M))
Θ(M/B)              O(N/B · log_{M/B}(N/M)) = O(Sort_{M,B}(N))   Aggarwal and Vitter 1988
Funnel-Sort (2)     O(1/ε · Sort_{M,B}(N)), for M ≥ B^{1+ε}      Frigo, Leiserson, Prokop and Ramachandran 1999; Brodal and Fagerberg 2002
Lower Bound (Brodal and Fagerberg 2003)

One algorithm, two machines, B₁ ≤ B₂:

Machine 1: block size B₁, memory M, t₁ I/Os
Machine 2: block size B₂, memory M, t₂ I/Os

Trade-off: 8·t₁·B₁ + 3·t₁·B₁·log((8·M·t₂)/(t₁·B₁)) ≥ N·log(N/M) − 1.45·N
Lower Bound

Algorithm           Assumption     #I/Os
Lazy Funnel-Sort    B ≤ M^{1−ε}    (a) B₂ = M^{1−ε}: Sort_{B₂,M}(N)   (b) B₁ = 1: (1/ε)·Sort_{B₁,M}(N)
Binary Merge-Sort   B ≤ M/2        (a) B₂ = M/2: Sort_{B₂,M}(N)       (b) B₁ = 1: (log M)·Sort_{B₁,M}(N)

Corollary: by the trade-off, optimality for block size (a) forces the overhead factor in (b).
Funnel-Sort
k-Merger (Frigo et al., FOCS 99)

A k-merger merges k sorted input streams into one sorted output stream.

Recursive definition: a k-merger is built from √k-mergers, a top merger M₀ and bottom mergers M₁, ..., M_√k, connected by buffers B₁, ..., B_√k of size k^{3/2}.

Recursive layout: M₀ B₁ M₁ B₂ M₂ ⋯ B_√k M_√k
Lazy k-Merger (Brodal and Fagerberg 2002)

[Figure: k-merger with top merger M₀, buffers B₁, ..., B_√k, and bottom mergers M₁, ..., M_√k]

Procedure Fill(v)
    while out-buffer not full
        if left in-buffer empty
            Fill(left child)
        if right in-buffer empty
            Fill(right child)
        perform one merge step

Lemma: if M ≥ B² and the output buffer has size k³, then O(k³/B · log_M(k³) + k) I/Os are done during an invocation of Fill(root).
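The Fill procedure can be sketched for binary merger nodes. Python lists stand in for the bounded buffers here; a real implementation would use circular buffers stored in the recursive layout:

```python
class Merger:
    """One node of a lazy binary merger: a bounded output buffer fed
    by two children, each either another Merger or a sorted list
    serving as an input stream."""
    def __init__(self, left, right, out_size):
        self.left, self.right, self.out_size = left, right, out_size
        self.buf = []                    # output buffer, front = smallest

    def _peek(self, child):
        if isinstance(child, Merger):
            if not child.buf:
                child.fill()             # lazily refill an empty child buffer
            return child.buf[0] if child.buf else None
        return child[0] if child else None

    def _pop(self, child):
        return child.buf.pop(0) if isinstance(child, Merger) else child.pop(0)

    def fill(self):
        """The slide's Fill: refill empty children, then merge steps."""
        while len(self.buf) < self.out_size:
            l, r = self._peek(self.left), self._peek(self.right)
            if l is None and r is None:
                break                    # both streams exhausted
            if r is None or (l is not None and l <= r):
                self.buf.append(self._pop(self.left))
            else:
                self.buf.append(self._pop(self.right))
```

For example, a 4-merger is a root `Merger` over two binary `Merger`s whose children are sorted runs.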
Funnel-Sort (Frigo, Leiserson, Prokop and Ramachandran 1999; Brodal and Fagerberg 2002)

Divide the input into N^{1/3} segments of size N^{2/3}.
Recursively Funnel-Sort each segment.
Merge the sorted segments with an N^{1/3}-merger.

(Merger sizes in the recursion: k = N^{1/3}, N^{2/9}, N^{4/27}, ..., 2.)

Theorem: Funnel-Sort performs O(Sort_{M,B}(N)) I/Os for M ≥ B².
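The outer recursion can be sketched with Python's `heapq.merge` standing in for the cache-oblivious k-merger, so this shows the recursion pattern only, not the I/O behaviour (the cutoff of 8 is an arbitrary choice):

```python
import heapq

def funnel_sort(a):
    """Outer recursion of Funnel-Sort: split into ~N**(1/3) segments
    of size ~N**(2/3), sort each recursively, then merge the sorted
    runs (heapq.merge stands in for the N**(1/3)-merger here)."""
    n = len(a)
    if n <= 8:                              # base case: any in-cache sort
        return sorted(a)
    seg = max(2, round(n ** (2 / 3)))       # segment size ~ N^(2/3)
    runs = [funnel_sort(a[i:i + seg]) for i in range(0, n, seg)]
    return list(heapq.merge(*runs))
```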
Hardware

                          Dell PC (Pentium 4)   Delta PC (Pentium 3)   SGI Octane (MIPS 10000)
Operating system          GNU/Linux 2.4.18      GNU/Linux 2.4.18       IRIX 6.5
Clock rate                2400 MHz              800 MHz                175 MHz
Address space             32 bit                32 bit                 64 bit
Integer pipeline stages   20                    12                     6
L1 data cache size        8 KB                  16 KB                  32 KB
L1 line size              128 bytes             32 bytes               32 bytes
L1 associativity          4-way                 4-way                  2-way
L2 cache size             512 KB                256 KB                 1024 KB
L2 line size              128 bytes             32 bytes               32 bytes
L2 associativity          8-way                 4-way                  2-way
TLB entries               128                   64                     64
TLB associativity         full                  4-way                  64-way
TLB miss handler          hardware              hardware               software
Main memory               512 MB                256 MB                 128 MB
Wall Clock

[Plot: Pentium 4, 512/512; wall clock time per element versus #elements (10⁶ to 10⁹) for ffunnelsort, funnelsort, lowscosa, stdsort, ami_sort, msort-c, msort-m]

Kristoffer Vinther 2003
Page Faults

[Plot: Pentium 4, 512/512; page faults per block of elements versus #elements (10⁶ to 10⁹) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c, msort-m]

Kristoffer Vinther 2003
Cache Misses

[Plot: MIPS 10000, 1024/128; L2 cache misses per line of elements versus #elements (10⁵ to 10⁹) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c, msort-m]

Kristoffer Vinther 2003
TLB Misses

[Plot: MIPS 10000, 1024/128; TLB misses per block of elements versus #elements (10⁵ to 10⁹) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c, msort-m]

Kristoffer Vinther 2003
Conclusions

Cache-oblivious sorting:
is possible,
requires a tall-cache assumption M ≥ B^{1+ε},
achieves performance comparable with cache-aware algorithms.

Future work:
more experimental justification for the cache-oblivious model,
limitations of the model,
time-space trade-offs,
a tool-box for cache-oblivious algorithms.