Cache-Oblivious Algorithms


Cache-Oblivious Model

The Unknown Machine

C:    algorithm → C program → (gcc) → object code → (Linux) execution
      — can be executed on machines with a specific class of CPUs
Java: algorithm → Java program → (javac) → Java bytecode → (java) interpretation
      — can be executed on any machine with a Java interpreter

Goal: develop algorithms that are optimized w.r.t. memory hierarchies
without knowing the parameters of the hierarchy.

Cache-Oblivious Model

CPU ↔ memory ↔ disk, as in the I/O model, except that algorithms do not
know the parameters B (block size) and M (memory size). Assumes an optimal
off-line cache-replacement strategy. (Frigo et al. 1999)

Justification of the ideal-cache model

Optimal replacement: LRU with a cache of twice the size incurs at most
twice the cache misses of optimal replacement (Sleator and Tarjan, 1985).
Corollary: if T_{M,B}(N) = O(T_{2M,B}(N)), then the number of cache misses
using LRU is O(T_{M,B}(N)).

Two memory levels: an optimal cache-oblivious algorithm satisfying
T_{M,B}(N) = O(T_{2M,B}(N)) achieves an optimal number of cache misses on
each level of a multilevel cache using LRU.

Full associativity: LRU can be simulated on a direct-mapped cache by
explicit memory management; a dictionary (2-universal hash functions) of
cache lines in memory gives expected O(1) access time to a cache line.
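The resource-augmentation behaviour of LRU is easy to observe in a small simulation. The sketch below (with a hypothetical helper `lru_misses`, not from the slides) counts misses of an LRU cache on a cyclic scan, which thrashes at cache size 4 but incurs only compulsory misses at twice that size:

```python
from collections import OrderedDict

def lru_misses(accesses, capacity):
    """Count cache misses when serving `accesses` (block ids) with an
    LRU cache that holds `capacity` blocks."""
    cache = OrderedDict()          # resident blocks, in recency order
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = None
    return misses

# A cyclic scan over 8 blocks: every access misses with 4 slots,
# while 8 slots leave only the 8 compulsory misses.
trace = list(range(8)) * 10
small = lru_misses(trace, 4)
large = lru_misses(trace, 8)
```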

Matrix Multiplication

Matrix Multiplication

Problem: C = A·B, where c_ij = Σ_{k=1..N} a_ik·b_kj

Layouts of matrices (illustrated on an 8×8 matrix):
row major, column major, 4×4-blocked, bit interleaved.
[Figure: the element numbering of each of the four layouts]

Matrix Multiplication

Algorithm 1: Nested loops (row-major layout)

    for i = 1 to N
      for j = 1 to N
        c_ij = 0
        for k = 1 to N
          c_ij = c_ij + a_ik · b_kj

Reading a column of B uses N I/Os, so the total is O(N³) I/Os.

Algorithm 2: Blocked algorithm (cache-aware)
- Partition A and B into blocks of size s × s, where s·s = Θ(M)
- Apply Algorithm 1 to the (N/s) × (N/s) matrices whose elements are
  the s × s blocks

With the s×s-blocked layout, or with row major and M = Ω(B²):

    O((N/s)³ · s²/B) = O(N³/(B·s)) = O(N³/(B·√M)) I/Os

Optimal (Hong & Kung, 1981)
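A minimal sketch of the blocked Algorithm 2, assuming matrices as Python lists of lists (the block size `s` plays the role of Θ(√M); this is an illustration, not the authors' code):

```python
def matmul_blocked(A, B, s):
    """Cache-aware blocked multiply of two N x N matrices: run the
    triple loop over s x s blocks, so each pair of blocks (Theta(M)
    words when s = Theta(sqrt(M))) is multiplied entirely in cache."""
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for i0 in range(0, N, s):
        for j0 in range(0, N, s):
            for k0 in range(0, N, s):          # block-level Algorithm 1
                for i in range(i0, min(i0 + s, N)):
                    for k in range(k0, min(k0 + s, N)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, N)):
                            C[i][j] += a * B[k][j]
    return C
```

The result is identical to the naive triple loop; only the order of the memory accesses changes.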

Matrix Multiplication

Algorithm 3: Recursive algorithm (cache-oblivious)

    [A11 A12]   [B11 B12]   [A11·B11 + A12·B21   A11·B12 + A12·B22]
    [A21 A22] · [B21 B22] = [A21·B11 + A22·B21   A21·B12 + A22·B22]

8 recursive (N/2 × N/2)-matrix multiplications + 4 (N/2 × N/2)-matrix sums

Number of I/Os, if bit interleaved or (row major and M = Ω(B²)):

    T(N) = O(N²/B)                 if N ≤ ε·√M
    T(N) = 8·T(N/2) + O(N²/B)      otherwise

which solves to T(N) = O(N³/(B·√M)).

Optimal (Hong & Kung, 1981). Non-square matrices: Frigo et al., 1999
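The quadrant recursion can be sketched as follows, assuming N is a power of two and matrices are lists of lists (an illustration of the scheme, not the authors' code):

```python
def matmul_rec(A, B):
    """Cache-oblivious multiply for N x N matrices, N a power of two:
    split into quadrants, 8 recursive products + 4 quadrant sums."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)]
                        for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(matmul_rec(A11, B11), matmul_rec(A12, B21))
    C12 = add(matmul_rec(A11, B12), matmul_rec(A12, B22))
    C21 = add(matmul_rec(A21, B11), matmul_rec(A22, B21))
    C22 = add(matmul_rec(A21, B12), matmul_rec(A22, B22))
    # stitch the four quadrants back together
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```

Once a subproblem fits in cache (N ≈ √M), the whole recursive call runs without further misses, without the algorithm ever knowing M.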

Matrix Multiplication

Algorithm 4: Strassen's algorithm (cache-oblivious)

7 recursive (N/2 × N/2)-matrix multiplications + O(1) matrix sums
(Winograd's variant):

    m1 := (a21 + a22 − a11)(b22 − b12 + b11)     c11 := m2 + m3
    m2 := a11·b11                                c12 := m1 + m2 + m5 + m6
    m3 := a12·b21                                c21 := m1 + m2 + m4 − m7
    m4 := (a11 − a21)(b22 − b12)                 c22 := m1 + m2 + m4 + m5
    m5 := (a21 + a22)(b12 − b11)
    m6 := (a12 − a21 + a11 − a22)·b22
    m7 := a22·(b11 + b22 − b12 − b21)

Number of I/Os, if bit interleaved or (row major and M = Ω(B²)):

    T(N) = O(N²/B)                 if N ≤ ε·√M
    T(N) = 7·T(N/2) + O(N²/B)      otherwise

which solves to T(N) = O(N^{log₂7} / (B·M^{log₂7/2 − 1})), log₂7 ≈ 2.81
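The seven products above are easy to check mechanically. A sketch of one level of the recursion on scalar entries (in the full algorithm the entries are themselves submatrices):

```python
def strassen_2x2(A, B):
    """One level of the Winograd variant of Strassen's scheme:
    seven multiplications instead of eight for a 2x2 product."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a21 + a22 - a11) * (b22 - b12 + b11)
    m2 = a11 * b11
    m3 = a12 * b21
    m4 = (a11 - a21) * (b22 - b12)
    m5 = (a21 + a22) * (b12 - b11)
    m6 = (a12 - a21 + a11 - a22) * b22
    m7 = a22 * (b11 + b22 - b12 - b21)
    return ((m2 + m3,            m1 + m2 + m5 + m6),
            (m1 + m2 + m4 - m7,  m1 + m2 + m4 + m5))
```

Expanding each c_ij confirms that the cross terms cancel and only the two products of the usual formula survive.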

Cache-Oblivious Search Trees

Static Cache-Oblivious Trees

Recursive memory layout (van Emde Boas layout): cut a tree of height h at
height h/2 into a top tree A and bottom trees B1, ..., Bk, and lay out
A, B1, ..., Bk recursively and consecutively in memory.

- Degree O(1)
- Searches use O(log_B N) I/Os (Prokop 1999)
- Range reportings use O(log_B N + k/B) I/Os
- Best possible search bound: (log₂e + o(1))·log_B N
  (Bender, Brodal, Fagerberg, Ge, He, Hu, Iacono, López-Ortiz 2003)
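The recursive layout can be computed explicitly. A sketch, assuming a complete binary tree identified by its BFS indices (`veb_order` is an illustrative helper, not from the slides):

```python
def veb_order(height):
    """van Emde Boas memory order of a complete binary tree with
    2**height - 1 nodes, as a permutation of BFS indices (root = 1):
    lay out the top half-height tree, then each bottom tree in turn."""
    if height == 1:
        return [1]
    top_h = height // 2               # cut at half the height
    bot_h = height - top_h
    order = veb_order(top_h)          # top tree keeps its BFS indices
    local = veb_order(bot_h)          # local order inside a bottom tree
    for r in range(2 ** top_h, 2 ** (top_h + 1)):  # bottom-tree roots
        for l in local:
            d = l.bit_length() - 1    # depth of l within its subtree
            order.append(r * (1 << d) + (l - (1 << d)))
    return order
```

For height 3 this yields [1, 2, 4, 5, 3, 6, 7]: the root, then its left subtree as one memory block, then its right subtree, so a root-to-leaf path touches O(log_B N) blocks.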

Dynamic Cache-Oblivious Trees

- Embed a dynamic tree of small height into a complete tree
- Static van Emde Boas layout
- Rebuild the data structure whenever N doubles or halves

Search:          O(log_B N)
Range reporting: O(log_B N + k/B)
Updates:         O(log_B N + log²N / B)

[Figure: a tree with keys 1, 3, 4, 5, 6, 7, 8, 10, 11, 13]

(Brodal, Fagerberg, Jacob 2001)

Example

[Figure: a binary tree with keys 1, 3, 4, 5, 6, 7, 8, 10, 11, 13 embedded
in a complete tree, and its memory layout 6 4 8 1 3 5 7 11 10 13]

Binary Trees of Small Height

[Figure: inserting the new key 2 into the tree with keys
1, 3, 4, 5, 6, 7, 8, 10, 11, 13]

If an insertion causes non-small height, then rebuild the subtree at the
nearest ancestor with sufficiently few descendants.

Insertions require amortized time O(log² N) (Andersson and Lai 1990)

Binary Trees of Small Height

For each level i there is a threshold τ_i = τ_L + i·(τ_U − τ_L)/H, such that
0 < τ_L = τ_0 < τ_1 < ··· < τ_H = τ_U < 1.

For a node v_i on level i define the density

    ρ(v_i) = (# nodes below v_i) / m_i,

where m_i = # possible nodes below v_i with depth at most H.

Insertion: insert the new element; if its depth exceeds H, then locate the
nearest ancestor v_i with ρ(v_i) ≤ τ_i and rebuild the subtree at v_i to
have minimum height, with the elements evenly distributed between the left
and right subtrees. (Andersson and Lai 1990)
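The rebuild step (minimum height, elements split evenly) amounts to rebuilding from the sorted keys with the median as root. A sketch with illustrative helpers (`rebuild`, `height` are not the authors' code):

```python
def rebuild(keys):
    """Rebuild a subtree from its sorted keys: the median becomes the
    root, so the result has minimum height and the elements are split
    evenly between the left and right subtrees."""
    if not keys:
        return None
    mid = len(keys) // 2
    return (rebuild(keys[:mid]), keys[mid], rebuild(keys[mid + 1:]))

def height(t):
    """Height of a (left, key, right) tuple tree; empty tree has 0."""
    return 0 if t is None else 1 + max(height(t[0]), height(t[2]))
```

Rebuilding 15 keys gives a perfectly balanced tree of height 4, and an inorder walk returns the keys in sorted order.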

Binary Trees of Small Height

Theorem: Insertions require amortized time O(log² N).

Proof: Consider two consecutive redistributions at v_i. After the first
redistribution, ρ(v_i) ≤ τ_i. Before the second redistribution, a child
v_{i+1} of v_i has ρ(v_{i+1}) > τ_{i+1}. The number of insertions below
v_{i+1} in between is therefore at least

    m(v_{i+1})·(τ_{i+1} − τ_i) = m(v_{i+1})·(τ_U − τ_L)/H.

The redistribution at v_i costs O(m(v_i)), i.e. since m(v_i) ≤ 2·m(v_{i+1}) + 1,
O(H) per insertion below v_i. Summing over all H = O(log N) levels gives a
total insertion cost of O(log² N) per element. (Andersson and Lai 1990)

Memory Layouts of Trees

[Figure: a complete binary tree with 15 nodes, numbered according to each
of four memory layouts: DFS, inorder, BFS, and van Emde Boas (in theory
best)]

Searches in Pointer-Based Layouts

[Plot: search time per element vs. n (10³ to 10⁶) for pointer-based BFS,
DFS, vEB, random-insert, and random layouts]

The van Emde Boas layout wins, followed by the BFS layout.

Searches with Implicit Layouts

[Plot: search time per element vs. n (10³ to 10⁶) for implicit BFS, DFS,
vEB, in-order, and 9-ary BFS layouts]

The BFS layout wins due to its simplicity and the caching of the topmost
levels; the van Emde Boas layout requires quite complex index computations.

Implicit vs Pointer-Based Layouts

[Plots: search time per element vs. n (10³ to 10⁶), pointer vs. implicit,
for the BFS layout and for the van Emde Boas layout]

Implicit layouts become competitive as n grows.

Insertions in Implicit Layouts

[Plot: insertion time per element vs. n (10⁵ to 10⁶) for random inserts
into the implicit BFS, in-order, and vEB layouts]

Insertions are rather slow (a factor of 10–100 over searches).

Summary

Dynamic cache-oblivious search trees:

Search:          O(log_B N)
Range reporting: O(log_B N + k/B)
Updates:         O(log_B N + log²N / B)

- Update time O(log_B N) by one level of indirection
  (implies sub-optimal range reporting)
- Importance of memory layouts: the van Emde Boas layout gives good cache
  performance
- Computation time is important when considering caches

Cache-Oblivious Sorting

Sorting

Problem
  Input:  array containing x_1, ..., x_N
  Output: array with x_1, ..., x_N in sorted order
  Elements can be compared and copied

Example: 3 4 8 2 8 4 0 4 4 6  →  0 2 3 4 4 4 4 6 8 8

Binary Merge-Sort

[Figure: the merge tree sorting the input 3 4 8 2 8 4 0 4 4 6 into the
output 0 2 3 4 4 4 4 6 8 8]

Recursive; merges two arrays at a time; subproblems of size O(M) are
handled internally in cache.

O(N log N) comparisons, O((N/B)·log₂(N/M)) I/Os

Merge-Sort

Degree             I/Os
2                  O((N/B)·log₂(N/M))
d (d ≤ M/B − 1)    O((N/B)·log_d(N/M))
Θ(M/B)             O((N/B)·log_{M/B}(N/M)) = O(Sort_{M,B}(N))
                   (Aggarwal and Vitter 1988)

Funnel-Sort        O((1/ε)·Sort_{M,B}(N))  for M ≥ B^{1+ε}
                   (Frigo, Leiserson, Prokop and Ramachandran 1999;
                    Brodal and Fagerberg 2002)
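A minimal sketch of d-way merge-sort, with `run` standing in for the memory size M (the runs sorted "in cache") and `heapq.merge` performing the d-way merges; each round over the runs corresponds to one of the log_d(N/M) passes:

```python
import heapq

def dway_mergesort(a, d, run):
    """Multiway merge-sort sketch: sort runs of length `run`, then
    repeatedly merge d runs at a time until one run remains."""
    runs = [sorted(a[i:i + run]) for i in range(0, len(a), run)]
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + d]))
                for i in range(0, len(runs), d)]
    return runs[0] if runs else []
```

With d = Θ(M/B) this is the cache-aware optimum of Aggarwal and Vitter; the cache-oblivious challenge is achieving the same bound without knowing M or B.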

Lower Bound (Brodal and Fagerberg 2003)

            Block size   Memory   I/Os
Machine 1   B₁           M        t₁
Machine 2   B₂           M        t₂

One algorithm, two machines, B₁ ≤ B₂.

Trade-off: 8·t₁·B₁ + 3·t₁·B₁·log(8Mt₂/(t₁B₁)) ≥ N·log(N/M) − 1.45·N

Lower Bound

Algorithm           Assumption     I/Os
Lazy Funnel-Sort    B ≤ M^{1−ε}    (a) B₂ = M^{1−ε}: O(Sort_{B₂,M}(N))
                                   (b) B₁ = 1: O((1/ε)·Sort_{B₁,M}(N))
Binary Merge-Sort   B ≤ M/2        (a) B₂ = M/2: O(Sort_{B₂,M}(N))
                                   (b) B₁ = 1: O(log M · Sort_{B₁,M}(N))

Corollary: (a) implies (b).

Funnel-Sort

k-merger (Frigo et al., FOCS'99)

A k-merger merges k sorted input streams into one sorted output stream.

Recursive definition: a k-merger is built from √k + 1 many √k-mergers —
an output merger M₀ fed by input mergers M₁, ..., M_√k through buffers
B₁, ..., B_√k of size k^{3/2} each.

Recursive memory layout: M₀ B₁ M₁ B₂ M₂ ··· B_√k M_√k

Lazy k-merger (Brodal and Fagerberg 2002)

Procedure Fill(v)
  while out-buffer not full
    if left in-buffer empty
      Fill(left child)
    if right in-buffer empty
      Fill(right child)
    perform one merge step

Lemma: If M ≥ B² and the output buffer has size k³, then
O((k³/B)·log_M(k³) + k) I/Os are done during an invocation of Fill(root).
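The Fill procedure can be sketched on a plain balanced tree of binary mergers (ignoring the recursive buffer sizes and the memory layout, which are what give the I/O bound; `Merger` and `merge_streams` are illustrative, not the authors' code):

```python
class Merger:
    """Lazy binary merger: fill() follows the slide's Fill(v), refilling
    a child's in-buffer only when it runs empty and performing one merge
    step at a time until the out-buffer is full."""
    def __init__(self, cap, children=None, stream=None):
        self.cap = cap                 # out-buffer capacity
        self.children = children       # pair of child Mergers, or None
        self.stream = iter(stream) if stream is not None else None
        self.out = []                  # out-buffer, consumed from front
        self.done = False              # no more elements will appear

    def fill(self):
        self.out = []
        if self.children is None:      # leaf: copy from the input stream
            for x in self.stream:
                self.out.append(x)
                if len(self.out) == self.cap:
                    return
            self.done = True
            return
        left, right = self.children
        while len(self.out) < self.cap:
            for child in (left, right):
                if not child.out and not child.done:
                    child.fill()       # lazy recursive refill
            if left.out and (not right.out or left.out[0] <= right.out[0]):
                self.out.append(left.out.pop(0))   # one merge step
            elif right.out:
                self.out.append(right.out.pop(0))
            else:                      # both children exhausted
                self.done = True
                return

def merge_streams(streams, cap=4):
    """Merge 2^i sorted streams through a balanced tree of lazy mergers."""
    nodes = [Merger(cap, stream=s) for s in streams]
    while len(nodes) > 1:
        nodes = [Merger(cap, children=(a, b))
                 for a, b in zip(nodes[::2], nodes[1::2])]
    root, result = nodes[0], []
    while not (root.done and not root.out):
        result.extend(root.out)
        root.out = []
        if not root.done:
            root.fill()
    return result
```

The laziness is the point: work is done only when a buffer actually runs empty, which is what makes the amortized I/O analysis of the lemma go through in the real structure.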

Funnel-Sort (Frigo, Leiserson, Prokop and Ramachandran 1999;
Brodal and Fagerberg 2002)

- Divide the input into N^{1/3} segments of size N^{2/3}
- Recursively Funnel-Sort each segment
- Merge the sorted segments with an N^{1/3}-merger

(merger sizes down the recursion: k = N^{1/3}, N^{2/9}, N^{4/27}, ..., 2)

Theorem: Funnel-Sort performs O(Sort_{M,B}(N)) I/Os for M ≥ B²
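The recursion pattern alone can be sketched in a few lines; here `heapq.merge` stands in for the cache-oblivious k-merger, so this shows the N^{1/3}/N^{2/3} split but not the I/O bound (an illustration, not the authors' code):

```python
import heapq

def funnelsort(a):
    """Funnel-Sort recursion sketch: split into ~N**(1/3) segments of
    size ~N**(2/3), sort each recursively, then k-way merge them."""
    n = len(a)
    if n <= 8:                                 # small base case
        return sorted(a)
    seg = max(2, round(n ** (2 / 3)))          # segment size ~ N^(2/3)
    segments = [funnelsort(a[i:i + seg]) for i in range(0, n, seg)]
    return list(heapq.merge(*segments))        # stand-in for the k-merger
```

In the real algorithm the final merge is done by the lazy N^{1/3}-merger, whose k³-sized output buffer matches the total input size of the recursion level.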

Hardware

Processor type            Pentium 4       Pentium 3       MIPS 10000
Workstation               Dell PC         Delta PC        SGI Octane
Operating system          GNU/Linux       GNU/Linux       IRIX
                          kernel 2.4.18   kernel 2.4.18   version 6.5
Clock rate                2400 MHz        800 MHz         175 MHz
Address space             32 bit          32 bit          64 bit
Integer pipeline stages   20              12              6
L1 data cache size        8 KB            16 KB           32 KB
L1 line size              128 Bytes       32 Bytes        32 Bytes
L1 associativity          4-way           4-way           2-way
L2 cache size             512 KB          256 KB          1024 KB
L2 line size              128 Bytes       32 Bytes        32 Bytes
L2 associativity          8-way           4-way           2-way
TLB entries               128             64              64
TLB associativity         Full            4-way           64-way
TLB miss handler          Hardware        Hardware        Software
Main memory               512 MB          256 MB          128 MB

Wall Clock

[Plot: Pentium 4, 512/512 — wall clock time per element (0.1µs to 100µs)
vs. number of elements (10⁶ to 10⁹) for ffunnelsort, funnelsort, lowscosa,
stdsort, ami_sort, msort-c, msort-m]

(Kristoffer Vinther 2003)

Page Faults

[Plot: Pentium 4, 512/512 — page faults per block of elements (0 to 30)
vs. number of elements (10⁶ to 10⁹) for ffunnelsort, funnelsort, lowscosa,
stdsort, msort-c, msort-m]

(Kristoffer Vinther 2003)

Cache Misses

[Plot: MIPS 10000, 1024/128 — L2 cache misses per line of elements (0 to
30) vs. number of elements (10⁵ to 10⁹) for ffunnelsort, funnelsort,
lowscosa, stdsort, msort-c, msort-m]

(Kristoffer Vinther 2003)

TLB Misses

[Plot: MIPS 10000, 1024/128 — TLB misses per block of elements vs. number
of elements (10⁵ to 10⁹) for ffunnelsort, funnelsort, lowscosa, stdsort,
msort-c, msort-m]

(Kristoffer Vinther 2003)

Conclusions

Cache-oblivious sorting:
- is possible
- requires a tall-cache assumption M ≥ B^{1+ε}
- has comparable performance with cache-aware algorithms

Future work:
- more experimental justification for the cache-oblivious model
- limitations of the model
- time-space trade-offs?
- tool-box for cache-oblivious algorithms