The Cost of Cache-Oblivious Searching


Michael A. Bender, Gerth Stølting Brodal, Rolf Fagerberg, Dongdong Ge, Simai He, Haodong Hu, John Iacono, Alejandro López-Ortiz

Abstract

This paper gives tight bounds on the cost of cache-oblivious searching. The paper shows that no cache-oblivious search structure can guarantee a search performance of fewer than lg e · log_B N memory transfers between any two levels of the memory hierarchy. This lower bound holds even if all of the block sizes are limited to be powers of 2. The paper gives modified versions of the van Emde Boas layout, where the expected number of memory transfers between any two levels of the memory hierarchy is arbitrarily close to [lg e + O(lg lg B / lg B)] log_B N + O(1). This factor approaches lg e ≈ 1.443 as B increases. The expectation is taken over the random placement in memory of the first element of the structure.

Because searching in the disk-access machine (DAM) model can be performed in log_B N + O(1) block transfers, this result establishes a separation between the (2-level) DAM model and the cache-oblivious model. The DAM model naturally extends to k levels. The paper also shows that as k grows, the search costs of the optimal k-level DAM search structure and the optimal cache-oblivious search structure rapidly converge. This result demonstrates that for a multilevel memory hierarchy, a simple cache-oblivious structure almost replicates the performance of an optimal parameterized k-level DAM structure.

Keywords: cache-oblivious B-tree, cache-oblivious searching, van Emde Boas layout

1 Introduction

Hierarchical Memory Models. Traditionally, algorithms are designed to run efficiently in a random access model (RAM) of computation, which assumes a flat memory with uniform access times. However, as hierarchical memory systems become steeper and more complicated, algorithms are increasingly designed assuming more accurate memory models; see, e.g., [2-5, 7-9, 28, 33-35, 39-41]. Two of the most successful memory models are the disk-access model (DAM) and the cache-oblivious
model.

An earlier version of this paper in extended abstract form appears in the Proceedings of the 44th Annual Symposium on Foundations of Computer Science (FOCS), pages -, 2003 [3].

Department of Computer Science, Stony Brook University, Stony Brook, NY, USA, and Tokutek, Inc., Lexington, MA. bender@cs.sunysb.edu. Supported in part by NSF Grants CCF /062425, CCF / , CCF / , CNS , and CCF , and by DOE Grant DE-FG02-08ER25853.

MADALGO - Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation, Department of Computer Science, Aarhus University, Ny Munkegade, 8000 Århus C, Denmark. gerth@cs.au.dk. Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST (ALCOM-FT) and the Carlsberg Foundation (contract number ANS-0257/20).

Department of Mathematics and Computer Science, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark. rolf@imada.sdu.dk.

Department of Management Science and Engineering, Antai School of Economics and Management, Shanghai JiaoTong University, Shanghai, China. ddge@sjtu.edu.cn.

Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Hong Kong, China. smhe@se.cuhk.edu.hk.

Networking and Device Connectivity, Windows Division, Microsoft, Redmond, WA. haodhu@microsoft.com.

Department of Computer and Information Science, Polytechnic University, Brooklyn, NY 11201, USA. jiacono@poly.edu. Research supported in part by NSF grants CCF and OISE and by an Alfred P. Sloan Fellowship.

School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada. alopez-o@uwaterloo.ca.

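As an aside for readers who want to experiment, the block-transfer cost metric used by both models below is easy to instrument. The following sketch is our own illustration, not code from the paper; the helper names `io_cost` and `binary_search_probes` are hypothetical. It counts the distinct aligned blocks touched by the probes of an ordinary binary search in a sorted array, the cost a B-tree is designed to beat.

```python
import math

def io_cost(access_sequence, B):
    """Number of distinct size-B blocks touched by a sequence of array-index
    accesses -- the block-transfer cost metric of the DAM model (block j
    covers indices [j*B, (j+1)*B))."""
    return len({i // B for i in access_sequence})

def binary_search_probes(N, key):
    """Indices probed by a lower-bound binary search in a sorted array of
    size N whose keys are taken to be the indices 0..N-1."""
    lo, hi, probes = 0, N, []
    while lo < hi:
        mid = (lo + hi) // 2
        probes.append(mid)
        if mid < key:
            lo = mid + 1
        else:
            hi = mid
    return probes

N, B = 1 << 20, 64
worst = max(io_cost(binary_search_probes(N, k), B) for k in range(0, N, 4097))
# Binary search in a sorted array touches about lg N - lg B distinct blocks in
# the worst case, far more than the log_B N = lg N / lg B blocks of a B-tree.
print(worst, math.ceil(math.log(N, B)))
```

The gap between the two printed numbers is exactly the kind of gap that block-aware (and, below, cache-oblivious) layouts close.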
The DAM model, developed by Aggarwal and Vitter [4], is a two-level memory model, in which the memory hierarchy consists of an internal memory of size M and an arbitrarily large external memory partitioned into blocks of size B. Algorithms are designed in the DAM model with full knowledge of the values of B and M. Because memory transfers are relatively slow, the performance metric is the number of block transfers.

The cache-oblivious model, developed by Frigo, Leiserson, Prokop, and Ramachandran [27, 31], allows programmers to reason about a two-level memory hierarchy but to prove results about an unknown multilevel memory hierarchy. As in the DAM model, the objective is to minimize the number of block transfers between two levels. The main idea of the cache-oblivious model is that by avoiding any memory-specific parametrization (such as the block sizes), the cache-oblivious algorithm has an asymptotically optimal number of memory transfers between all levels of an unknown, multilevel memory hierarchy. Optimal cache-oblivious algorithms have memory performance (i.e., number of memory transfers) that is within a constant factor (independent of B and M) of the memory performance of the optimal DAM algorithm, which knows B and M. There exist surprisingly many (asymptotically) optimal cache-oblivious algorithms (see, e.g., [24, 30]).

I/O-Efficient Searching. This paper focuses on the fundamental problem of searching: Given a set S of N comparison-based totally-ordered elements, produce a data structure that can execute searches (or predecessor queries) on items in S. We prove tight bounds on the cost of cache-oblivious searching. We show that no cache-oblivious search structure can guarantee that a search performs fewer than lg e · log_B N block transfers between any two levels of the memory hierarchy, even if all of the block sizes are limited to powers of 2. We also give search structures in which the expected number of block transfers between any two levels of the memory hierarchy is arbitrarily close
to [lg e + O(lg lg B / lg B)] log_B N + O(1), which approaches lg e · log_B N + O(1) for large B. This expectation is taken over the random placement in memory of the first element of the structure.

In contrast, the performance of the B-tree [10, 23], the classic optimal search tree in the DAM model, is as follows: A B-tree with N elements has nodes with fan-out B, which are designed to fit into one memory block. The B-tree has height log_B N + 1, and a search requires log_B N + 1 memory transfers.

A static cache-oblivious search tree, proposed by Prokop [31], also performs searches in Θ(log_B N) memory transfers. The static cache-oblivious search tree is built as follows: Embed a complete binary tree with N nodes in memory, conceptually splitting the tree at half its height, thus obtaining Θ(√N) subtrees each with Θ(√N) nodes. Lay out each of these trees contiguously, storing each recursively in memory. This type of recursive layout is commonly referred to in the literature as a van Emde Boas layout because it is reminiscent of the recursive structure of the van Emde Boas tree [37, 38]. The static cache-oblivious search tree is a basic building block of most cache-oblivious search structures, including the (dynamic) cache-oblivious B-tree [4, 5, 15, 22, 32] and other cache-oblivious search structures [1, 6, 11, 12, 16-21, 25, 26]. Any improvements to the static cache-oblivious search structure immediately translate to improvements in these dynamic structures.

Results. We present the following results:

We give an analysis of Prokop's static cache-oblivious search tree [31], proving that searches perform at most (2 + 3/√B) log_B N + O(1) expected memory transfers; the expectation is taken only over the random placement of the data structure in memory. This analysis is tight to within a (1 + o(1)) factor.

We then present a class of generalized van Emde Boas layouts that optimizes performance through the use of uneven splits of the height of the tree. For any constant ε > 0, we optimize the layout, achieving a performance of [lg e + ε + O(lg lg B / lg B)] log_B N + O(1)
expected memory transfers. As

¹ Throughout the paper, lg N means log_2 N.

before, the expectation is taken over the random placement of the data structure in memory.²

While the derivations in the proof of the upper bound might not provide great insight into the reasons why it holds, its mere existence is valuable information. It shows the tightness of the lower bound and points to the surprising fact that uneven splits, contrary to established practice, are of key importance in the building of this data structure. Intuitively, the improvement of uneven splitting, as compared to the even splitting in the standard van Emde Boas layout, is likely to be due to the generation of a variety of subtree sizes at each recursive level of the layout. Such a variety will, on any search path, reduce the number of subtrees that can have particularly bad sizes compared to the block size B.

We strengthen this intuition by presenting another generalization of the van Emde Boas layout, which generates an explicit distribution of subtree sizes at each recursive level. The definition of this layout is more involved than for the first layout presented, but the corresponding proof of complexity is shorter and more direct. The achieved complexity is the same as for the first layout.

Finally, we demonstrate that it is harder to search in the cache-oblivious model than in the DAM model. Previously, the only lower bound for searching in the cache-oblivious model was the log_B N lower bound from the DAM model. We prove a lower bound of lg e · log_B N memory transfers for searching in the average case in the cache-oblivious model. Thus, for large B, our upper bound is within a factor of 1 + o(1) of the optimal cache-oblivious layout.

Interpretation. We present cache-oblivious search structures that take 44% more block transfers than the optimal DAM structure, and we prove that one cannot do better. However, this result does not mean that our cache-oblivious structures are 44% slower than an optimal algorithm for a multilevel memory hierarchy. To the contrary, this worst-case behavior only
occurs on a two-level memory hierarchy. To design a structure for a k-level memory hierarchy, one can extend the DAM model to k levels. A data structure for a k-DAM is designed with full knowledge of the size and block size of each level of the memory hierarchy. Thus, the 2-DAM is the standard DAM, where searches cost log_B N + 1 block transfers (using a B-tree). Surprisingly, in the 3-DAM this performance cannot be replicated in general. We show in Corollary 2.3 that a 3-DAM algorithm cannot achieve fewer than 1.207 log_B N block transfers on all levels simultaneously. Thus, the performance gap between a 3-DAM and the optimal cache-oblivious structure is about half that of the 2-DAM and the optimal cache-oblivious structure; naturally, a modern memory hierarchy has more than three levels. Furthermore, we show that as the number of levels in the memory hierarchy grows, the performance loss of our cache-oblivious structures relative to an optimal k-DAM structure tends to zero. Thus, for a modern memory hierarchy, our cache-oblivious structures combine simplicity and near-optimal performance.

Our cache-oblivious search trees also provide new insight into the optimal design strategy for divide-and-conquer algorithms. More generally, it has been known for several decades that divide-and-conquer algorithms frequently have good data locality [36]. The cache-oblivious model helps explain why divide-and-conquer can be advantageous. When there is a choice, the splitting in a divide-and-conquer algorithm is traditionally done evenly. The unquestioned assumption is that splitting evenly is best. Our new search structures reveal a setting in which it is crucial that the resulting subtrees across the recursive decomposition not be of equal size. Indeed, even splits lead to a performance slowdown factor of two in the worst case, as each access might straddle two blocks. In contrast, uneven splits lead to an expected slowdown of at most lg e ≈ 1.443. This shows that so long as the placement in memory is random, the tree
accesses are reasonably well aligned with the cache blocks on the average. In this case, the intuitive reason behind the superiority of uneven splits is that the regularity of even splits allows for the construction of adversarial worst-case configurations. This can be observed from the proof of Theorem 3.3.

² Note that explicit programmer intervention might be needed to ensure random placement of the data structure. This could be accomplished by means such as requesting a block of a fixed larger size and then placing the data structure at a small random offset from the head of the allocated block.

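The recursive layout described in the introduction (embed a complete binary tree, split it at half the height, and lay out the top tree followed by each bottom tree contiguously) can be sketched in a few lines. This is our own illustrative code, not from the paper; `veb_layout` is a hypothetical name, and nodes are identified by (depth, rank) pairs rather than keys.

```python
def veb_layout(height):
    """Nodes of a complete binary tree of the given height, listed in
    van Emde Boas (recursive) memory order.  A node is the pair
    (depth, rank), rank being its left-to-right position in its level.
    The height is split with the top tree getting the rounded-up half."""
    def lay(depth, rank, h):
        if h == 1:
            return [(depth, rank)]
        top_h = (h + 1) // 2                  # split at half the height
        bot_h = h - top_h
        order = lay(depth, rank, top_h)       # top tree first ...
        for i in range(2 ** top_h):           # ... then its 2**top_h bottom
            order += lay(depth + top_h,       # trees, each contiguous and
                         rank * 2 ** top_h + i, bot_h)  # recursively laid out
        return order
    return lay(0, 0, height)

layout = veb_layout(4)                  # a 15-node complete tree
assert layout[0] == (0, 0)              # the root is stored first
assert len(set(layout)) == 2 ** 4 - 1   # every node appears exactly once
```

Storing the tree's nodes in this order in one array is exactly the static structure whose search cost is analyzed in Sections 3 and 4.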
2 Lower Bound for Cache-Oblivious Searching

In this section we prove lower bounds on the I/O cost of comparison-based search algorithms optimizing for several block sizes at the same time. The specific problem we consider is the average cost of successful searches among N distinct elements, for a uniform distribution of the search key y on the N input elements. As average-case lower bounds are also worst-case lower bounds, our results also apply to the worst-case cost. We emphasize that our bounds hold even if the block sizes are known to the algorithm, and that no randomization of the placement of the data structure of the search algorithm is assumed.

Formally, our model is as follows. Given a set S of N elements x_1 < ... < x_N from a totally ordered universe, a search structure for S is an array M containing elements from S, possibly with several copies of each. A search algorithm for M is a binary decision tree where each internal node is labeled with either y < M[i] or y ≤ M[i] for some array index i, and each leaf is labeled with a number j, 1 ≤ j ≤ N. A search on a key y proceeds in a top-down fashion in the tree, and at each internal node advances to the left child if the comparison given by the label is true; otherwise it advances to the right. The binary decision tree is a correct search algorithm if for any x_i in S, the path taken by a search on key y = x_i ends in a leaf labeled i. Any such tree must have at least N leaves, and by pruning paths not taken by any search for x_1, ..., x_N, we may assume that it has exactly N leaves.

To add I/Os to the model, we divide the array M into contiguous blocks of size B. An internal node of a search algorithm is said to access the block containing the array index i in the label of the node. We define the I/O cost of a search to be the number of distinct blocks of M accessed on the path taken by the search.

We assume in our model that block sizes are powers of two and that blocks start at memory addresses divisible by the block size. This reflects the situation on
actual machines, and entails no loss of generality, as any cache-oblivious algorithm at least should work for this case. The assumption implies that when considering two block sizes B_1 < B_2, a block of size B_1 is contained in exactly one block of size B_2.

The depth of a leaf in a tree is the number of edges on its path to the root, and the height of a tree is the maximal depth among its leaves. We recall the following standard result on the average depth of leaves in binary trees.

Lemma 2.1 ([29, Section 2.3.4.5]) For a binary tree with N leaves, the average depth of a leaf is at least lg N.

In our lower bound proof, we analyze the I/O cost of a given search algorithm with respect to several block sizes simultaneously. We first describe our method for the case of two block sizes. This leads to a lower bound of 1.207 log_B N block transfers. We then generalize this proof to a larger number k of block sizes, and prove that in the limit as k grows, this gives a lower bound of lg e · log_B N ≈ 1.443 log_B N block transfers.

Lemma 2.2 If a search algorithm on a search structure for block sizes B_1 and B_2, where B_2 = B_1^c and 1 < c ≤ 2, guarantees that the average number of distinct blocks read during searches is at most δ log_{B_1} N and δ log_{B_2} N, respectively, then δ ≥ 1/(2/c + c - 2 + 3/(c lg B_1)).

Proof: Let T denote the binary decision tree constituting the search algorithm. The main idea of the proof is to transform T into a new binary decision tree T' by a transformation on nodes along each search path. Nodes that access a new block on the search path will be substituted by inserting small binary decision trees in their place. These decision trees are carefully chosen to have low height and to allow path nodes not accessing new blocks to be discarded. The end result is that the transformed path consists of one root-to-leaf path from each of the trees inserted along the original path, and there is one such tree for each distinct block access along the original path. By the bounded height of the inserted trees, a lower bound on the lengths of the
transformed paths gives a lower bound on the distinct block accesses along the original paths. Lemma 2.1 supplies such a lower bound.

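Before continuing with the transformation, the trade-off expressed by Lemma 2.2 can be explored numerically. The sketch below is our own illustration (the helper name `delta_bound` is hypothetical); it evaluates the lemma's bound δ ≥ 1/(3/(c lg B_1) + c + 2/c - 2) over a grid of exponents c, showing that for large B_1 the bound is maximized at c = √2, where it equals the constant 1/(2√2 - 2) ≈ 1.207 of Corollary 2.3.

```python
import math

def delta_bound(c, lgB1):
    """The lower bound of Lemma 2.2: delta >= 1/(3/(c lg B1) + c + 2/c - 2),
    for block sizes B1 and B2 = B1**c with 1 < c <= 2."""
    return 1.0 / (3.0 / (c * lgB1) + c + 2.0 / c - 2.0)

lgB1 = 10 ** 6          # a very large lg B1, to expose the limiting behaviour
best = max((delta_bound(i / 1000.0, lgB1), i / 1000.0)
           for i in range(1001, 2001))       # c ranges over (1, 2]
limit = 1.0 / (2.0 * math.sqrt(2.0) - 2.0)   # = 1/(2*sqrt(2)-2) ~ 1.207
print(best, limit)      # the maximum sits at c ~ sqrt(2) ~ 1.414
```

Minimizing the denominator c + 2/c - 2 by calculus gives the same answer: its derivative 1 - 2/c² vanishes at c = √2, which is why the adversary in Corollary 2.3 picks B_2 = B_1^√2.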
The transformation will be done simultaneously for all search paths in T by a top-down process, maintaining a decision tree at all times. During the process, it will be an invariant that a search for any key ends in a leaf having the same label as the leaf where the same search ends in T. In particular, the final tree T' is a correct search algorithm if T is.

To count the number of I/Os of each type (size B_1 blocks and size B_2 blocks) for each path in T, we mark some of the internal nodes by tokens τ_1 and τ_2. A node v is marked iff none of its ancestors accesses the size B_1 block accessed by v, i.e., if v is the first access to the block. The node v may also be the first access to the size B_2 block accessed by v. In this case, v is marked by τ_2; else it is marked by τ_1. Note that the word "first" above corresponds to viewing each path in the tree as a timeline; this view will be implicit in the rest of the proof.

For any root-to-leaf path, let b_i denote the number of distinct size B_i blocks accessed and let a_i denote the number of τ_i tokens on the path, for i = 1, 2. By the assumption stated above Lemma 2.1, a first access to a size B_2 block implies a first access to a size B_1 block, so we have b_2 = a_2 and b_1 = a_1 + a_2.

As said, we transform T into a new binary decision tree T' in a top-down fashion. The basic step in the transformation is to substitute a marked node v with a specific binary decision tree T_v resolving the order-wise relation between the search key y and a carefully chosen subset S_v of the elements stored in M. More precisely, in each step of the transformation, the subtree rooted at v is first removed, then the tree T_v is inserted at v's former position, and finally a copy of one of the two subtrees rooted at the children of v is inserted at each leaf of T_v. The set S_v is always chosen such that it includes the element accessed by the comparison in v. Hence, at each leaf of T_v, the order-wise relation to that element is known, and the appropriate one of the two subtrees can be
inserted. The top-down transformation then continues downwards from the leaves of T_v. When the transformation reaches a leaf of T, it is left unchanged. Note that the invariant mentioned above is maintained at each step.

We now describe the tree T_v inserted, given that the set S_v of elements has been selected. The tree T_v is a binary decision tree of minimal height resolving the relation of the search key y to all keys in S_v. If we have S_v = {z_1, z_2, ..., z_t}, with elements listed in sorted order, this amounts to resolving to which of the 2t+1 intervals (-∞; z_1), [z_1; z_1], (z_1; z_2), ..., [z_t; z_t], (z_t; ∞) the key y belongs. That tree has height at most ⌈lg(2t+1)⌉, since a perfectly balanced binary search tree on S_v (having, say, comparisons of type < in the internal nodes), with one layer of t nodes added at the leaves to resolve the equality questions, will do.

We now describe how the set S_v is chosen. First consider the case of a node v marked τ_2. Here, we let the subset S_v consist of the at most B_1 distinct elements in the block of size B_1 accessed by v, plus every (B_2/(2B_1))-th element in sorted order among the at most B_2 distinct elements in the block of size B_2 accessed by v. The size of S_v is at most B_1 + B_2/(B_2/(2B_1)) = 3B_1, so the tree T_v has height at most ⌈lg(6B_1 + 1)⌉. As B_1 is a power of two, lg(8B_1) is an integer and hence an upper bound on the height.

Next, for the case of a node v marked τ_1, note that v in T has exactly one ancestor u marked τ_2 that accesses the same size B_2 block β as v does. When the tree T_u was substituted for u, the inclusion in S_u of the 2B_1 evenly sampled elements from β ensures that below any leaf of T_u, at most B_2/(2B_1) - 1 of the elements in β can still have an unknown relation to the search key. We let S_v be these at most B_2/(2B_1) - 1 elements. The corresponding tree T_v has height at most ⌈lg(2(B_2/(2B_1) - 1) + 1)⌉, which is at most lg(B_2/B_1), as B_1 and B_2 are powers of two.

For the case of an unmarked internal node v (i.e., a node where the size B_1 block accessed at the node has been accessed before), we
can simply discard v together with either the left or right subtree, since we have already resolved the relation between the search key y and the element accessed at v. This follows from the choice of trees inserted at marked nodes: when we access a size B_2 block β_2 for the first time at some node u, we resolve the relation between the search key y and all elements in the size B_1 block β_1 accessed at u (due to the inclusion of all of β_1 in S_u), and when we for the first time access a key in β_2 outside β_1, we resolve all remaining relations between y and elements in β_2.

By the heights stated above for the inserted T_v trees, it follows that if a search for a key y in T corresponds to a path containing a_1 and a_2 tokens of type τ_1 and τ_2, respectively, then the search in T' corresponds to

a path with length bounded from above by the following expression:

a_2 lg(8B_1) + a_1 lg(B_2/B_1)
= b_2 lg(8B_1) + (b_1 - b_2) lg(B_2/B_1)
= b_2 [lg(8B_1) - lg(B_2/B_1)] + b_1 lg(B_2/B_1).

The coefficients of b_2 and b_1 are positive by the assumption B_1 < B_2 ≤ B_1², so upper bounds on b_1 and b_2 imply an upper bound on the expression above. By assumption, the averages over all searches of b_1 and b_2 are bounded by δ log_{B_1} N and δ log_{B_2} N = (δ log_{B_1} N)/c, respectively. If we prune the tree for paths not taken by any search for the keys x_1, ..., x_N, the lengths of root-to-leaf paths can only decrease. The resulting tree has N leaves, and each leaf corresponds to a search, so averages over searches and averages over leaves are the same. As Lemma 2.1 gives a lg N lower bound on the average depth of a leaf, we get

lg N ≤ (δ/c) log_{B_1} N [lg(8B_1) - lg(B_2/B_1)] + δ log_{B_1} N · lg(B_2/B_1)
= (δ/c) log_{B_1} N [3 + (2 - c) lg B_1] + δ log_{B_1} N (c - 1) lg B_1
= δ lg N [3/(c lg B_1) + (2 - c)/c + (c - 1)]
= δ lg N [3/(c lg B_1) + c + 2/c - 2].

It follows that δ ≥ 1/[3/(c lg B_1) + c + 2/c - 2].

Corollary 2.3 If a search algorithm on a search structure guarantees, for all block sizes B, that the average number of distinct blocks read during searches is at most δ log_B N, then δ ≥ 1/(2√2 - 2) ≈ 1.207.

Proof: Letting c = √2 in Lemma 2.2, we get δ ≥ 1/[3/(√2 lg B_1) + 2√2 - 2]. The lower bound follows by letting B_1 grow to infinity.

Lemma 2.4 If a search algorithm on a search structure for block sizes B_1, B_2, ..., B_k, where B_i = B_1^{c_i} and 1 = c_1 < c_2 < ... < c_k ≤ 2, guarantees that the average number of distinct blocks read during searches is at most δ log_{B_i} N for each block size B_i, then

δ ≥ 1 / [ (2 + lg(8k)/lg B_1)/c_k - 1 + Σ_{i=1}^{k-1} (c_{i+1} - c_i)/c_i ].

Proof: The proof is a generalization of the proof of Lemma 2.2 for two block sizes, and we here assume familiarity with that proof. The transformation is basically the same, except that we have a token τ_i, i = 1, ..., k, for each of the block sizes. Again, a node v is marked if none of its ancestors access the size B_1 block accessed by v, i.e., if v is the first access to this block. The node v may also be the first access to blocks of larger sizes, and we mark v by τ_i, where B_i is the largest block
size for which this is true. Note that by the assumption stated above Lemma 2.1, v must be the first access to the size B_j block accessed by v for all j with 1 ≤ j ≤ i.

For any root-to-leaf path, let b_i denote the number of distinct size B_i blocks accessed and let a_i denote the number of τ_i tokens on the path, for i = 1, ..., k. We have b_i = Σ_{j=i}^{k} a_j. Solving for a_i, we get a_k = b_k and a_i = b_i - b_{i+1}, for i = 1, ..., k-1.

As in the proof of Lemma 2.2, the transformation proceeds in a top-down fashion, and substitutes marked nodes v by binary decision trees T_v. We now describe the trees T_v for different types of nodes v.

For a node v marked τ_k, the tree T_v resolves the relation between the query key y and a set S_v of size at most (2k - 1)B_1, consisting of the at most B_1 elements in the block of size B_1 accessed at v, plus, for i = 2, ..., k, every (B_i/(2B_1))-th element in sorted order among the elements in the block of size B_i accessed at v. This tree can be chosen to have height at most ⌈lg(2(2k - 1)B_1 + 1)⌉ ≤ lg(8kB_1).

For a node v marked τ_i, i < k, let β_j be the block of size B_j accessed by v, for 1 ≤ j ≤ k. For i+1 ≤ j ≤ k, β_j has been accessed before, by the definition of τ_i. We now consider two cases.

Case I is that β_{i+1} is the only block of size B_{i+1} that has been accessed inside β_k. By the definition of the tree T_u inserted at the ancestor u of v where β_k was first accessed, at most B_{i+1}/(2B_1) - 1 of the elements in β_{i+1} can have unknown relations with respect to the search key y. The tree T_v inserted at v resolves these relations. It can be chosen to have height at most lg(B_{i+1}/B_1).

Case II is that β_{i+1} is not the only block of size B_{i+1} that has been accessed inside β_k. Then consider the smallest j for which β_{j+1} is the only block of size B_{j+1} that has been accessed inside β_k (clearly, j > i). When we for the first time accessed the second block of size B_j inside β_k at some ancestor u of v, this access was inside β_{j+1}, and a Case I substitution as described above took place. Hence a tree T_u was inserted which resolved all relations between the search key and elements in β_{j+1}. This includes the element accessed by the comparison at v, so the empty tree can be used for T_v, i.e., v and one of its subtrees is simply discarded.

For an unmarked node v, there is a token τ_i on the ancestor u of v in T at which the size B_1 block β accessed by v was first accessed. This gave rise to a tree T_u in the transformation, and this tree resolved the relations between the search key and all elements in β, either directly (i = k) or by resolving the relations for all elements in a block containing β (i < k), so v and one of its subtrees can be discarded.

After transformation and final pruning, the length of a root-to-leaf path in the final tree is bounded from above by the following expression:

a_k lg(8kB_1) + Σ_{i=1}^{k-1} a_i lg(B_{i+1}/B_1)
= b_k lg(8kB_1) + Σ_{i=1}^{k-1} (b_i - b_{i+1})(c_{i+1} - 1) lg B_1
= lg B_1 [ b_k (1 + lg(8k)/lg B_1) + b_1(c_2 - 1) + Σ_{i=2}^{k-1} b_i (c_{i+1} - c_i) - b_k (c_k - 1) ]
= lg B_1 [ b_k (2 + lg(8k)/lg B_1 - c_k) + Σ_{i=1}^{k-1} b_i (c_{i+1} - c_i) ].

For all i, the average value of b_i over all search paths is by assumption bounded by δ log_{B_i} N = (δ log_{B_1} N)/c_i, and the
coefficient of each b_i is positive, so we get the following upper bound on the average length of a search path:

δ log_{B_1} N · lg B_1 · [ (2 + lg(8k)/lg B_1 - c_k)/c_k + Σ_{i=1}^{k-1} (c_{i+1} - c_i)/c_i ]
= δ lg N · [ (2 + lg(8k)/lg B_1)/c_k - 1 + Σ_{i=1}^{k-1} (c_{i+1} - c_i)/c_i ].

By Lemma 2.1 we have

δ lg N · [ (2 + lg(8k)/lg B_1)/c_k - 1 + Σ_{i=1}^{k-1} (c_{i+1} - c_i)/c_i ] ≥ lg N,

and the lemma follows.

Theorem 2.5 If a search algorithm on a search structure guarantees, for all block sizes B, that the average number of distinct blocks read during searches is at most δ log_B N, then δ ≥ lg e ≈ 1.443.

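The limit claimed in Theorem 2.5 can be checked numerically. The sketch below is our own illustration; it assumes the bound of Lemma 2.4 in the form δ ≥ 1/[(2 + lg(8k)/lg B_1)/c_k - 1 + Σ_{i=1}^{k-1}(c_{i+1} - c_i)/c_i], instantiated with the block sizes B_i = 2^{k+i-1} used in the proof that follows (so lg B_1 = k and c_i = (k+i-1)/k). The name `k_level_denominator` is hypothetical.

```python
import math

def k_level_denominator(k):
    """Denominator of the bound in Lemma 2.4 for the block sizes
    B_i = 2**(k+i-1), i.e., c_i = (k+i-1)/k and lg B_1 = k."""
    lgB1 = k
    c = [(k + i - 1) / k for i in range(1, k + 1)]
    s = sum((c[i + 1] - c[i]) / c[i] for i in range(k - 1))
    return (2 + math.log2(8 * k) / lgB1) / c[-1] - 1 + s

# As k grows the denominator tends to ln 2 (the sum is a harmonic tail
# approaching ln 2, the first term approaches 1), so delta >= 1/ln 2 = lg e.
for k in (2, 8, 64, 4096):
    print(k, 1.0 / k_level_denominator(k))
```

The printed lower bounds increase with k toward 1/ln 2 = lg e ≈ 1.443, matching the constant in the theorem.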
Proof: Let k ≥ 2 be an integer, and for i = 1, ..., k define B_i = 2^{k+i-1}. In particular, we have B_1 = 2^k and B_i = B_1^{c_i} with c_i = (k + i - 1)/k. Consider the following subexpression of the bound in Lemma 2.4:

(2 + lg(8k)/lg B_1)/c_k - 1 + Σ_{i=1}^{k-1} (c_{i+1} - c_i)/c_i
= (2 + (3 + lg k)/k) · k/(2k - 1) - 1 + Σ_{i=1}^{k-1} 1/(k + i - 1)
= (2k + 3 + lg k)/(2k - 1) - 1 + Σ_{j=k}^{2k-2} 1/j
≤ (4 + lg k)/(2k - 1) + ∫_{k-1}^{2k-2} (1/x) dx
= (4 + lg k)/(2k - 1) + ln 2.

Letting k grow to infinity, Lemma 2.4 implies δ ≥ 1/ln 2 = lg e.

3 Upper Bound for the van Emde Boas Layout

In this section we give a tight analysis of the cost of searching in a binary tree stored using the van Emde Boas layout [31]. As mentioned earlier, in the veb layout, the tree is split evenly by height, except for roundoff. Thus, a tree of height h is split into a top tree of height ⌈h/2⌉ and bottom trees of height ⌊h/2⌋. It is known [5, 22] that the number of memory transfers for a search is 4 log_B N in the worst case; we give a matching configuration showing that this analysis is tight. We then consider the average-case performance over all starting positions of the tree in memory, and we show that the expected search cost is (2 + 3/√B) log_B N + O(1) memory transfers, which is tight within a (1 + o(1)) factor. We assume that the data structure begins at a random position in memory; if there is not enough space, then the data structure wraps around to the first location in memory.

A relatively straightforward analysis of this layout shows that in the worst case the number of memory transfers is no greater than four times that of the optimal cache-size-aware layout. More formally:

Theorem 3.1 Consider an N-node complete binary search tree that is stored using the Prokop veb layout. A search in this tree has memory-transfer cost of at least (4/(1 + log_B 6)) log_B N and at most 4 log_B N in the worst case.

Proof: The upper bound has been established before in the literature [5, 22]. For the lower bound we show that this value is achieved asymptotically. Let the block size be B = (2^{2k} - 1)/5 for an even number k ≥ 4, and consider a tree T of size N, where N = 2^{2^{m+1} k} - 1 for some integer m. Number the positions within a block from 0 to B - 1. As we recurse, we
eventually obtain subtrees of size 5B = 2^{2k} - 1 and, one level down, of size 2^k - 1. We align the subtree of size 5B that contains the root of T so that its first subtree R of size 2^k - 1 (which also contains the root of T) starts in position B - 1 of a block. In other words, any root-to-leaf search path in the subtree of size 5B crosses a block boundary, because the root is in the last position of a block.

Observe that R has 2^k subtrees of size 2^k - 1 hanging from its bottom level. Number these subtrees of size 2^k - 1 as they are laid out in memory in the recursive decomposition, and consider the (2(2^k + 1)/5 + 1)-th subtree of size 2^k - 1 in this ordering. The root of this tree starts at a position between (B - 1) + (2^k - 1)(2(2^k + 1)/5 - 1) + 1 = 3B - (2^k - 1) and (B - 1) + (2^k - 1) · 2(2^k + 1)/5 = 3B - 1. Thus, any root-to-leaf search path in this subtree crosses the block boundary. Observe that because trees are laid out consecutively, and 5B is a multiple of the block size, all other subtrees of size 5B start at position B - 1 inside

a block and share the above property, so that we can find a root-to-leaf path that has cost 4 inside each size-5B subtree. Notice that a root-to-leaf path accesses 2^m many size-5B subtrees, and if we choose the path according to the above positions, we know that the cost inside each size-5B subtree is 4 (a contiguous structure of size 5B may span at most 6 blocks, and the path visits 4 of them). Thus, the total search cost is 4 · 2^m. Because 2^{2^{m+1} k} - 1 = N and 5B = 2^{2k} - 1, we have

4 · 2^m = 4 lg(N + 1)/lg(5B + 1) ≥ (4 lg B/lg(5B + 1)) log_B N.

Furthermore, we bound the factor lg B/lg(5B + 1) as follows:

lg(5B + 1) ≤ lg 6B = lg 6 + lg B,

so that lg B/lg(5B + 1) ≥ lg B/(lg 6 + lg B) = 1/(1 + log_B 6). Therefore, the total search cost is at least (4/(1 + log_B 6)) log_B N memory transfers in the worst case.

However, few paths in the tree have this property, which suggests that in practice, the Prokop veb layout results in a much lower memory-transfer cost, assuming random placement in memory. In Theorem 3.3, appearing shortly, we formalize this notion. First, however, we give the following useful inequality to simplify the proof.

Claim 3.2 Let B be a power of 2, and let t and t' be positive numbers satisfying t/2 ≤ t' ≤ 2t, √B/2 ≤ t ≤ √B, and tt' ≥ B. Then

2 + (t + t')/B ≤ (2 + 3/√B)(lg t + lg t')/lg B.

Proof: Because t² + t'² ≤ 5tt'/2 for all t/2 ≤ t' ≤ 2t, we have

t + t' ≤ 3√(tt'/2).   (1)

Define x = √(tt'); the conditions on t and t' imply √B ≤ x ≤ √(2B). Define

f(x) = (2 + 3/√B)(2 lg x)/lg B - 2 - (3/√2)(x/B).

We will show that f(x) ≥ 0 for √B ≤ x ≤ √(2B). First, we calculate the second derivative of f(x):

f''(x) = -2(2 + 3/√B)/(x² ln 2 · lg B) ≤ 0.

Thus, f(x) is concave in the range √B ≤ x ≤ √(2B). Because both f(√B) = (3/√B)(1 - 1/√2) and f(√(2B)) = (2 + 3/√B)/lg B are greater than zero, we obtain f(x) ≥ 0 for √B ≤ x ≤ √(2B), which is equivalent to

2 + (3/√2)(x/B) ≤ (2 + 3/√B)(2 lg x)/lg B.

From (1) and the above inequality, we obtain the claim:

2 + (t + t')/B ≤ 2 + (3/√2)(x/B) ≤ (2 + 3/√B)(lg t + lg t')/lg B.

Theorem 3.3 Consider a search path in an N-node complete binary search tree of height h that is stored in the veb layout, with the start of its representation in memory determined uniformly at random within a block of size B. Then the expected memory-transfer cost of the search is at most (2 + 3/√B) log_B N.

Proof: Although the recursion proceeds to the base case where trees have height 1, conceptually we stop the recursion at the level of detail where each recursive subtree has at most B nodes. We call those subtrees critical recursive subtrees, because they are recursive subtrees in the most important level of detail. Let the number of nodes in a subtree T be |T|. Therefore, any critical recursive subtree T has |T| nodes, where √B/2 ≤ |T| ≤ B. Note that because of roundoff, we cannot guarantee that |T| ≥ √B. In particular, if a tree has B + 1 nodes and its height h is odd, then the bottom trees have height (h - 1)/2, and therefore contain roughly √(B/2) nodes.

There are exactly |T| - 1 initial positions for the upper tree that result in T being laid out across a block boundary. Similarly, there are B - |T| + 1 positions in which T does not cross a block boundary. Hence, the local expected cost of accessing T is

2(|T| - 1)/B + (B - |T| + 1)/B = 1 + (|T| - 1)/B ≤ 1 + |T|/B.

Now we need two cases to deal with the roundoff. If √B/2 ≤ |T| < √B for the critical recursive subtree T, then we consider the next larger level of detail. There exists another critical recursive subtree T' immediately above T on the search path in this level of detail. Notice that |T| · |T'| ≥ B, because otherwise we would consider the coarser level of detail for our critical recursive subtree. Because we cut in the middle, we know that 2|T| ≥ |T'| ≥ |T|/2. From Claim 3.2, the expected cost of accessing T and T' is at most

(1 + |T|/B) + (1 + |T'|/B) ≤ (2 + 3/√B)(lg |T| + lg |T'|)/lg B.

For √B ≤ |T| ≤ B for the critical
recursive subtree T, we show that the cost of accessing T is at most (2 + 3/√B)·(lg|T|)/lg B. Define f(x) as follows:

f(x) = (2 + 3/√B)·(lg x)/lg B − 1 − x/B.

By calculating

f''(x) = −(2 + 3/√B)/(x² ln B) ≤ 0,

we know f(x) is concave. Because both f(√B) and f(B) are greater than zero, we obtain f(x) ≥ 0 for the entire range √B ≤ x ≤ B. Thus, considering f(|T|), we obtain that the expected cost of accessing T is

1 + |T|/B ≤ (2 + 3/√B)·(lg|T|)/lg B.
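The expected-cost analysis above can be sanity-checked numerically. The following Python sketch (not from the paper; the even-split layout and the particular block size and tree height are arbitrary illustrative choices) lays out a complete binary tree in van Emde Boas order, draws random root-to-leaf paths and a uniformly random block offset, and compares the measured expected number of distinct blocks touched against the (2 + 3/√B)·log_B N bound of Theorem 3.3:

```python
import math
import random

def veb_path_positions(h, bits, base=0):
    """Array positions visited by a root-to-leaf search path in a complete
    binary tree of height h laid out in the van Emde Boas order.
    `bits` is the sequence of left/right choices (h - 1 of them)."""
    if h == 1:
        return [base]
    top = (h + 1) // 2          # height of the top recursive subtree
    bot = h - top               # height of the bottom recursive subtrees
    top_size = 2 ** top - 1
    bot_size = 2 ** bot - 1
    pos = veb_path_positions(top, bits[:top - 1], base)
    idx = 0                     # index of the bottom subtree the path enters
    for b in bits[:top]:
        idx = 2 * idx + b
    pos += veb_path_positions(bot, bits[top:], base + top_size + idx * bot_size)
    return pos

def expected_cost(h, B, trials=2000, seed=1):
    """Monte Carlo estimate of the expected number of distinct blocks touched
    by a search, with the layout's start offset uniform in [0, B)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        bits = [rng.randint(0, 1) for _ in range(h - 1)]
        offset = rng.randrange(B)
        blocks = {(p + offset) // B for p in veb_path_positions(h, bits)}
        total += len(blocks)
    return total / trials

h, B = 16, 16
N = 2 ** h - 1
cost = expected_cost(h, B)
bound = (2 + 3 / math.sqrt(B)) * math.log(N, B)
print(f"measured {cost:.2f} vs bound {bound:.2f} (log_B N = {math.log(N, B):.2f})")
```

On these parameters the measured cost lands between log_B N and the bound, as the analysis predicts.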

Combining the above arguments, we conclude that although the critical recursive subtrees on a search path may have different sizes, their expected memory-transfer cost is at most

(2 + 3/√B) · Σ_T lg|T| / lg B ≤ (2 + 3/√B) · log_B N,

because the heights of the critical recursive subtrees on the path sum to the height of the tree. This is a factor of 2 + 3/√B times the (optimal) performance of a B-tree.

4 Upper Bounds for Cache-Oblivious Searching

We give two cache-oblivious layouts for achieving our main upper bound. One layout, which we term the generalized van Emde Boas layout, is very simply defined but has a lengthy analysis. The other layout, termed the multi-layer van Emde Boas layout, is slightly more involved but allows a shorter proof. The intuition behind both methods is to ensure an appropriate diversity of sizes of subtrees at each level of recursion of a van Emde Boas style layout. Looking at the level of recursion where the sizes of subtrees drop below the block size B, these sizes will then be spread out in the interval [√B, B]. Under randomization of the starting position of the data structure in memory, smaller subtrees are less likely to straddle a block boundary, thus avoiding a second I/O for these subtrees during a search. On the other hand, smaller subtrees mean more such subtrees on a search path, and hence more I/Os. The main question is what is the best possible balance between the two effects, holding true for any value of the block size B.

In the multi-layer van Emde Boas layout, we specify an explicit diversity of sizes of subtrees at each level of the recursion, which we can prove achieves a balance between the two effects that entails a cache-oblivious search cost asymptotically meeting our lower bound. In the generalized van Emde Boas layout, we split the tree height in a fixed, uneven way during the recursion, and we are able to show that this simple scheme achieves the same result. Intuitively, the recursive uneven division provides a spread of subtree sizes with the same effect as our explicit scheme.

5 Upper Bound for the Generalized van Emde Boas
Layout

We now propose and analyze a generalized van Emde Boas layout having a better search cost. In the original vEB layout, the top recursive subtree and the bottom recursive subtrees have the same height (except for roundoff). At first glance this even division would seem to yield the best memory-transfer cost. Surprisingly, we can improve the van Emde Boas layout by selecting different heights for the top and bottom subtrees.

The generalized vEB layout is as follows. Suppose the complete binary tree contains N = 2^h − 1 nodes and has height h = lg(N+1). Let a and b be constants such that 0 < a < 1 and b = 1 − a. Conceptually we split the tree at the edges below the nodes at depth ⌈ah⌉. This splits the tree into a top recursive subtree of height ⌈ah⌉ and 2^{⌈ah⌉} bottom recursive subtrees of height ⌊bh⌋. Thus, there are roughly N^a bottom recursive subtrees and each bottom recursive subtree contains roughly N^b nodes. We map the nodes of the tree into positions in the array by recursively laying out the subtrees contiguously in memory. The base case is reached when the trees have one node, as in the standard vEB layout.

We find the values of a and b that yield a layout whose memory-transfer cost is arbitrarily close to [lg e + O(lg lg B / lg B)] log_B N + O(1), for a = 1/2 − ξ and large enough N.

We focus our analysis on the first level of detail where recursive subtrees have size at most the block size B. In our analysis, memory transfers are classified into two types. There are V path-length memory transfers, which are caused by accessing different recursive subtrees in the level of detail of the analysis, and there are C page-boundary memory transfers, which are caused by a single recursive subtree in this level of detail straddling two consecutive blocks. It turns out that each of these components has the same general recursive expression and differs only in the base cases. The total number of memory transfers is at most V + C by linearity of expectation.

The recurrence relation obtained contains the rounded-off terms ⌈ax⌉ and ⌊bx⌋
that are cumbersome to analyze. We show that if we ignore the roundoff operators, then the error term is small. We obtain a solution expressed

in terms of a power series of the roots of the characteristic polynomial of the recurrence. We show for both V and C that the largest root is unique and hence dominates all other roots, resulting in asymptotic expressions in terms of the dominant root. Using these asymptotic expressions, we obtain the main result, namely a layout whose total cost is arbitrarily close to [lg e + O(lg lg B / lg B)] log_B N + O(1) as the split factor a = 1/2 − ξ approaches 1/2 and for N large enough. This performance matches the lower bound from Section 2 up to low-order terms.

Causes of Memory Transfers: Path-Length and Block-Boundary-Crossing Functions

We let B(x) denote the expected block cost of a search in a tree of height x. To begin, we explain the base case for the recurrence, when the entire tree is a critical recursive subtree. Recall that a critical recursive subtree is a recursive subtree of size at most B. If a critical recursive subtree crosses a block boundary, then the block cost is 2; otherwise the block cost is 1. As in Theorem 3.3, the expected block cost of accessing a critical recursive subtree T of size |T| = t and height x = lg(t+1) is

1 + (t − 1)/B = 1 + (2^x − 2)/B.

Thus, the base case is when |T| ≤ B, which means that t ≤ B and x ≤ lg(B+1).

We next give the recurrence for the block cost B(x) of a tree T of height x. By linearity of expectation, the expected block cost is at most that of the top recursive subtree plus the bottom recursive subtree, i.e., B(x) ≤ B(⌈ax⌉) + B(⌊bx⌋) for x > lg(B+1),³ for a + b = 1, 0 < a ≤ b < 1. We decompose (an upper bound on) the cost B(x) into two pieces. Let V(x) be the number of critical recursive subtrees visited along a root-to-leaf path (V stands for "vertical"), i.e.,

V(x) = { V(⌈ax⌉) + V(⌊bx⌋), for x > lg(B+1);  1, for x ≤ lg(B+1). }  (2)

Let C(x) be the expected number of critical recursive subtrees straddling block boundaries along the root-to-leaf path (C stands for "crossing"), i.e.,

C(x) = { C(⌈ax⌉) + C(⌊bx⌋), for x > lg(B+1);  (2^x − 2)/B, for x ≤ lg(B+1). }  (3)

Observe that both V(x) and C(x) are monotonically increasing. By linearity of expectation, we obtain B(x) ≤ V(x) + C(x) for all x. The
recurrences for V(x) and C(x) are both of the form F(x) = F(⌈ax⌉) + F(⌊bx⌋). As we will see, it is easier to analyze a recurrence of the form G(x) = G(ax) + G(bx), where the roundoff error is removed. In the next few pages, we show that F(x) can be approximated by G(x) as x increases. Afterwards, we show how to calculate G(x).

³We cannot claim equality, i.e., that B(x) = B(⌈ax⌉) + B(⌊bx⌋), because the leaf node of the top recursive subtree and the root node of a bottom recursive subtree can belong to the same block. Thus, an equal sign in the recurrence might double count one memory transfer.
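To get a feel for recurrences (2) and (3), they can be evaluated directly. The following Python sketch (illustrative only; the split factor a, block size, and tree height are hypothetical choices, and it keeps the roundoff that the analysis below removes) computes V(h) + C(h) and compares it with log_B N:

```python
import math
from functools import lru_cache

def make_VC(a, B):
    """Evaluate the path-length and boundary-crossing recurrences
    V(x) = V(ceil(a x)) + V(floor(b x)) and likewise for C, with the
    base cases of equations (2) and (3), for integer heights x."""
    b = 1.0 - a
    base = math.log2(B + 1)      # critical recursive subtrees: height <= lg(B+1)

    @lru_cache(maxsize=None)
    def V(x):
        if x <= base:
            return 1.0
        return V(math.ceil(a * x)) + V(math.floor(b * x))

    @lru_cache(maxsize=None)
    def C(x):
        if x <= base:
            return (2.0 ** x - 2.0) / B
        return C(math.ceil(a * x)) + C(math.floor(b * x))

    return V, C

B, h = 1024, 100                  # block size and tree height (N = 2^h - 1)
logBN = h / math.log2(B)
for a in (0.5, 0.45):
    V, C = make_VC(a, B)
    total = V(h) + C(h)
    print(f"a={a}: (V+C)/log_B N = {total / logBN:.3f}")
```

The printed ratio is the constant in front of log_B N for that particular a and B; the analysis that follows determines the best achievable constant over all block sizes.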

Roundoff Error Is Small

We next show that as x increases, the difference between F(x) and G(x) can be bounded. To quantify the difference between F(x) and G(x) (see Theorem 5.4), we use functions β(x) and δ(x), defined recursively below.

Definition 5.1. Let a < min{1/2, 2/3}. Define the recursive functions β(x) and δ(x) as follows:

β(x) = 0 for x ≤ 3, and β(x) = β(ax + 1) + 1 for x > 3;
δ(x) = 1 for x ≤ 3, and δ(x) = δ(ax + 1)·(1 + 2a^{β(x)−2}) for x > 3.

The following lemma gives upper and lower bounds on β(x).

Lemma 5.2. For all x > 1, the function β(x) satisfies

a/(2x) ≤ a^{β(x)} ≤ 2/(a²x).  (4)

Proof: Let K = (2 − 3a)/(1 − a), which is positive because a < 2/3. For a parameter n ≥ 1, define the nth interval I_n to be

I_n = ( 1/(1−a) + K·(1/a)^{n−1}, 1/(1−a) + K·(1/a)^n ],

and let I_0 = (0, 3]. We first prove the following claim for all n ≥ 0:

β(x) = n for all x ∈ I_n.  (5)

We establish (5) in two parts: we first show by induction on n that it holds for all n and all x ∈ I_n; we then explain that the intervals I_0, I_1, I_2, … cover (0, ∞).

Base case: If x ∈ I_0 = (0, 3], then β(x) = 0 directly from Definition 5.1, so (5) holds in the base case.

Induction step: Assume that (5) holds for the nth interval I_n. We will show that (5) also holds for the (n+1)st interval I_{n+1}, i.e., when

x ∈ I_{n+1} = ( 1/(1−a) + K·(1/a)^n, 1/(1−a) + K·(1/a)^{n+1} ],

or equivalently, when

1/(1−a) + K·(1/a)^n < x ≤ 1/(1−a) + K·(1/a)^{n+1}.  (7)

Multiplying (7) by a and adding 1 to all sides, and noting that a·(1/(1−a)) + 1 = 1/(1−a), we see that (7) is equivalent to

1/(1−a) + K·(1/a)^{n−1} < ax + 1 ≤ 1/(1−a) + K·(1/a)^n,

i.e., ax + 1 ∈ I_n. Thus, by induction (plugging ax + 1 for x in (5)), we obtain β(ax + 1) = n. Noticing that β(x) = β(ax + 1) + 1 by Definition 5.1, we establish β(x) = n + 1, which is (5) for x ∈ I_{n+1}.

We now prove the second part, that ∪_{n≥0} I_n covers (0, ∞). The intervals are consecutive: the right endpoint of I_0 is 3 = 1/(1−a) + K, which is the left endpoint of I_1, and in general the right endpoint of I_n is the left endpoint of I_{n+1}. Because 1/a > 1, the right endpoints grow without bound, so every x > 0 lies in some I_n. This claim uses K > 0, which follows because a < 2/3.

We have now established (5) for all x > 0. We next show that (5) implies the lemma statement, i.e., (4). For x > 3 we have β(x) ≥ 1, so a^{β(x)} ≤ a, and from the upper half of (5), a^{β(x)}·(x − 1/(1−a)) ≤ K. Therefore

a^{β(x)}·x ≤ K + a^{β(x)}·1/(1−a) ≤ K + a/(1−a) = (2 − 2a)/(1−a) = 2,

i.e., a^{β(x)} ≤ 2/x ≤ 2/(a²x). For 1 < x ≤ 3 we have a^{β(x)} = 1 ≤ 2/(a²x), because a²x ≤ 3a² < 2. For the lower bound, the lower half of (5) gives x − 1/(1−a) > K·(1/a)^{β(x)−1}, so

a^{β(x)} > aK/(x − 1/(1−a)) > aK/x ≥ a/(2x),

because K ≥ 1/2 (equivalent to a ≤ 3/5, which holds since a < 1/2). For 1 < x ≤ 3, a^{β(x)} = 1 ≥ a/(2x) trivially. Thus we prove a/(2x) ≤ a^{β(x)} ≤ 2/(a²x) for all x > 1, as claimed.

The following lemma gives the properties and an upper bound of δ(x).

Lemma 5.3. The function δ(x) has the following properties:

(1) If β(x) = β(y), then δ(x) = δ(y).

(2) For all x > 1, (ax + 1)·δ(ax + 1) ≤ ax·δ(x).

(3) For all x > 1, δ(x) ≤ exp[2/(a(1−a))] = O(1).

Proof: (1) This claim follows from Definition 5.1: by induction, δ(x) = Π_{j=1}^{β(x)} (1 + 2a^{j−2}), so δ(x) is determined by β(x).

(2) This claim follows from Definition 5.1 of δ(x) and Lemma 5.2:

δ(x)/δ(ax + 1) = 1 + 2a^{β(x)−2} ≥ 1 + 1/(ax) = (ax + 1)/(ax),

where the inequality uses the lower bound a^{β(x)} ≥ a/(2x) from Lemma 5.2.

(3) Recall that from Definition 5.1 we have δ(x)/δ(ax + 1) = 1 + 2a^{β(x)−2} for all x > 3. Furthermore, because 1 + y < e^y is true for any y > 0, we bound the function δ(·) as follows:

δ(x) ≤ δ(ax + 1)·exp[2a^{β(x)−2}].  (8)

For simplification, define P_i to be the polynomial P_i = a^i·x + a^{i−1} + ⋯ + a + 1 (with P_0 = x). In the following, we show that there exists some large integer n such that P_{n+1} = a^{n+1}x + a^n + ⋯ + a + 1 ≤ 3. First, because a < 1, a^{n+1}x is arbitrarily small when n goes to infinity; thus, if n is big enough, then

a^{n+1}x ≤ 1  (9)

for any fixed number x. Second, because a < 1/2, we have

a^n + ⋯ + a + 1 < 1/(1−a) < 2  (10)

for all integers n. Therefore, combining (9) and (10), we obtain that there exists some big integer n such that P_{n+1} ≤ 3, which means, by Definition 5.1 of δ(x), that δ(P_{n+1}) = 1. Therefore, δ(x) can be expressed as the product of n + 1 factors, i.e.,

δ(x) = [δ(x)/δ(ax+1)] · [δ(ax+1)/δ(a²x+a+1)] ⋯ [δ(a^n x + a^{n−1} + ⋯ + 1)/δ(a^{n+1}x + a^n + ⋯ + 1)].

Using the notation P_i in the above equation, we get the simplified version

δ(x) = Π_{i=0}^{n} δ(P_i)/δ(P_{i+1}).  (11)

To bound δ(x), we first give an upper bound for δ(P_i)/δ(P_{i+1}). Notice that P_{i+1} = aP_i + 1. Replacing x by P_i in (8), we have the upper bound

δ(P_i)/δ(P_{i+1}) ≤ exp[2a^{β(P_i)−2}].

We claim that β(P_i) = n + 1 − i for all 0 ≤ i ≤ n + 1. We prove this claim by induction. The base case is P_{n+1}: from Definition 5.1 and P_{n+1} ≤ 3, we have β(P_{n+1}) = 0. Assume the claim holds for some P_i; we prove that it holds for P_{i−1}. Because P_i = aP_{i−1} + 1, we have β(P_{i−1}) = β(P_i) + 1 from Definition 5.1 of β(x). Therefore, by induction, we obtain β(P_{i−1}) = (n + 1 − i) + 1 = n + 1 − (i − 1), as claimed.

Thus, each of the factors δ(P_i)/δ(P_{i+1}) has the upper bound exp[2a^{n−1−i}]. Therefore, we obtain

δ(x) ≤ exp[ Σ_{i=0}^{n} 2a^{n−1−i} ].

Because Σ_{i=0}^{n} a^{n−1−i} = (1/a)·Σ_{j=0}^{n} a^j < (1/a)·1/(1−a) = 1/(a(1−a)), we prove that

δ(x) ≤ exp[2/(a(1−a))],

as claimed.

Theorem 5.4 (Roundoff Error). Let F(x) = F(⌈ax⌉) + F(⌊bx⌋) and G(x) = G(ax) + G(bx) for 0 < a ≤ b < 1 and a + b = 1. Then for B > 8, all x > 1, and constant c, if G(x) ≤ cx + O(1), then

F(x) ≤ G(xδ(x)) ≤ c[1 + O(1/lg B)]·x + O(1).

Proof: First recall that F(x) and G(x) are monotonically increasing. Thus, from ⌈ax⌉ ≤ ax + 1 and ⌊bx⌋ ≤ bx, we have

F(x) ≤ F(ax + 1) + F(bx).  (12)

We prove F(x) ≤ G(xδ(x)) inductively. The base case is when 1 < x ≤ t₁, where t₁ = lg(B+1) is the common base-case threshold of F and G: there F(x) = G(x), and because δ(x) ≥ 1 and G is monotone, F(x) ≤ G(xδ(x)). Assuming F(x) ≤ G(xδ(x)) is true for 1 < x ≤ t, we prove it is true for 1 < x ≤ (t−1)/b. Noticing that (t−1)/b ≤ min{t/b, (t−1)/a} (because b ≥ a), we have ax + 1 ≤ t and bx ≤ t for all 1 < x ≤ (t−1)/b. Thus, by the assumption and (12), we obtain

F(x) ≤ G((ax+1)·δ(ax+1)) + G(bx·δ(bx)), for 1 < x ≤ (t−1)/b.  (13)

From property (2) in Lemma 5.3 and δ(bx) ≤ δ(x), we obtain

G((ax+1)·δ(ax+1)) ≤ G(ax·δ(x)) and G(bx·δ(bx)) ≤ G(bx·δ(x)).  (14)

Plugging (14) into (13), we derive that

F(x) ≤ G(ax·δ(x)) + G(bx·δ(x)) = G(xδ(x)), for 1 < x ≤ (t−1)/b.

Therefore, after two inductive steps, it is true for 1 < x ≤ (t − 1 − b)/b², and after n inductive steps, it is true for all

1 < x ≤ (t − 1 − b − ⋯ − b^{n−1})/b^n = (t − (1 − b^n)/(1 − b))/b^n.

Therefore, as long as t₁ > 1/(1−b) = 1/a, the range grows without bound, and we have F(x) ≤ G(xδ(x)) for all x > 1. Thus, we need t₁ > 1/a, which holds when B > 8 and a > 1/3. Furthermore, if G(x) ≤ cx + O(1), then by property (3) in Lemma 5.3, we obtain the following:

F(x) ≤ G(xδ(x)) ≤ cx·δ(x) + O(1) ≤ c[1 + O(1/lg B)]·x + O(1).

Bounding the Path-Length Function

We now determine the constant in the search cost O(log_B N), for given values of a and b. To do so, we assume

a = 1/q^k and b = 1/q^m,  (15)

for a positive real number q > 1 and relatively prime integers m and k. Plugging (15) into a + b = 1, we obtain 1/q^k + 1/q^m = 1. Define

n = k − m.  (16)

Observe that because k > m (since a < b), n is positive. We now have the simplified formula

q^k = q^n + 1.  (17)

The rationale behind this assumption is that the additional structure helps us in the analysis while still being dense; that is, for any given a and b satisfying a + b = 1, we can find a' and b' of the form (15) that are arbitrarily close to a and b. Because there exists a real number r such that a = b^r, we choose a rational number k/m (with k and m relatively prime) as close as desired to r. Let q = b^{−1/m}. Then b = 1/q^m exactly, and a' = 1/q^k is arbitrarily close to a. We call such an (a, b) pair a twin power pair.

As before, we analyze V(x) first. We ignore the roundoff, based on Theorem 5.4. Furthermore, we normalize the range for which V(x) = 1 by introducing the function

H(x) = { H(ax) + H(bx), for x > 1;  1, for 0 < x ≤ 1. }  (18)

Note that V(x) ≤ H(xδ(x)) by Theorem 5.4. First we state the primary lemma of this subsection, which we prove later.

Lemma 5.5. Let (1/q^k, 1/q^m) be a twin power pair, and let n = k − m. Then for any constant ε > 0 and

c_k = ( Σ_{i=1}^{n} q^i + Σ_{i=n+1}^{k} q^i ) / (k − n·q^{n−k}),

when x = Ω(1/ε) we have H(x) ≤ c_k(1 + ε)·q·x + O(1).
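The constant of Lemma 5.5 can be probed numerically. The sketch below (not from the paper; the sample heights and values of k are arbitrary) solves q^k = q + 1, the twin power pair used later in Theorem 5.7, sets a = 1/(1+q), and measures max H(x)/x over sample points for the normalized recurrence H(x) = H(ax) + H(bx) with H(x) = 1 for 0 < x ≤ 1:

```python
import math

def solve_q(k, lo=1.0, hi=2.0):
    """Root q > 1 of q^k = q + 1 (twin power pair with m = k - 1), by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid ** k - mid - 1 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def h_ratio(a, x_samples):
    """Max of H(x)/x, where H(x) = H(ax) + H(bx) and H(x) = 1 for x <= 1."""
    b = 1.0 - a
    def H(x):
        if x <= 1.0:
            return 1.0
        return H(a * x) + H(b * x)
    return max(H(x) / x for x in x_samples)

xs = [200.0 * 1.07 ** i for i in range(25)]    # log-spaced sample heights
for k in (5, 10, 20):
    q = solve_q(k)
    a = 1.0 / (1.0 + q)       # since q^k = q + 1, a = q^(-k) = 1/(1 + q)
    print(f"k={k}: q={q:.4f}  a={a:.4f}  max H(x)/x = {h_ratio(a, xs):.4f}")
print("lg e =", 1 / math.log(2))
```

The measured ratios necessarily lie between 1 and 2 (a short induction shows x ≤ H(x) ≤ 2x when 1/3 < a ≤ 1/2); the paper's claim is that, taking the supremum over x and letting k grow, the constant c_k·q approaches lg e ≈ 1.443.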

Corollary 5.6. For any constant ε > 0, the number V(x) of recursive subtrees on a root-to-leaf path is bounded by c_k(1 + ε)·q·log_B N + O(1), when N ≥ B^{Ω(1/ε)}.

We obtain the main upper-bound result of the paper by showing that c_k·q → lg e for a suitable sequence of twin power pairs.

Theorem 5.7 (Path-Length Cost). For any constant ε > 0, the number of recursive subtrees on a root-to-leaf path is at most

(lg e + ε) log_B N + O(1) ≤ (1.443 + ε) log_B N + O(1),

as the split factor a = 1/2 − ξ approaches 1/2.

Proof: Choose the twin power pair a = 1/q^k and b = 1/q^{k−1}, so that

1/q^k + 1/q^{k−1} = 1,

which is equivalent to q^k = q + 1. The approximate solution of this equation for large k is q ≈ 1 + ln 2/k. Therefore, we have

a = 1/q^k = 1/(1 + q) ≈ 1/(2 + ln 2/k).  (19)

From Lemma 5.5, for m = k − 1 (and therefore n = 1), we have

c_k = ( q + Σ_{i=2}^{k} q^i ) / (k − q^{1−k}) = q² / ( (q − 1)·(k − q/(q+1)) ),

using Σ_{i=1}^{k} q^i = q(q^k − 1)/(q − 1) and q^k − 1 = q. Thus, for large k, we obtain

c_k·q = q³ / ( (q − 1)·(k − q/(q+1)) ) → 1/ln 2 = lg e,

because (q − 1)·k → ln 2 and q → 1. That is, for a given ε/2 > 0, we can choose a big constant k_ε such that

c_k·q ≤ lg e + ε/2  (20)

for all k ≥ k_ε. From Corollary 5.6, for a given ε/8 > 0 and the above constant k_ε, we can choose a big constant N_{ε,k} such that

V(x) ≤ c_k(1 + ε/8)·q·log_B N + O(1)  (21)

for all N ≥ N_{ε,k}. Plugging (20) into (21) and noticing that q^k = 1/a < 4, we obtain

V(x) ≤ (lg e + ε) log_B N + O(1) ≤ (1.443 + ε) log_B N + O(1),

as claimed.

Noticing (19), for big k_ε the split factor a approaches 1/2. In particular, as long as

ξ ≤ 1/2 − 1/(2 + ln 2/k_ε) = ln 2/(4k_ε + ln 4),

it suffices that the split factor is a = 1/2 − ξ.

To complete the proof of Lemma 5.5, we establish some properties of H(x). Since H(x) is monotonically increasing, we can bound the value H(x)/x for q^i ≤ x ≤ q^{i+1} as follows:

H(q^i)/q^{i+1} ≤ H(x)/x ≤ H(q^{i+1})/q^i.  (22)

Let H_min be a lower bound and H_max an upper bound on H(q^i)/q^i, for all i larger than a given integer j. Noticing that the left part of Inequality (22) satisfies H(q^i)/q^{i+1} ≥ H_min/q and the right part satisfies H(q^{i+1})/q^i ≤ q·H_max, we obtain

H_min/q ≤ H(x)/x ≤ q·H_max,

when x is bigger than q^j.

We give the recurrence for H(·). From (18), we have that for i ≥ 0,

H(q^{i+1}) = H(a·q^{i+1}) + H(b·q^{i+1}).  (23)

Plugging (15) into (23), and since n = k − m, we obtain

H(q^{i+1}) = H(q^{i+1−k}) + H(q^{i+n+1−k}).  (24)

For the sake of notational simplicity, we denote α_i = H(q^{i+1−k}). Therefore, (24) is equivalent to

α_{i+k} = α_{i+n} + α_i.  (25)

We define the characteristic polynomial of Recurrence (25) as w(x) = x^k − x^n − 1. Let r_1, r_2, …, r_k be the (possibly complex) roots of w(x). We claim below that these roots are all distinct. The following four lemmas supply the basic mathematical facts behind the proof of Lemma 5.5.

Lemma 5.8. The roots of w(x) = x^k − x^n − 1 are distinct, when k and n are relatively prime integers such that n < k.

Proof: We prove this lemma by contradiction. If a root r of w(x) is not simple, then (x − r)² is a factor of w(x), and x − r is a factor of

w'(x) = k·x^{k−1} − n·x^{n−1} = k·x^{n−1}·(x^{k−n} − n/k).

Thus, r is either 0 or a root of x^{k−n} − n/k. But 0 is not a root of w(x). Therefore,

r^{k−n} = n/k,  (26)

which means |r| < 1 (because n < k). On the other hand, because r is a root of w(x),

w(r) = r^n·(r^{k−n} − 1) − 1 = 0.

Plugging (26) into w(r) = 0, we obtain r^n = 1/(n/k − 1), which means |r| > 1 (because |r|^n = k/(k − n) > 1). This is the contradiction. Therefore, every root of w(x) is simple.

Because w'(x) = k·x^{k−1} − n·x^{n−1} > 0 when x > 1, and q is a root of w(x) greater than 1 (see Equation (17)), there is one unique real root q > 1 of w(x). Without loss of generality, let r_1 = q. We now show that if
the roots of the characteristic polynomial of a series are distinct, then the series in question is a linear combination of the power series {r_j^i} of the roots.
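The dominance of the real root q can also be observed computationally. The following sketch (illustrative only; k = 5, n = 2 is an arbitrary coprime pair) iterates Recurrence (25) from positive initial values and checks that the ratio of consecutive terms converges to the unique real root q > 1 of w(x) = x^k − x^n − 1:

```python
def dominant_root(k, n, iters=500):
    """Estimate the dominant root of w(x) = x^k - x^n - 1 by iterating the
    linear recurrence alpha_{i+k} = alpha_{i+n} + alpha_i (Recurrence (25))
    and tracking the ratio of consecutive terms."""
    window = [1.0] * k               # arbitrary positive initial conditions
    ratio = 0.0
    for _ in range(iters):
        new = window[n] + window[0]  # alpha_{i+k} = alpha_{i+n} + alpha_i
        ratio = new / window[-1]
        # shift the window and rescale to avoid floating-point overflow;
        # a uniform rescaling does not change the ratio of consecutive terms
        window = [w / new for w in window[1:]] + [1.0]
    return ratio

def bisect_root(k, n, lo=1.0, hi=2.0):
    """The unique real root q > 1 of x^k - x^n - 1 (cf. Lemma 5.8), by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid ** k - mid ** n - 1 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

k, n = 5, 2                          # gcd(k, n) = 1 and n < k
q_iter = dominant_root(k, n)
q_poly = bisect_root(k, n)
print(f"ratio limit {q_iter:.10f}  vs  real root {q_poly:.10f}")
```

Convergence of the ratio is exactly the dominant-root phenomenon used in the asymptotic expressions for V and C: the contribution of every other root r_j decays like |r_j/q|^i.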


Size-Depth Tradeoffs for Boolean Formulae

Size-Depth Tradeoffs for Boolean Formulae Size-Depth Tradeoffs for Boolean Formulae Maria Luisa Bonet Department of Mathematics Univ. of Pennsylvania, Philadelphia Samuel R. Buss Department of Mathematics Univ. of California, San Diego July 3,

More information

b + O(n d ) where a 1, b > 1, then O(n d log n) if a = b d d ) if a < b d O(n log b a ) if a > b d

b + O(n d ) where a 1, b > 1, then O(n d log n) if a = b d d ) if a < b d O(n log b a ) if a > b d CS161, Lecture 4 Median, Selection, and the Substitution Method Scribe: Albert Chen and Juliana Cook (2015), Sam Kim (2016), Gregory Valiant (2017) Date: January 23, 2017 1 Introduction Last lecture, we

More information

Midterm Exam. CS 3110: Design and Analysis of Algorithms. June 20, Group 1 Group 2 Group 3

Midterm Exam. CS 3110: Design and Analysis of Algorithms. June 20, Group 1 Group 2 Group 3 Banner ID: Name: Midterm Exam CS 3110: Design and Analysis of Algorithms June 20, 2006 Group 1 Group 2 Group 3 Question 1.1 Question 2.1 Question 3.1 Question 1.2 Question 2.2 Question 3.2 Question 3.3

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

2 Generating Functions

2 Generating Functions 2 Generating Functions In this part of the course, we re going to introduce algebraic methods for counting and proving combinatorial identities. This is often greatly advantageous over the method of finding

More information

Multivalued Decision Diagrams. Postoptimality Analysis Using. J. N. Hooker. Tarik Hadzic. Cork Constraint Computation Centre

Multivalued Decision Diagrams. Postoptimality Analysis Using. J. N. Hooker. Tarik Hadzic. Cork Constraint Computation Centre Postoptimality Analysis Using Multivalued Decision Diagrams Tarik Hadzic Cork Constraint Computation Centre J. N. Hooker Carnegie Mellon University London School of Economics June 2008 Postoptimality Analysis

More information

Machine Minimization for Scheduling Jobs with Interval Constraints

Machine Minimization for Scheduling Jobs with Interval Constraints Machine Minimization for Scheduling Jobs with Interval Constraints Julia Chuzhoy Sudipto Guha Sanjeev Khanna Joseph (Seffi) Naor Abstract The problem of scheduling jobs with interval constraints is a well-studied

More information

Divide and Conquer. Arash Rafiey. 27 October, 2016

Divide and Conquer. Arash Rafiey. 27 October, 2016 27 October, 2016 Divide the problem into a number of subproblems Divide the problem into a number of subproblems Conquer the subproblems by solving them recursively or if they are small, there must be

More information

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

Cache-Oblivious Hashing

Cache-Oblivious Hashing Cache-Oblivious Hashing Zhewei Wei Hong Kong University of Science & Technology Joint work with Rasmus Pagh, Ke Yi and Qin Zhang Dictionary Problem Store a subset S of the Universe U. Lookup: Does x belong

More information

Divide and conquer. Philip II of Macedon

Divide and conquer. Philip II of Macedon Divide and conquer Philip II of Macedon Divide and conquer 1) Divide your problem into subproblems 2) Solve the subproblems recursively, that is, run the same algorithm on the subproblems (when the subproblems

More information

Asymptotic Notation. such that t(n) cf(n) for all n n 0. for some positive real constant c and integer threshold n 0

Asymptotic Notation. such that t(n) cf(n) for all n n 0. for some positive real constant c and integer threshold n 0 Asymptotic Notation Asymptotic notation deals with the behaviour of a function in the limit, that is, for sufficiently large values of its parameter. Often, when analysing the run time of an algorithm,

More information

Online Sorted Range Reporting and Approximating the Mode

Online Sorted Range Reporting and Approximating the Mode Online Sorted Range Reporting and Approximating the Mode Mark Greve Progress Report Department of Computer Science Aarhus University Denmark January 4, 2010 Supervisor: Gerth Stølting Brodal Online Sorted

More information

RMT 2013 Power Round Solutions February 2, 2013

RMT 2013 Power Round Solutions February 2, 2013 RMT 013 Power Round Solutions February, 013 1. (a) (i) {0, 5, 7, 10, 11, 1, 14} {n N 0 : n 15}. (ii) Yes, 5, 7, 11, 16 can be generated by a set of fewer than 4 elements. Specifically, it is generated

More information

Dictionary: an abstract data type

Dictionary: an abstract data type 2-3 Trees 1 Dictionary: an abstract data type A container that maps keys to values Dictionary operations Insert Search Delete Several possible implementations Balanced search trees Hash tables 2 2-3 trees

More information

An Improved Approximation Algorithm for Requirement Cut

An Improved Approximation Algorithm for Requirement Cut An Improved Approximation Algorithm for Requirement Cut Anupam Gupta Viswanath Nagarajan R. Ravi Abstract This note presents improved approximation guarantees for the requirement cut problem: given an

More information

Review Of Topics. Review: Induction

Review Of Topics. Review: Induction Review Of Topics Asymptotic notation Solving recurrences Sorting algorithms Insertion sort Merge sort Heap sort Quick sort Counting sort Radix sort Medians/order statistics Randomized algorithm Worst-case

More information

Fast Sorting and Selection. A Lower Bound for Worst Case

Fast Sorting and Selection. A Lower Bound for Worst Case Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 0 Fast Sorting and Selection USGS NEIC. Public domain government image. A Lower Bound

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Hierarchy among Automata on Linear Orderings

Hierarchy among Automata on Linear Orderings Hierarchy among Automata on Linear Orderings Véronique Bruyère Institut d Informatique Université de Mons-Hainaut Olivier Carton LIAFA Université Paris 7 Abstract In a preceding paper, automata and rational

More information

Testing Monotone High-Dimensional Distributions

Testing Monotone High-Dimensional Distributions Testing Monotone High-Dimensional Distributions Ronitt Rubinfeld Computer Science & Artificial Intelligence Lab. MIT Cambridge, MA 02139 ronitt@theory.lcs.mit.edu Rocco A. Servedio Department of Computer

More information

P, NP, NP-Complete, and NPhard

P, NP, NP-Complete, and NPhard P, NP, NP-Complete, and NPhard Problems Zhenjiang Li 21/09/2011 Outline Algorithm time complicity P and NP problems NP-Complete and NP-Hard problems Algorithm time complicity Outline What is this course

More information

Constructions with ruler and compass.

Constructions with ruler and compass. Constructions with ruler and compass. Semyon Alesker. 1 Introduction. Let us assume that we have a ruler and a compass. Let us also assume that we have a segment of length one. Using these tools we can

More information

The maximum-subarray problem. Given an array of integers, find a contiguous subarray with the maximum sum. Very naïve algorithm:

The maximum-subarray problem. Given an array of integers, find a contiguous subarray with the maximum sum. Very naïve algorithm: The maximum-subarray problem Given an array of integers, find a contiguous subarray with the maximum sum. Very naïve algorithm: Brute force algorithm: At best, θ(n 2 ) time complexity 129 Can we do divide

More information

Partitions and Covers

Partitions and Covers University of California, Los Angeles CS 289A Communication Complexity Instructor: Alexander Sherstov Scribe: Dong Wang Date: January 2, 2012 LECTURE 4 Partitions and Covers In previous lectures, we saw

More information

Introduction to Divide and Conquer

Introduction to Divide and Conquer Introduction to Divide and Conquer Sorting with O(n log n) comparisons and integer multiplication faster than O(n 2 ) Periklis A. Papakonstantinou York University Consider a problem that admits a straightforward

More information

Analysis of Algorithms CMPSC 565

Analysis of Algorithms CMPSC 565 Analysis of Algorithms CMPSC 565 LECTURES 38-39 Randomized Algorithms II Quickselect Quicksort Running time Adam Smith L1.1 Types of randomized analysis Average-case analysis : Assume data is distributed

More information

An analogy from Calculus: limits

An analogy from Calculus: limits COMP 250 Fall 2018 35 - big O Nov. 30, 2018 We have seen several algorithms in the course, and we have loosely characterized their runtimes in terms of the size n of the input. We say that the algorithm

More information

Outline. Computer Science 331. Cost of Binary Search Tree Operations. Bounds on Height: Worst- and Average-Case

Outline. Computer Science 331. Cost of Binary Search Tree Operations. Bounds on Height: Worst- and Average-Case Outline Computer Science Average Case Analysis: Binary Search Trees Mike Jacobson Department of Computer Science University of Calgary Lecture #7 Motivation and Objective Definition 4 Mike Jacobson (University

More information

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization Jia Wang, Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University Houghton, Michigan

More information

CONSTRUCTION OF THE REAL NUMBERS.

CONSTRUCTION OF THE REAL NUMBERS. CONSTRUCTION OF THE REAL NUMBERS. IAN KIMING 1. Motivation. It will not come as a big surprise to anyone when I say that we need the real numbers in mathematics. More to the point, we need to be able to

More information

5 + 9(10) + 3(100) + 0(1000) + 2(10000) =

5 + 9(10) + 3(100) + 0(1000) + 2(10000) = Chapter 5 Analyzing Algorithms So far we have been proving statements about databases, mathematics and arithmetic, or sequences of numbers. Though these types of statements are common in computer science,

More information

1 Sequences and Summation

1 Sequences and Summation 1 Sequences and Summation A sequence is a function whose domain is either all the integers between two given integers or all the integers greater than or equal to a given integer. For example, a m, a m+1,...,

More information

Dominating Set Counting in Graph Classes

Dominating Set Counting in Graph Classes Dominating Set Counting in Graph Classes Shuji Kijima 1, Yoshio Okamoto 2, and Takeaki Uno 3 1 Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan kijima@inf.kyushu-u.ac.jp

More information

Generating p-extremal graphs

Generating p-extremal graphs Generating p-extremal graphs Derrick Stolee Department of Mathematics Department of Computer Science University of Nebraska Lincoln s-dstolee1@math.unl.edu August 2, 2011 Abstract Let f(n, p be the maximum

More information

Lines With Many Points On Both Sides

Lines With Many Points On Both Sides Lines With Many Points On Both Sides Rom Pinchasi Hebrew University of Jerusalem and Massachusetts Institute of Technology September 13, 2002 Abstract Let G be a finite set of points in the plane. A line

More information

Sequences. Chapter 3. n + 1 3n + 2 sin n n. 3. lim (ln(n + 1) ln n) 1. lim. 2. lim. 4. lim (1 + n)1/n. Answers: 1. 1/3; 2. 0; 3. 0; 4. 1.

Sequences. Chapter 3. n + 1 3n + 2 sin n n. 3. lim (ln(n + 1) ln n) 1. lim. 2. lim. 4. lim (1 + n)1/n. Answers: 1. 1/3; 2. 0; 3. 0; 4. 1. Chapter 3 Sequences Both the main elements of calculus (differentiation and integration) require the notion of a limit. Sequences will play a central role when we work with limits. Definition 3.. A Sequence

More information

Notes on Computer Theory Last updated: November, Circuits

Notes on Computer Theory Last updated: November, Circuits Notes on Computer Theory Last updated: November, 2015 Circuits Notes by Jonathan Katz, lightly edited by Dov Gordon. 1 Circuits Boolean circuits offer an alternate model of computation: a non-uniform one

More information

1 AC 0 and Håstad Switching Lemma

1 AC 0 and Håstad Switching Lemma princeton university cos 522: computational complexity Lecture 19: Circuit complexity Lecturer: Sanjeev Arora Scribe:Tony Wirth Complexity theory s Waterloo As we saw in an earlier lecture, if PH Σ P 2

More information

arxiv: v1 [cs.ds] 9 Apr 2018

arxiv: v1 [cs.ds] 9 Apr 2018 From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Worst case analysis for a general class of on-line lot-sizing heuristics

Worst case analysis for a general class of on-line lot-sizing heuristics Worst case analysis for a general class of on-line lot-sizing heuristics Wilco van den Heuvel a, Albert P.M. Wagelmans a a Econometric Institute and Erasmus Research Institute of Management, Erasmus University

More information

PROPERTIES OF THE ROOTS OF TRIBONACCI-TYPE POLYNOMIALS

PROPERTIES OF THE ROOTS OF TRIBONACCI-TYPE POLYNOMIALS PROPERTIES OF THE ROOTS OF TRIBONACCI-TYPE POLYNOMIALS ALEX DURBIN AND HUEI SEARS Abstract. Consider Tribonacci-type polynomials defined by the following recurrence relation T n(x) = α(x) T n 1 (x)+β(x)

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Yale University Department of Computer Science

Yale University Department of Computer Science Yale University Department of Computer Science Lower Bounds on Learning Random Structures with Statistical Queries Dana Angluin David Eisenstat Leonid (Aryeh) Kontorovich Lev Reyzin YALEU/DCS/TR-42 December

More information

CPSC 121: Models of Computation. Module 9: Proof Techniques (part 2) Mathematical Induction

CPSC 121: Models of Computation. Module 9: Proof Techniques (part 2) Mathematical Induction CPSC 121: Models of Computation Module 9: Proof Techniques (part 2) Mathematical Induction Module 9: Announcements Midterm #2: th Monday November 14, 2016 at 17:00 Modules 5 (from multiple quantifiers

More information

Writing proofs for MATH 51H Section 2: Set theory, proofs of existential statements, proofs of uniqueness statements, proof by cases

Writing proofs for MATH 51H Section 2: Set theory, proofs of existential statements, proofs of uniqueness statements, proof by cases Writing proofs for MATH 51H Section 2: Set theory, proofs of existential statements, proofs of uniqueness statements, proof by cases September 22, 2018 Recall from last week that the purpose of a proof

More information

A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees

A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees Yoshimi Egawa Department of Mathematical Information Science, Tokyo University of

More information

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Mihai Pǎtraşcu, MIT, web.mit.edu/ mip/www/ Index terms: partial-sums problem, prefix sums, dynamic lower bounds Synonyms: dynamic trees 1

More information