
International Journal of Systems Science
Publisher: Taylor & Francis

Estimating optimal parameters for parallel database hardware
MARTIN C. COOPER
Département d'informatique, Université de Tours, Parc de Grandmont, 37200 Tours, France.
Published online: 23 Apr 2007.

To cite this article: MARTIN C. COOPER (1992) Estimating optimal parameters for parallel database hardware, International Journal of Systems Science, 23:1, 119-125, DOI: 10.1080/00207729208949193
To link to this article: http://dx.doi.org/10.1080/00207729208949193

INT. J. SYSTEMS SCI., 1992, VOL. 23, NO. 1, 119-125

Estimating optimal parameters for parallel database hardware

MARTIN C. COOPER†

Relational database operations (selection, projection, join, etc.) can be implemented with 100% parallelism if each relation is stored as a set of bit vectors containing projections on subsets of the bits of a tuple (Ullmann 1988, 1990). This representation is approximate in that spurious tuples, called false drops, may be recovered along with the genuine tuples. In small-scale experiments, the number of projections required to maintain the expected number of false drops at a low, constant value was reported to be linearly proportional to the number of tuples stored in the database (Ullmann 1988). We show that a previous estimate for the number of false drops (Roberts 1979), given in the context of Bloom filters (Bloom 1970), also demonstrates a near-linear behaviour, but seriously underestimates the number of false drops, due to an assumption of independence between the projections which is no longer valid for the parallel database hardware. We give an improved estimate for the number of false drops which agrees well with the experimental results. This allows us to determine optimal parameters for the parallel database hardware.

1. Parallel implementation of relational operations

Let r be a relation (see e.g. Maier 1983). Assuming that the attributes of r are of fixed length, r can be represented by a set of N-bit patterns, where N is the sum of the maximum number of bits required to store each attribute. Furthermore, without loss of generality, r can be considered as a relation with N 1-bit attributes.
Ullmann (1988, 1990) has described how 100% parallelism of the most important relational operations (selection, projection, join, intersection, union) can be achieved by storing a relation r as a set of projections $\pi_{S_j}(r)$, $j = 1, 2, \ldots, J$, where the $S_j$ are overlapping subsets of $\{1, 2, \ldots, N\}$ of fixed size n. If the number of projections J is sufficiently large, r can be recovered from these projections by calculating their join,

$$\bowtie_{j=1}^{J} \pi_{S_j}(r).$$

Although the inequality $r \subseteq \bowtie_{j=1}^{J} \pi_{S_j}(r)$ always holds (Maier 1983), the result may be lossy in the sense that $r \subset \bowtie_{j=1}^{J} \pi_{S_j}(r)$. The extra tuples recovered, which are not in r, are traditionally called false drops in the present context (Roberts 1979).

To give an example of the parallel implementation of a relational operation, consider the union, $r_1 \cup r_2$, of two relations $r_1$, $r_2$. The projections $\pi_{S_j}(r_1 \cup r_2)$ can be calculated using the formula

$$\pi_{S_j}(r_1 \cup r_2) = \pi_{S_j}(r_1) \cup \pi_{S_j}(r_2)$$

from which a possibly lossy version of $r_1 \cup r_2$ can be recovered by

$$r_1 \cup r_2 = \bowtie_{j=1}^{J} \pi_{S_j}(r_1 \cup r_2).$$

Received 2 April 1990. Revised 29 April 1991.
† Département d'informatique, Université de Tours, Parc de Grandmont, 37200 Tours, France.
0020-7721/92 $3.00 © 1992 Taylor & Francis Ltd.
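The storage scheme and the union formula of Section 1 can be illustrated with a minimal Python sketch. It is not the hardware design of Ullmann (1988, 1990). The function names, the tiny parameters (N = 8, n = 4, J = 6), the random choice of the subsets $S_j$ and the use of a Python integer as a bit vector of length $2^n$ are all assumptions made for illustration, and bit positions are numbered from 0 rather than 1.

```python
import random

def project(t, subset):
    """Project the N-bit tuple t (an int) onto the bit positions in subset."""
    out = 0
    for k, pos in enumerate(subset):
        out |= ((t >> pos) & 1) << k
    return out

def store(relation, subsets):
    """Return one bit vector per subset S_j; bit k is set iff k is in pi_{S_j}(r)."""
    vectors = []
    for s in subsets:
        v = 0
        for t in relation:
            v |= 1 << project(t, s)
        vectors.append(v)
    return vectors

def contains(vectors, subsets, t):
    """True iff t belongs to the join of the stored projections."""
    return all((v >> project(t, s)) & 1 for v, s in zip(vectors, subsets))

def recover(vectors, subsets, N):
    """Brute-force join: every N-bit pattern whose projections are all present."""
    return {t for t in range(2 ** N) if contains(vectors, subsets, t)}

# Hypothetical small example (values chosen only to keep the brute force cheap).
rng = random.Random(0)
N, n, J = 8, 4, 6
subsets = [sorted(rng.sample(range(N), n)) for _ in range(J)]
r1 = set(rng.sample(range(2 ** N), 5))
r2 = set(rng.sample(range(2 ** N), 5))

v1, v2 = store(r1, subsets), store(r2, subsets)
v_union = [a | b for a, b in zip(v1, v2)]   # union = bitwise OR of the vectors
recovered = recover(v_union, subsets, N)
false_drops = recovered - (r1 | r2)
print(len(r1 | r2), "tuples stored,", len(false_drops), "false drops")
```

The recovered set always contains $r_1 \cup r_2$. Any additional patterns are the false drops, and their number depends on the subsets drawn.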

If $\pi_{S_j}(r_1)$ and $\pi_{S_j}(r_2)$ are stored in bit vectors $V_1$ and $V_2$ of length $2^n$ (the kth bit of $V_i$ being set to 1 iff the binary representation of k is a tuple in $\pi_{S_j}(r_i)$), then $\pi_{S_j}(r_1) \cup \pi_{S_j}(r_2)$ can be calculated in parallel by ORing the bit vectors $V_1$ and $V_2$. The advantage of working with the projections $\pi_{S_j}(r)$, rather than the original relation r, is the difference between the $O(J2^n)$ OR gates required to OR the bit vectors representing $\pi_{S_j}(r_1)$ and $\pi_{S_j}(r_2)$ in parallel, and the $O(2^N)$ gates which would be required to OR bit vectors representing the original relations $r_1$ and $r_2$. Possible values of n and N are, for example, n = 8 and N = 32. The disadvantage of working with the projections, rather than the original relation, is the presence of false drops.

2. Estimating the number of false drops

Let H be the number of N-bit patterns in the relation r. These H tuples are to be stored in the J projections $\pi_{S_j}(r)$, where each $S_j$ contains n bits. Estimating the number of false drops for given values of n, N, J and H is clearly an important part of the design of the special purpose hardware for relational operations. It allows us to determine the relationship which must exist between these four parameters in order to make false drops rare.

Roberts (1979) has given an estimate $e_{fd}$ of the number of false drops, when r is stored as sets $P_1, \ldots, P_J$ where each $P_j$ is the result of applying a different hash-function to the original relation r. $e_{fd}$ is derived using an assumption of independence of the $P_j$ which is no longer valid when the $P_j$ are replaced by the $\pi_{S_j}(r)$, due to the non-empty intersection of the $S_j$. We rederive $e_{fd}$, as a first approximation to the number of false drops $|\bowtie_{j=1}^{J} \pi_{S_j}(r) - r|$, before improving it by making a more realistic assumption of independence. Let d be the density of a projection $\pi_{S_j}(r)$, i.e.
$$d = (\text{number of } n\text{-tuples in } \pi_{S_j}(r)) / 2^n.$$

As in Ullmann (1988), we assume that r contains H random N-bit patterns. This implies, from basic probability arguments (which are given in detail by Mullin (1983)), that

$$d = 1 - (1 - 1/2^n)^H.$$

Assuming independence of the projections, the expected number of false drops, $e_{fd}$, is equal to $(2^N - H)d^J$ because each of the $2^N - H$ N-bit patterns x, not in r, will be a false drop iff $\pi_{S_j}(x) \in \pi_{S_j}(r)$ for each $j = 1, \ldots, J$, and because each of the J projections has density d. Therefore

$$e_{fd} = (2^N - H)\{1 - (1 - 1/2^n)^H\}^J$$

or equivalently, taking $2^N - H \approx 2^N$ since $H \ll 2^N$,

$$J \approx \frac{N - \log_2 e_{fd}}{-\log_2\{1 - (1 - 1/2^n)^H\}}.$$

This relationship between J and H is plotted in Fig. 1 (as small diamonds) for different values of N and n, in each case with $e_{fd}$ fixed at 1/2 (J is plotted along the x-axis and H along the y-axis to be consistent with the presentation of experimental results in Ullmann (1988)). The results are not very sensitive to the value chosen for $e_{fd}$, in the sense that decreasing $e_{fd}$ from $e_1$ to $e_2$ (where $e_1, e_2 \le 1$) produces an increase in J by no more than a factor of

$$\frac{N - \log_2 e_2}{N - \log_2 e_1},$$

since, for fixed H, J is proportional to $N - \log_2 e_{fd}$. For example, changing $e_{fd}$ from 1/2 to 1/1 produces only a 3% increase in J for the case N = 32, n = 8.

Ullmann (1988) has reported a linear relationship in experimental trials with random N-bit patterns, although he has also noted that H must eventually flatten off as J increases, since H is bounded above by $2^N$. His results are plotted on the same graphs in Fig. 1, as circled dots. For each value of J, what is plotted is the average (over 2 independent trials) of the largest value of H for which there were no false drops. This is not exactly the same as the value of H for which the expected number of false drops, e = 1/2, but it is comparable due to the relative insensitivity of the results to the precise value of e.

Figure 1. Plot of points (J, H) at which the (estimated) number of false drops was 1/2: (a) when N = 32, n = 8; (b) when N = 24, n = 8; (c) when N = 24, n = 12.

The similarity of the shapes of the curves is interesting since no theoretical explanation has previously been offered for the near-linearity of Ullmann's experimental results. The non-linearity of both curves becomes evident only outside the range of the plotted points.

For the range of J and H plotted in Fig. 1, $e_{fd}$ still underestimates the number of false drops by several orders of magnitude. This is due to the non-independence of the projections $\pi_{S_j}(r)$. This non-independence implies that certain N-bit patterns are more likely to occur as false drops than others. To be precise, a pattern which differs in only one bit (or a small number of bits) from a pattern among the H tuples stored in the relation r is much more likely to occur as a false drop than a random N-bit pattern.

We will estimate $e'_{fd}$, the number of false drops differing by a single bit from one of the H tuples in r. Let w be one of the tuples in r and let w' be different from w on only the ith bit. For $i \notin S_j$, $\pi_{S_j}(w') = \pi_{S_j}(w)$ and hence $\pi_{S_j}(w')$ is automatically in $\pi_{S_j}(r)$ since $w \in r$. Thus w' will be a false drop iff $\pi_{S_j}(w') \in \pi_{S_j}(r)$ for each j such that $i \in S_j$.
The probability that w' is a false drop, assuming independence again (but this time only of the $\pi_{S_j}(r)$ such that $i \in S_j$), is

$$\{1 - (1 - 1/2^n)^{H-1}\}^{Jn/N}$$

assuming that bit i occurs in exactly Jn/N of the $S_j$. This follows from the argument above in the derivation of $e_{fd}$, applied to the H - 1 tuples of r other than w. There are H different tuples w and N different bits i, and so the expected number of such false drops w' is

$$e'_{fd} = NH\{1 - (1 - 1/2^n)^{H-1}\}^{Jn/N},$$

or equivalently

$$-N\log_2(NH/e'_{fd}) = Jn\log_2\{1 - (1 - 1/2^n)^{H-1}\}. \qquad (1)$$

If N does not divide Jn, then $e'_{fd}$ is given by

$$e'_{fd} = H\left[N_1\{1 - (1 - 1/2^n)^{H-1}\}^{\lfloor Jn/N \rfloor + 1} + N_2\{1 - (1 - 1/2^n)^{H-1}\}^{\lfloor Jn/N \rfloor}\right]$$

where $N_1 = Jn \bmod N$, $N_2 = N - N_1$ and $\lfloor x \rfloor$ is the greatest integer less than or equal to x. More accurately, a correction term should be subtracted from $e'_{fd}$ to cover the possibility that w' is already in r or that two such false drops w' are identical, but this makes only a negligible difference to $e'_{fd}$ for the most interesting values of N and H (i.e. when $H \ll 2^N$).

The relationship between H and J is plotted in Fig. 1, as solid dots, for $e'_{fd} = 1/2$. These values are much closer to the experimental results than the values of (J, H) given by $e_{fd} = 1/2$. For example, the average percentage error in J has been reduced from 42% to 18% for the case N = 32, n = 8. However, we are still underestimating the number of false drops, due to the following facts:
(a) independence between some of the projections is still assumed;
(b) only those false drops which differ from a tuple of r by a single bit are counted.

Although near-linear for small values of H and J, the graphs (given by $e_{fd} = 1/2$ or $e'_{fd} = 1/2$) start to flatten out as the projections become saturated. This is illustrated in Fig. 2 for the case N = 32, n = 8. The rectangle framed by the broken lines corresponds to the range of values plotted in Fig. 1 (a).

An obvious improvement would be to use a similar calculation to estimate the number $e'_{fd}(b)$ of false drops which differ from a tuple of r in exactly b bits, $i_1, \ldots, i_b$, for $b \ge 1$.
Unfortunately, this requires an assumption of independence between $J_b$ projections, where $J_b$ is the expected number of projections containing at least one of the bits $i_1, \ldots, i_b$. As b increases, $J_b$ increases, approaching the value of J. Having to assume independence between more projections means that $e'_{fd}(b)$ underestimates the number of false drops (differing from a tuple of r in b bits) by an increasing factor as b increases. In fact, for the range of values of N, n, H plotted in Fig. 1, adding $\sum_{b \ge 2} e'_{fd}(b)$ to $e'_{fd} = e'_{fd}(1)$ increased our estimate of the number of false drops by no more than 2%, which in turn produced no more than a 3% increase in the value of J.
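The two estimates of Section 2 can be compared numerically with a short Python sketch. The function names and the sample values of H are assumptions made for illustration. N = 32, n = 8 and the target value 1/2 are taken from the text, and the single-bit estimate uses the form that assumes N divides Jn.

```python
import math

def density(n, H):
    """d = 1 - (1 - 1/2**n)**H, evaluated stably for large H or n."""
    return -math.expm1(H * math.log1p(-2.0 ** -n))

def e_fd(N, n, J, H):
    """Estimate assuming all J projections are independent."""
    return (2 ** N - H) * density(n, H) ** J

def e_fd_single_bit(N, n, J, H):
    """Estimate counting only false drops one bit away from a stored tuple."""
    return N * H * density(n, H - 1) ** (J * n / N)

def J_for_e_fd(N, n, H, e):
    """Value of J (not rounded) at which e_fd equals e."""
    return (math.log2(2 ** N - H) - math.log2(e)) / -math.log2(density(n, H))

def J_for_single_bit(N, n, H, e):
    """Value of J (not rounded) at which the single-bit estimate equals e."""
    return N * math.log2(N * H / e) / (-n * math.log2(density(n, H - 1)))

N, n, e = 32, 8, 0.5
for H in (50, 100, 200):
    j_ind = J_for_e_fd(N, n, H, e)
    j_bit = J_for_single_bit(N, n, H, e)
    print(f"H={H:4d}  J(e_fd)={j_ind:6.2f}  J(single-bit)={j_bit:6.2f}")
```

For these parameters the single-bit estimate calls for more projections than the independence estimate at the same H, which is the direction of the correction described above.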

Figure 2. Plot of points (J, H) at which $e_{fd} = 1/2$ (small diamonds) and at which $e'_{fd} = 1/2$ (solid dots).

3. Optimal choice of hardware parameters

The hardware cost is strongly dominated by $J2^n$ since J bit vectors are stored, each of length $2^n$. If a limit H on the number of N-tuples which will be stored in the relation r is known before the hardware is built, then the parameters n and J can be chosen in order to minimise the cost of the hardware required. We wish to minimise the cost $J2^n$ while ensuring that the expected number of false drops $e'_{fd}$ is small (always bearing in mind that $e'_{fd}$ is only an estimate). To ensure a given value of $e'_{fd}$, J and n must satisfy equation (1). Substituting for J gives the hardware cost as a function of n only:

$$f(n) = \frac{-2^n N\log_2(NH/e'_{fd})}{n\log_2\{1 - (1 - 1/2^n)^{H-1}\}}.$$

Since finding the minimum of f(n) proved intractable analytically, we calculated f(n) by computer for all values of n in the range 1 to N. Surprisingly, for H equal to every power of 2 from $2^3$ to $2^{25}$, the minimum value of f was always attained when $n = 1 + \log_2 H$. Of course, when n was allowed to take on non-integral values, the minimum of f was attained for a value of n close to but not exactly equal to $1 + \log_2 H$.

Choosing $n = 1 + \log_2 H$ ensures that the expected density of a projection,

$$d = 1 - (1 - 1/2^n)^H = 1 - (1 - 1/(2H))^H \approx 1 - 1 + \frac{1}{2} - \frac{1}{2^2\,2!} + \frac{1}{2^3\,3!} - \frac{1}{2^4\,4!} + \cdots = 1 - \exp(-1/2) \approx 0.3935.$$

It follows that

$$J \approx \frac{N(\log_2 N + \log_2 H + \log_2(1/e'_{fd}))}{1.3457\,(1 + \log_2 H)}.$$
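The search over n described in Section 3 can be repeated with a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, and N = 32 and $e'_{fd}$ = 1/2 are assumed because the text does not say which values were used in the scan. The factor $N\log_2(NH/e'_{fd})$ does not depend on n, so these two choices do not affect where the minimum falls.

```python
import math

def cost(n, N, H, e):
    """Hardware cost f(n) obtained by substituting equation (1) into J * 2**n."""
    d = -math.expm1((H - 1) * math.log1p(-2.0 ** -n))  # 1 - (1 - 1/2**n)**(H-1)
    if d >= 1.0:
        # Projection saturated in floating point: no finite J satisfies (1).
        return math.inf
    return -(2.0 ** n) * N * math.log2(N * H / e) / (n * math.log2(d))

def best_n(N, H, e):
    """Integer n in 1..N minimising the cost."""
    return min(range(1, N + 1), key=lambda n: cost(n, N, H, e))

N, e = 32, 0.5                      # assumed values, see the lead-in
results = {k: best_n(N, 2 ** k, e) for k in range(3, 26)}
for k, n in results.items():
    print(f"H = 2^{k:<2d}  best n = {n}")

# Limiting density and the constant appearing in the expression for J.
d_limit = 1 - math.exp(-0.5)
print(round(d_limit, 4), round(-math.log2(d_limit), 4))
```

Each entry of `results` can be compared with $1 + \log_2 H$, and the last line gives the limiting density and the constant $-\log_2 d$ used in the expression for J.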

This value of J is O(N), assuming that $e'_{fd}$ = constant and N < H. The corresponding hardware cost $f = J2^n$ is O(NH). If $e'_{fd}$ were an accurate estimate of the number of false drops, this would imply that the number of memory bits, $J2^n = O(NH)$, required to store the projections $\pi_{S_j}(r)$ would be asymptotically optimal since it is of the same order of magnitude as NH, the number of bits in the original relation r. NH is clearly a lower bound on the number of memory bits required to store the relation. Figure 3 gives the ratio $\rho = J2^n/NH$, which is the number of memory bits in the parallel architecture divided by the number of bits in the original relation, for Ullmann's experimental data (plotted as circled dots in Fig. 1 (a), (b) and (c)). Ratios of (a) 3 4, (b) 3 4 and (c) 4 27 were attained, and in each case the ratio is decreasing, indicating that even lower ratios are attainable for larger values of J and H.

Figure 3. Tables indicating $\rho$, the ratio of the number of memory bits in the parallel hardware to the number of data bits stored, for various values of J. Each entry in table (a), (b) or (c) corresponds to a data point (plotted as a circled dot) in Fig. 1 (a), (b) or (c), respectively.

4. Conclusion

We have provided an estimate of the number of false drops when retrieving information from parallel database hardware, which has allowed us to estimate the optimal parameters for this hardware. Our estimate implies that the hardware size increases linearly with the number of data bits to be stored. An even more realistic model would take into account the non-randomness of the tuples stored in the database (Ahad et al. 1989), as well as analysing the results of relational operations rather than just considering the recovery of the data stored in the hardware.
The near-linearity of the curves given by $e_{fd} = 1/2$ (within the range of values plotted in Fig. 1) may be of interest in the design of Bloom filters, since Ramakrishna (1989) has shown that $e_{fd}$ is an accurate estimate of the number of filter errors (false drops).

REFERENCES
AHAD, R., RAO, K. V., and McLEOD, D., 1989, ACM Trans. Databases, 14, 28.
BLOOM, B. H., 1970, Comm. ACM, 13, 422.
MAIER, D., 1983, The Theory of Relational Databases (London: Pitman).
MULLIN, J. K., 1983, Comm. ACM, 26, 570.
RAMAKRISHNA, M. V., 1989, Comm. ACM, 32, 1237.
ROBERTS, C. S., 1979, Proc. Inst. elect. electron. Engrs, 67, 1624.
ULLMANN, J. R., 1988, Comput. J., 31, 147; 1990, Proc. Instn elect. Engrs, Pt E, 137, 283.
