arXiv: v3 [cs.DS] 22 Mar 2016


A Shifting Bloom Filter Framework for Set Queries

Tong Yang (Peking University, China), Yuankun Zhong (Nanjing University, China), Gaogang Xie (ICT, CAS, China), Alex X. Liu (Michigan State University), Qiaobin Fu (Boston University, USA), Xiaoming Li (Peking University, China), Muhammad Shahzad (North Carolina State University), Zi Li (Nanjing University, China)

This paper will appear in VLDB 2016.

ABSTRACT

Set queries are fundamental operations in computer systems and applications. This paper addresses the fundamental problem of designing a probabilistic data structure that can quickly process set queries using a small amount of memory. We propose a Shifting Bloom Filter (ShBF) framework for representing and querying sets. We demonstrate the effectiveness of ShBF using three types of popular set queries: membership, association, and multiplicity queries. The key novelty of ShBF lies in encoding the auxiliary information of a set element in a location offset. In contrast, prior BF based set data structures allocate additional memory to store auxiliary information. To evaluate ShBF in comparison with prior art, we conducted experiments using real-world network traces. Results show that ShBF significantly advances the state-of-the-art on all three types of set queries.

1. INTRODUCTION

1.1 Motivations

Set queries, such as membership queries, association queries, and multiplicity queries, are fundamental operations in computer systems and applications. Membership queries check whether an element is a member of a given set. Network applications, such as IP lookup, packet classification, and regular expression matching, often involve membership queries. Association queries identify which set(s) among a pair of sets contain a given element. Network architectures such as distributed servers often use association queries.
For example, when data is stored distributively on two servers and the popular content is replicated over both servers to achieve load balancing, for any incoming query, the gateway needs to identify the server(s) that contain the data corresponding to that query. Multiplicity queries check how many times an element appears in a multi-set. A multi-set allows elements to appear more than once. Network measurement applications, such as measuring flow sizes, often use multiplicity queries.

This paper addresses the fundamental problem of designing a probabilistic data structure that can quickly process set queries, such as the above-mentioned membership, association, and multiplicity queries, using a small amount of memory. Set query processing speed is critical for many systems and applications, especially for networking applications, as packets need to be processed at wire speed. Memory consumption is also critical because small memory consumption may allow the data structure to be stored in SRAM, which is an order of magnitude faster than DRAM.

Widely used set data structures are the standard Bloom Filter (BF) [3] and the counting Bloom Filter (CBF) [11]. Let $h_1(.), \ldots, h_k(.)$ be $k$ independent hash functions with uniformly distributed outputs. Given a set $S$, BF constructs an array $B$ of $m$ bits, where each bit is initialized to 0, and for each element $e \in S$, BF sets the $k$ bits $B[h_1(e)\%m], \ldots, B[h_k(e)\%m]$ to 1. To process a membership query of whether element $e$ is in $S$, BF returns true if all $k$ corresponding bits are 1 (i.e., returns $\wedge_{i=1}^{k} B[h_i(e)\%m]$). BF has no false negatives (FNs), i.e., it never says that $e \notin S$ when actually $e \in S$. However, BF has false positives (FPs), i.e., it may say that $e \in S$ when actually $e \notin S$ with a certain probability. Note that BF does not support element deletion. CBF overcomes this shortcoming by replacing each bit in BF by a counter. Given a set of elements, CBF first constructs an array $C$ of $m$ counters, where each counter is initialized to 0. For each element $e$ in $S$, for each $1 \le i \le k$, CBF increments $C[h_i(e)\%m]$ by 1.
To process a membership query of whether element $e$ is in set $S$, CBF returns true if all $k$ corresponding counters are at least 1 (i.e., returns $\wedge_{i=1}^{k} (C[h_i(e)\%m] \ge 1)$). To delete an element $e$ from $S$, for each $1 \le i \le k$, CBF decrements $C[h_i(e)\%m]$ by 1.

1.2 Proposed Approach

In this paper, we propose a Shifting Bloom Filter (ShBF) framework for representing and querying sets. Let $h_1(.), \ldots, h_k(.)$ be $k$ independent hash functions with uniformly distributed outputs. In the construction phase, ShBF first constructs an array $B$ of $m$ bits, where each bit is initialized to 0. We observe that in general a set data structure needs to store two types of information for each element $e$: (1) existence information, i.e., whether $e$ is in a set, and (2) auxiliary information, i.e., some additional information such as $e$'s counter (i.e., multiplicity) or which set $e$ is in. For each element $e$, we encode its existence information in $k$ hash values $h_1(e)\%m, \ldots, h_k(e)\%m$, and its auxiliary information in an offset $o(e)$. Instead of, or in addition to, setting the $k$ bits at locations $h_1(e)\%m, \ldots, h_k(e)\%m$ to 1, we set the bits at locations $(h_1(e)+o(e))\%m, \ldots, (h_k(e)+o(e))\%m$ to 1. For different set queries, the offset has different values.

In the query phase, to query an element $e$, we first calculate the following $k$ locations: $h_1(e)\%m, \ldots, h_k(e)\%m$. Let $c$ be the maximum value of all offsets. For each $1 \le i \le k$, we first read the $c$ bits $B[h_i(e)\%m], B[(h_i(e)+1)\%m], \ldots, B[(h_i(e)+c-1)\%m]$ and then calculate the existence and auxiliary information about $e$ by analyzing where 1s appear in these $c$ bits. To minimize the number of memory accesses, we extend the number of bits in ShBF to $m + c$; thus, we need $k\lceil c/w \rceil$ memory accesses in the worst case, where $w$ is the word size. Figure 1 illustrates our ShBF framework.

Figure 1: Shifting Bloom Filter framework.

We demonstrate the effectiveness of ShBF using three types of popular set queries: membership, association, and multiplicity queries.

1.2.1 Membership Queries

Such queries only deal with the existence information of each element, which is encoded in $k$ random positions in array $B$. To leverage our ShBF framework, we treat $k/2$ positions as the existence information and the other $k/2$ positions as the auxiliary information, assuming $k$ is an even number for simplicity. Specifically, the offset function is $o(.) = h_{k/2+1}(.)\%(\bar{w}-1) + 1$, where $h_{k/2+1}(.)$ is another hash function with uniformly distributed outputs and $\bar{w}$ is a function of machine word size $w$. In the construction phase, for each element $e \in S$, we set both the $k/2$ bits $B[h_1(e)\%m], \ldots, B[h_{k/2}(e)\%m]$ and the $k/2$ bits $B[h_1(e)\%m + o(e)], \ldots, B[h_{k/2}(e)\%m + o(e)]$ to 1. In the query phase, for an element $e$, if all these $k$ bits are 1, then we output $e \in S$; otherwise, we output $e \notin S$.
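The membership scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash family `h` (BLAKE2b salted with the index) is a hypothetical stand-in for the independent hash functions, the default `w_bar = 57` is an arbitrary illustrative value, and one byte is used per bit for readability. Following the text, the array is extended past $m$ so that shifted positions never wrap around.

```python
import hashlib


def h(e: str, i: int) -> int:
    """Hypothetical i-th hash function: BLAKE2b salted with the index i."""
    digest = hashlib.blake2b(e.encode("utf-8"), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


class ShBFM:
    """Sketch of ShBF for membership queries (k is assumed to be even)."""

    def __init__(self, m: int, k: int, w_bar: int = 57):
        if k < 2 or k % 2 != 0 or w_bar < 2:
            raise ValueError("need even k >= 2 and w_bar >= 2")
        self.m, self.k, self.w_bar = m, k, w_bar
        # One byte per bit for readability; w_bar extra slots so that
        # h_i(e) % m + o(e) never runs past the end of the array.
        self.B = bytearray(m + w_bar)

    def offset(self, e: str) -> int:
        # o(e) = h_{k/2+1}(e) % (w_bar - 1) + 1, so 1 <= o(e) <= w_bar - 1.
        return h(e, self.k // 2 + 1) % (self.w_bar - 1) + 1

    def _pairs(self, e: str):
        # Yield (existence position, shifted position) for i = 1..k/2.
        o = self.offset(e)
        for i in range(1, self.k // 2 + 1):
            base = h(e, i) % self.m
            yield base, base + o

    def insert(self, e: str) -> None:
        for a, b in self._pairs(e):
            self.B[a] = 1
            self.B[b] = 1

    def query(self, e: str) -> bool:
        # all() stops at the first pair that is not fully set.
        return all(self.B[a] and self.B[b] for a, b in self._pairs(e))


f = ShBFM(m=1024, k=8)
f.insert("10.0.0.1")
print(f.query("10.0.0.1"))  # True: inserted elements are never reported absent
```

Only $k/2 + 1$ hash values are computed per operation, and each pair of positions lies within `w_bar` bits of each other, which is what lets a real bit-packed implementation fetch both bits in one memory access.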
In terms of false positive rate (FPR), our analysis shows that ShBF is very close to BF with $k$ hash functions. In terms of performance, ShBF is about two times faster than BF for two main reasons. First, ShBF reduces the computational cost by almost half because the number of hash functions that ShBF needs to compute is almost half of what BF needs to compute. Second, ShBF reduces the number of memory accesses by half because although both ShBF and BF write $k$ bits into the array $B$, when querying element $e$, by one memory access, ShBF obtains two bits about $e$ whereas BF obtains only one bit about $e$.

1.2.2 Association Queries

For this type of queries with two sets $S_1$ and $S_2$, for elements in $S_1 \cup S_2$, there are three cases: (1) $e \in S_1 - S_2$, (2) $e \in S_1 \cap S_2$, and (3) $e \in S_2 - S_1$. For the first case, i.e., $e \in S_1 - S_2$, the offset function is $o(e) = 0$. For the second case, i.e., $e \in S_1 \cap S_2$, the offset function is $o(e) = o_1(e) = h_{k+1}(e)\%((\bar{w}-1)/2) + 1$, where $h_{k+1}(.)$ is another hash function with uniformly distributed outputs and $\bar{w}$ is a function of machine word size $w$. For the third case, i.e., $e \in S_2 - S_1$, the offset function is $o(e) = o_2(e) = o_1(e) + h_{k+2}(e)\%((\bar{w}-1)/2) + 1$, where $h_{k+2}(.)$ is yet another hash function with uniformly distributed outputs.

In the construction phase, for each element $e \in S_1 \cup S_2$, we set the $k$ bits $B[h_1(e)\%m + o(e)], \ldots, B[h_k(e)\%m + o(e)]$ to 1 using the appropriate value of $o(e)$ as just described for the three cases. In the query phase, given an element $e \in S_1 \cup S_2$, for each $1 \le i \le k$, we read the 3 bits $B[h_i(e)\%m]$, $B[h_i(e)\%m + o_1(e)]$, and $B[h_i(e)\%m + o_2(e)]$. If all the $k$ bits $B[h_1(e)\%m], \ldots, B[h_k(e)\%m]$ are 1, then $e$ may belong to $S_1 - S_2$. If all the $k$ bits $B[h_1(e)\%m + o_1(e)], \ldots, B[h_k(e)\%m + o_1(e)]$ are 1, then $e$ may belong to $S_1 \cap S_2$. If all the $k$ bits $B[h_1(e)\%m + o_2(e)], \ldots, B[h_k(e)\%m + o_2(e)]$ are 1, then $e$ may belong to $S_2 - S_1$. There are a few other possibilities, discussed later in Section 4.2, that ShBF takes into account when answering association queries. In comparison, the standard BF based association query scheme, namely iBF, constructs a BF for each set.
In terms of accuracy, iBF is prone to false positives whenever it declares an element $e \in S_1 \cup S_2$ in a query to be in $S_1 \cap S_2$, whereas ShBF achieves an FPR of zero. In terms of performance, ShBF is almost twice as fast as iBF because iBF needs $2k$ hash functions and $2k$ memory accesses per query, whereas ShBF needs only $k + 2$ hash functions and $k$ memory accesses per query.

1.2.3 Multiplicity Queries

For multiplicity queries, for each element $e$ in a multi-set $S$, the offset function is $o(.) = c(e) - 1$, where $c(e)$ is $e$'s counter (i.e., the number of occurrences of $e$ in $S$). In the construction phase, for each element $e$, we set the $k$ bits $B[h_1(e)\%m + c(e) - 1], \ldots, B[h_k(e)\%m + c(e) - 1]$ to 1. In the query phase, for an element $e$, for each $1 \le i \le k$, we read the $c$ bits $B[h_i(e)\%m], B[h_i(e)\%m + 1], \ldots, B[h_i(e)\%m + c - 1]$, where $c$ is the maximum number of times that an element can appear in $S$. For these $ck$ bits, for each $1 \le j \le c$, if all the $k$ bits $B[h_1(e)\%m + j - 1], \ldots, B[h_k(e)\%m + j - 1]$ are 1, then we output $j$ as one possible value of $c(e)$. Due to false positives, we may output multiple possible values.

1.3 Novelty and Advantages over Prior Art

The key novelty of ShBF lies in encoding the auxiliary information of a set element in its location by the use of offsets. In contrast, prior BF based set data structures allocate additional memory to store such auxiliary information. To evaluate our ShBF framework in comparison with prior art, we conducted experiments using real-world network traces. Our results show that ShBF significantly advances the state-of-the-art on all three types of set queries: membership, association, and multiplicity. For membership queries, in comparison with the standard BF, ShBF has about the same FPR but is about 2 times faster; in comparison with 1MemBF [17], which represents the state-of-the-art in membership query BFs, ShBF has 10% to 19% lower FPR and faster query speed. For association queries, in comparison with iBF, ShBF has 1.47 times higher probability of a clear answer, and has 1.4 times faster query speed.
For multiplicity queries, in comparison with Spectral BF [], which represents the state-of-the-art in multiplicity

query BFs, ShBF has a higher correctness rate and the query speeds are comparable.

2. RELATED WORK

We now review related work on the three types of set queries: membership, association, and multiplicity queries, which are mostly based on Bloom Filters. Elaborate surveys of the work on Bloom Filters can be found in [5, 1, 1, 19].

2.1 Membership Queries

Prior work on membership queries focuses on optimizing BF in terms of the number of hash operations and the number of memory accesses. Fan et al. proposed the Cuckoo filter and found that it is more efficient in terms of space and time compared to BF [10]. This improvement comes at the cost of a non-negligible probability of failing when inserting an element. To reduce the number of hash computations, Kirsch et al. proposed to use two hash functions $h_1(.)$ and $h_2(.)$ to simulate $k$ hash functions $(h_1(.) + i \cdot h_2(.))\%m$, where $1 \le i \le k$; but the cost is increased FPR [13]. To reduce the number of memory accesses, Qiao et al. proposed to confine the output of the hash functions within a certain number of machine words, which reduces the number of memory accesses during membership queries; but the cost again is increased FPR [17]. In contrast, ShBF reduces the number of hash operations and memory accesses by about half while keeping FPR about the same as BF.

2.2 Association Queries

Prior work on association queries focuses on identifying the set, among a group of pair-wise disjoint sets, to which an element belongs. A straightforward solution is iBF, which builds one BF for each set. To query an element, iBF generates a membership query for each set's BF and finds out which set(s) the unknown element is in. This solution is used in the Summary-Cache Enhanced ICP protocol [11]. Other notable solutions include kBF [0], Bloomtree [1], Bloomier [], Coded BF [1], Combinatorial BF [1], and SVBF [15]. A common shortcoming of all existing schemes is that if any pair of sets in the group of sets is not disjoint, these schemes do not function correctly.
In contrast, ShBF does not require the sets to be disjoint.

2.3 Multiplicity Queries

BF cannot process multiplicity queries because it only tells whether an element is in a set. Spectral BF, which was proposed by Cohen and Matias, represents the state-of-the-art scheme for multiplicity queries []. There are three versions of Spectral BF. The first version proposes some modifications to CBF to record the multiplicities of elements. The second version increases only the counter with the minimum value when inserting an element. This version reduces FPR at the cost of not supporting updates. The third version minimizes space for counters with a secondary spectral BF and auxiliary tables, which makes querying and updating procedures time consuming and more complex. Aguilar-Saborit et al. proposed Dynamic Count Filters (DCF), which combine the ideas of spectral BF and CBF, for multiplicity queries []. DCF uses two filters: the first filter uses fixed size counters and the second filter dynamically adjusts counter sizes. The use of two filters degrades query performance. Another well-known scheme for multiplicity queries is the Count-Min (CM) Sketch [9]. We will describe the details of CM sketch and how our ShBF framework can be applied to CM sketch in a later section.

3. MEMBERSHIP QUERIES

In this section, we first present the construction and query phases of ShBF for membership queries. Membership queries are the traditional use of a BF. We use ShBF_M to denote the ShBF scheme for membership queries. Second, we describe the updating method of ShBF_M. Third, we derive the FPR formula of ShBF_M. Fourth, we compare the performance of ShBF_M with that of BF. Last, we present a generalization of ShBF_M. Table 1 summarizes the symbols and abbreviations used in this paper.
Table 1: Symbols & abbreviations used in the paper

Symbol        Description
m             size of a Bloom Filter
n             # of elements of a Bloom Filter
k             # of hash functions of a Bloom Filter
k_opt         the optimal value of k
S             a set
e             one element of a set
u             one element of a set
h_i(s)        the i-th hash function
FP            false positive
FPR           false positive rate
f             the FP rate of a Bloom Filter
p'            the probability that one bit is still 0 after inserting all elements into BF
BF            standard Bloom Filter
iBF           individual BF: the solution that builds one individual BF per set
ShBF          Shifting Bloom Filters
ShBF_M        Shifting Bloom Filters for membership qrs.
ShBF_A        Shifting Bloom Filters for association qrs.
ShBF_×        Shifting Bloom Filters for multiplicities qrs.
Qps           queries per second
multi-set     a generalization of the notion of a set in which members can appear more than once
o(.)          offset(.), referring to the offset value for a given input
w             # of bits in a machine word
\bar{w}       the maximum value of offset(.) for membership query of a single set
c             the maximum number of times an element can occur in a multi-set

3.1 ShBF_M Construction Phase

The construction phase of ShBF_M proceeds in three steps. Let $h_1(.), h_2(.), \ldots, h_{k/2+1}(.)$ be $k/2 + 1$ independent hash functions with uniformly distributed outputs. First, we construct an array $B$ of $m$ bits, where each bit is initialized to 0. Second, to store the existence information of an element $e$ of set $S$, we calculate $k/2$ hash values $h_1(e)\%m, h_2(e)\%m, \ldots, h_{k/2}(e)\%m$. To leverage our ShBF framework, we also calculate the offset value for the element $e$ of set $S$ as its auxiliary information, namely $o(e) = h_{k/2+1}(e)\%(\bar{w}-1) + 1$. We will later discuss how to choose an appropriate value for $\bar{w}$. Third, we set the $k/2$ bits $B[h_1(e)\%m], \ldots, B[h_{k/2}(e)\%m]$ to 1 and the other $k/2$ bits $B[h_1(e)\%m + o(e)], \ldots, B[h_{k/2}(e)\%m + o(e)]$ to 1. Note

that $o(e) \ne 0$ because if $o(e) = 0$, the two bits $B[h_i(e)\%m]$ and $B[h_i(e)\%m + o(e)]$ are the same bit for any value of $i$ in the range $1 \le i \le k/2$. For the construction phase, the maximum number of hash operations is $k/2 + 1$. Figure 2 illustrates the construction phase of ShBF_M.

Figure 2: Illustration of ShBF_M construction phase.

We now discuss how to choose a proper value for $\bar{w}$ so that for any $1 \le i \le k/2$, we can access both bits $B[h_i(e)\%m]$ and $B[h_i(e)\%m + o(e)]$ in one memory access. Note that modern architectures such as x86 platform CPUs can access data starting at any byte, i.e., can access data aligned on any boundary, not just on word boundaries. Let $B[h_i(e)\%m]$ be the $j$-th bit of a byte, where $1 \le j \le 8$. To access bit $B[h_i(e)\%m]$, we always need to read the $j - 1$ bits before it. To access both bits $B[h_i(e)\%m]$ and $B[h_i(e)\%m + o(e)]$ in one memory access, we need to access $j - 1 + \bar{w}$ bits in one memory access. Thus, $j - 1 + \bar{w} \le w$, which means $\bar{w} \le w + 1 - j$. When $j = 8$, $w + 1 - j$ has the minimum value of $w - 7$. Thus, we choose $\bar{w} \le w - 7$, as it guarantees that we can read both bits $B[h_i(e)\%m]$ and $B[h_i(e)\%m + o(e)]$ in one memory access.

3.2 ShBF_M Query Phase

Given a query $e$, we first read the two bits $B[h_1(e)\%m]$ and $B[h_1(e)\%m + o(e)]$ in one memory access. If both bits are 1, then we continue to read the next two bits $B[h_2(e)\%m]$ and $B[h_2(e)\%m + o(e)]$ in one memory access; otherwise we output that $e \notin S$ and the query process terminates. If for all $1 \le i \le k/2$, $B[h_i(e)\%m]$ and $B[h_i(e)\%m + o(e)]$ are 1, then we output $e \in S$. For the query phase, the maximum number of memory accesses is $k/2$.

3.3 ShBF_M Updating

Just as BF is extended to handle updates by replacing each bit with a counter, we can extend ShBF_M to handle updates by replacing each bit with a counter. We use CShBF_M to denote this counting version of ShBF_M. Let $C$ denote the array of $m$ counters. To insert an element $e$, instead of setting $k$ bits to 1, we increment each of the corresponding $k$ counters by 1; that is, we increment both $C[h_i(e)\%m]$ and $C[h_i(e)\%m + o(e)]$ by 1 for all $1 \le i \le k/2$. To delete an element $e \in S$, we decrement both $C[h_i(e)\%m]$ and $C[h_i(e)\%m + o(e)]$ by 1 for all $1 \le i \le k/2$.
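A minimal Python sketch of the counting variant described above follows. It is an illustration, not the authors' implementation: the hash family `h` is a hypothetical stand-in, the default `w_bar = 57` is an assumed illustrative value, counters are plain Python integers rather than fixed-width fields, and the array is padded by `w_bar` slots so shifted positions stay in range.

```python
import hashlib


def h(e: str, i: int) -> int:
    """Hypothetical i-th hash function: BLAKE2b salted with the index i."""
    digest = hashlib.blake2b(e.encode("utf-8"), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


class CShBFM:
    """Sketch of the counting version of ShBF_M (k is assumed to be even)."""

    def __init__(self, m: int, k: int, w_bar: int = 57):
        if k < 2 or k % 2 != 0 or w_bar < 2:
            raise ValueError("need even k >= 2 and w_bar >= 2")
        self.m, self.k, self.w_bar = m, k, w_bar
        self.C = [0] * (m + w_bar)  # padded so base + o(e) stays in range

    def _slots(self, e: str):
        # The k counter positions: h_i(e) % m and h_i(e) % m + o(e), i = 1..k/2.
        o = h(e, self.k // 2 + 1) % (self.w_bar - 1) + 1
        for i in range(1, self.k // 2 + 1):
            base = h(e, i) % self.m
            yield base
            yield base + o

    def insert(self, e: str) -> None:
        for s in self._slots(e):
            self.C[s] += 1

    def delete(self, e: str) -> None:
        # Only meaningful for elements that were inserted earlier.
        for s in self._slots(e):
            self.C[s] -= 1

    def query(self, e: str) -> bool:
        return all(self.C[s] >= 1 for s in self._slots(e))
```

Because every insertion increments exactly the counters that the matching deletion decrements, deleting an element never disturbs the membership answer for the elements that remain.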
In most applications, 4 bits for a counter are enough. Therefore, we can further reduce the number of memory accesses for updating CShBF_M. Similar to the analysis above, if we choose $\bar{w} \le (w - 7)/z$, where $z$ is the number of bits for each counter, we can guarantee to access both $C[h_i(e)\%m]$ and $C[h_i(e)\%m + o(e)]$ in one memory access. Consequently, one update of CShBF_M needs only $k/2$ memory accesses.

Due to the replacement of bits by counters, array $C$ in CShBF_M uses much more memory than array $B$ in ShBF_M. To have the benefits of both fast query processing and small memory consumption, we can maintain both ShBF_M and CShBF_M, but store array $B$ in fast SRAM and array $C$ in DRAM. Note that SRAM is at least an order of magnitude faster than DRAM. Array $B$ in fast SRAM is for processing queries and array $C$ in slow DRAM is only for updating. After each update, we synchronize array $C$ with array $B$. The synchronization is straightforward: when we insert an element, we insert it into both array $C$ and array $B$; when we delete an element, we first delete it from $C$, and if at least one of the corresponding counters becomes 0, we clear the corresponding bit in $B$ to 0.

3.4 ShBF_M Analysis

We now calculate the FPR of ShBF_M, denoted as $f_{ShBF_M}$. Then, we calculate the minimum value of $\bar{w}$ so that ShBF_M can achieve almost the same FPR as BF. Last, we calculate the optimum value of $k$ that minimizes $f_{ShBF_M}$.

3.4.1 False Positive Rate

We calculate the false positive rate of ShBF_M in the following theorem.

Theorem 1. The FPR of ShBF_M for a set of $n$ elements is calculated as follows:

$$f_{ShBF_M} \approx (1-p)^{k/2}\left(1 - p + \frac{1}{\bar{w}-1}\,p^2\right)^{k/2} \quad (1)$$

where $p = e^{-nk/m}$.

Proof. Let $p'$ represent the probability that one bit (suppose it is at position $i$) in the filter $B$ is still 0 after inserting information of all $n$ elements. For an arbitrary element $e$, if $h_i(e)\%m$ does not point to $i$ or $i - o(e)$, where $o(e) = h_{k/2+1}(e)\%(\bar{w}-1) + 1$, then the bit at position $i$ will still be 0; thus $p'$ is given by the following equation.

$$p' = \left(\frac{m-2}{m}\right)^{kn/2} = \left(1 - \frac{2}{m}\right)^{kn/2} \quad (2)$$

When $m$ is large, we can use the identity $\left(1 - \frac{1}{x}\right)^{-x} \approx e$ to get the following equation for $p'$.
$$p' = \left(1 - \frac{2}{m}\right)^{kn/2} = \left(\left(1 - \frac{2}{m}\right)^{-m/2}\right)^{-nk/m} \approx e^{-nk/m} \quad (3)$$

Let $X$ and $Y$ be the random variables for the events that the bit at position $h_i(.)$ and the bit at position $h_i(.) + h_{k/2+1}(.)$ is 1, respectively. Thus, $P\{X\} = 1 - p'$. Suppose we look at a hash pair $\langle h_i, h_{k/2+1} \rangle$; we want to calculate $P\{XY\}$. As $P\{XY\} = P\{X\} \times P\{Y|X\}$, next we calculate $P\{Y|X\}$. There are $\bar{w} - 1$ bits on the left side of position $h_i$. The 1s in these $\bar{w} - 1$ bits could be due to the first hash function in a pair and/or due to the second hash function in the pair. In other words, event $X$ happens because a hash pair $\langle h_j, h_{k/2+1} \rangle$ sets the position $h_i$ to 1 during the construction phase. When event $X$ happens, there are two cases:

1. The event $X_1$ happens, i.e., the position $h_i$ is set to 1 by $h_{k/2+1}$, i.e., the left $\bar{w} - 1$ bits cause $h_i$ to be 1, making $X$ and $Y$ independent. Thus, in this case $P\{Y\} = 1 - p'$.

2. The event $X_2$ happens, i.e., the position $h_i$ is set to 1 by $h_j$.

As $P\{X_1\} + P\{X_2\} = 1$, we have $P\{Y|X\} = P\{Y|X, X_1\} \times P\{X_1\} + P\{Y|X, X_2\} \times P\{X_2\}$. Next, we compute $P\{X_1\}$ and $P\{X_2\}$. As there are $\bar{w} - 1$ bits on the left side of position $h_i$, there are $\binom{\bar{w}-1}{1}$ combinations, i.e., $\binom{\bar{w}-1}{1} = \bar{w} - 1$. The probability that any bit of the $\bar{w} - 1$ bits is 1 is $1 - p'$. When one bit in the $\bar{w} - 1$ bits is 1, the probability that this bit sets the bit at location $h_i$ to 1 using the hash function $h_{k/2+1}$ is $\frac{1}{\bar{w}-1}$. Therefore, $P\{X_1\} = \binom{\bar{w}-1}{1}(1 - p')\frac{1}{\bar{w}-1} = 1 - p'$. Consequently, $P\{X_2\} = 1 - P\{X_1\} = p'$. Again there are two cases:

1. If the bit which $h_i(x)$ points to is set to 1 by the left 1s, $X$ and $Y$ are independent, and thus $P\{Y\} = \binom{\bar{w}-1}{1}(1 - p')\frac{1}{\bar{w}-1} = 1 - p'$.

2. If the bit which $h_i(x)$ points to is not set to 1 by the left 1s, then it must set one bit of the latter $\bar{w} - 1$ bits to 1. This case will cause one bit of the latter $\bar{w} - 1$ bits after position $h_i$ to be 1. In this case, there are the following two situations for the second hashing $h_i + h_{k/2+1}$: (a) when the second hash points to this bit, the probability is $\frac{1}{\bar{w}-1} \times 1$; (b) otherwise, the probability is $\left(1 - \frac{1}{\bar{w}-1}\right) \times (1 - p')$.

When the second case above happens, $P\{Y|X, X_2\}$ is given by the following equation.

$$P\{Y|X, X_2\} = (1 - p')\frac{\bar{w}-2}{\bar{w}-1} + \frac{1}{\bar{w}-1} = 1 - \frac{\bar{w}-2}{\bar{w}-1}\,p' \quad (4)$$

Integrating the two cases, we can compute $P\{Y|X\}$ as follows.

$$P\{Y|X\} = (1 - p')(1 - p') + p'\left(1 - \frac{\bar{w}-2}{\bar{w}-1}\,p'\right) \quad (5)$$

The probability that all the first hashes point to bits that are 1 is $(1 - p')^{k/2}$. The probability that all the second hashes point to bits that are 1 is the $k/2$-th power of Equation (5). Thus, the overall FPR of ShBF_M is given by the following equation.

$$f_{ShBF_M} = (1 - p')^{k/2}\left((1 - p')(1 - p') + p'\left(1 - \frac{\bar{w}-2}{\bar{w}-1}\,p'\right)\right)^{k/2} = (1 - p')^{k/2}\left(1 - p' + \frac{1}{\bar{w}-1}\,p'^2\right)^{k/2} \quad (6)$$

Note that when $\bar{w} \to \infty$, this formula becomes the formula of the FPR of BF. Let us represent $e^{-nk/m}$ by $p$. Thus, according to Equation (3), $p' \approx p$. Consequently, we get:

$$f_{ShBF_M} \approx (1 - p)^{k/2}\left(1 - p + \frac{1}{\bar{w}-1}\,p^2\right)^{k/2}$$

Note that the above calculation of FPRs is based on the original Bloom's FPR formula [3]. In 2008, Bose et al.
pointed out that Bloom's formula [3] is slightly flawed and gave a new FPR formula []. Specifically, Bose et al. explained that the second independence assumption needed to derive $f_{Bloom}$ is too strong and does not hold in general, resulting in an underestimation of the FPR. In 2010, Christensen et al. further pointed out that Bose's formula is also slightly flawed and gave another FPR formula [7]. Although Christensen's formula is final, it cannot be used to compute the optimal value of $k$, which makes it of little practical use. Although Bloom's formula underestimates the FPR, both studies pointed out that the error of Bloom's formula is negligible. Therefore, our calculation of FPRs is still based on Bloom's formula.

3.4.2 Optimizing System Parameters

Minimum Value of $\bar{w}$: Recall that we proposed to use $\bar{w} \le w - 7$. According to this inequality, $\bar{w} \le 25$ for 32-bit architectures and $\bar{w} \le 57$ for 64-bit architectures. Next, we investigate the minimum value of $\bar{w}$ for ShBF_M to achieve the same FPR as BF. We plot $f_{ShBF_M}$ as a function of $\bar{w}$ in Figures 3(a) and 3(b). Figure 3(a) plots $f_{ShBF_M}$ vs. $\bar{w}$ for $n = 10000$, a fixed $m$, and three values of $k$, and Figure 3(b) plots $f_{ShBF_M}$ vs. $\bar{w}$ for $n = 10000$, $k = 10$, and three values of $m$. The horizontal solid lines in these two figures plot the FPR of BF. From these two figures, we observe that when $\bar{w} > 20$, the FPR of ShBF_M becomes almost equal to the FPR of BF. Therefore, to achieve a similar FPR as BF, $\bar{w}$ needs to be larger than 20. Thus, by using $\bar{w} = 25$ for 32-bit and $\bar{w} = 57$ for 64-bit architectures, ShBF_M will achieve almost the same FPR as BF.

Figure 3: FPR vs. $\bar{w}$. Panel (a): fixed $m$ and $n$, varying $k$; panel (b): $k = 10$, fixed $n$, varying $m$.

Optimum Value of $k$: Now we calculate the value of $k$ that minimizes the FPR calculated in Equation (1). The standard method to obtain the optimal value of $k$ is to differentiate Equation (1) with respect to $k$, equate it to 0, i.e., $\frac{\partial f_{ShBF_M}}{\partial k} = 0$, and solve this equation for $k$. Unfortunately, this method does not yield a closed form solution for $k$.
Thus, we use standard numerical methods to solve the equation $\frac{\partial f_{ShBF_M}}{\partial k} = 0$ to get the optimal value of $k$ for given values of $m$, $n$, and $\bar{w}$. For $\bar{w} = 57$, differentiating Equation (1) with respect to $k$ and solving for $k$ results in the following optimum value of $k$.

$$k_{opt} = 0.7009\,\frac{m}{n}$$

Substituting the value of $k_{opt}$ from the equation above into Equation (1), the minimum value of $f_{ShBF_M}$ is given by the following equation.

$$f^{min}_{ShBF_M} = 0.6204^{\,m/n} \quad (7)$$
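The numerical step can be reproduced with a few lines of Python. The sketch below evaluates Equation (1) and minimizes it by a plain grid search over real-valued $k$; the grid search, the function names, and the sample values of $m$ and $n$ are illustrative assumptions, since the text only says that standard numerical methods are used.

```python
import math


def fpr_shbf_m(k: float, m: int, n: int, w_bar: int) -> float:
    """Equation (1): approximate FPR of ShBF_M, with p = exp(-n*k/m)."""
    p = math.exp(-n * k / m)
    return ((1 - p) * (1 - p + p * p / (w_bar - 1))) ** (k / 2)


def optimal_k(m: int, n: int, w_bar: int, step: float = 1e-3):
    """Grid search over real-valued k in (0, 2m/n]; returns (k_opt, f_min)."""
    best_k, best_f = step, float("inf")
    steps = int((2 * m / n) / step)
    for j in range(1, steps + 1):
        k = j * step
        f = fpr_shbf_m(k, m, n, w_bar)
        if f < best_f:
            best_k, best_f = k, f
    return best_k, best_f


m, n = 100000, 10000  # illustrative sizes, m/n = 10
k_opt, f_min = optimal_k(m, n, w_bar=57)
print(k_opt * n / m)       # coefficient of m/n in k_opt, close to 0.70
print(f_min ** (n / m))    # base of the m/n-th power in f_min, close to 0.62
```

Passing a very large `w_bar` makes the $p^2/(\bar{w}-1)$ term vanish, so the same search then recovers the familiar standard-BF optimum of $k = (m/n)\ln 2$, which is a convenient sanity check on the code.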

3.5 Comparison of ShBF_M FPR with BF FPR

Our theoretical comparison of ShBF_M and BF shows that the FPR of ShBF_M is almost the same as that of BF. Figure 4 plots FPRs of ShBF_M and BF using Equations (1) and (8), respectively, for a fixed $m$ and several values of $n$ (including $n = 10000$). The dashed lines in the figure correspond to ShBF_M whereas the solid lines correspond to BF. We observe from this figure that the sacrificed FPR of ShBF_M in comparison with the FPR of BF is negligible, while the number of memory accesses and hash computations of ShBF_M are half in comparison with BF.

Figure 4: ShBF_M FPR vs. BF FPR (y-axis: FP rate; one pair of curves per value of $n$).

Next, we formally arrive at this result. We calculate the minimum FPR of BF as we calculated for ShBF_M in Equation (7) and show that the two FPRs are practically equal. For a membership query of an element $u$ that does not belong to set $S$, just like ShBF_M, BF can also report true with a small probability, which is the FPR of BF and has been well studied in the literature [3]. It is given by the following equation.

$$f_{BF} = \left(1 - \left(1 - \frac{1}{m}\right)^{nk}\right)^{k} \approx \left(1 - e^{-nk/m}\right)^{k} \quad (8)$$

For given values of $m$ and $n$, the value of $k$ that minimizes $f_{BF}$ is $k = \frac{m}{n}\ln 2 = 0.6931\,\frac{m}{n}$. Substituting this value of $k$ into Equation (8), the minimum value of $f_{BF}$ is given by the following equation.

$$f^{min}_{BF} = \left(\frac{1}{2}\right)^{\frac{m}{n}\ln 2} \approx 0.6185^{\,m/n} \quad (9)$$

By comparing Equations (7) and (9), we observe that the FPRs of ShBF_M and BF are almost the same. Thus, ShBF_M achieves almost the same FPR as BF while reducing the number of hash computations and memory accesses by half.

3.6 Generalization of ShBF_M

As mentioned earlier, ShBF_M reduces $k$ independent hash functions to $k/2 + 1$ independent hash functions. Consequently, it calculates $k/2$ locations independently, and the remaining $k/2$ locations are correlated through the equation $h_i(e) + o_1(e)$ $(1 \le i \le k/2)$. Carrying this construction strategy one step further, one could replace the first $k/2$ hash functions with $k/4$ independent hash functions and an offset $o_2(e)$, i.e., $h_j(e) + o_2(e)$ $(1 \le j \le k/4)$. Continuing in this manner, one could eventually arrive at $\log(k) + 1$ hash functions.
Unfortunately, it is not trivial to calculate the FPR for this case because $\log(k)$ is seldom an integer. In this subsection, we simplify this log method into a linear method by first using a group of $\frac{k}{t+1}$ $(1 \le t \le k - 1)$ hash functions to calculate $\frac{k}{t+1}$ hash locations and then applying the shifting operation $t$ times on these hash locations.

Consider a group of hash functions comprising $t + 1$ elements, i.e., $\langle h_1(x), h_2(x), \ldots, h_{t+1}(x) \rangle$. After completing the construction phase using this group of hash functions, the probability that any given bit is 0 is $1 - \frac{t+1}{m}$. To insert $n$ elements, we need $\frac{nk}{t+1}$ such group insertion operations. After completing the insertion, the probability $p'$ that one bit is still 0 is given by the following equation.

$$p' = \left(1 - \frac{t+1}{m}\right)^{\frac{nk}{t+1}} \approx e^{-nk/m} \quad (10)$$

Note that this probability formula is essentially the $k$-fold product of $e^{-n/m}$. Thus, we can treat our ShBF_M as a partitioned Bloom filter, where the output of each hash function covers a distinct set of consecutive bits. A suitable setting of $t$ and $\bar{w}$ makes this scheme a partitioned Bloom filter. The equations below calculate the FPR $f$ for this scheme.

$$f = (1 - p')^{\frac{k}{t+1}}\,(f_{group})^{\frac{k}{t+1}} \quad (11)$$

where

$$f_{group} = \frac{1 - p'}{t}\cdot\frac{(1 - p')^{t} - \left(1 - \left(1 - \frac{t}{\bar{w}-1}\right)p'\right)^{t}}{1 - \dfrac{1 - \left(1 - \frac{t}{\bar{w}-1}\right)p'}{1 - p'}} + p'\left(1 - \left(1 - \frac{t}{\bar{w}-1}\right)p'\right)^{t} \quad (12)$$

Due to space limitations, we have moved the derivation of this equation to our extended version at arxiv.org. When $t = 1$, the false positive rate simplifies to $f = (1 - p')^{k/2}\left(1 - p' + \frac{1}{\bar{w}-1}\,p'^2\right)^{k/2}$. Similarly, when $\bar{w}$ goes to infinity, the FPR simplifies to $f = (1 - p')^{k}$, which is the formula for the FPR of a standard Bloom filter.

3.7 False Positive Derivation

In our submission, we did not present this subsection due to space limitations. These are the details of the false positive derivation for interested reviewers.

To query a non-existent object, a false positive occurs when the hash function group returns all 1s, assuming the hash function group is $\langle h_i(x), o_1(x), \ldots, o_t(x) \rangle$ $(1 \le i \le \frac{k}{t+1})$.

(1) If the corresponding bit of $h_i(x)$ is not set by the left bits, then $h_i(x)$ must cause $t$ bits, in the following $\bar{w} - 1$ bits, to be set to 1.
There are $t + 1$ situations in total. Note that the $r$-th ($r \in [0, t]$) situation represents that the corresponding bits of $r$ hash functions out of $t$ are set by $h_i(x)$, and the other $t - r$ bits are not set by $h_i(x)$. Considering the corresponding bits of the hash function subset $\langle o_1(x), \ldots, o_t(x) \rangle$, the probability that each of these bits is set by $h_i(x)$ is

$$\lambda_1 = \frac{t}{\bar{w}-1} \quad (13)$$

and the probability that each of these bits is not set by $h_i(x)$ is

$$\lambda_2 = \left(1 - \frac{t}{\bar{w}-1}\right)(1 - p') \quad (14)$$

Therefore, the probability that the $r$-th situation occurs is $C_t^r \lambda_1^r \lambda_2^{t-r}$. The total probability over the $t + 1$ situations is

$$f_I = \sum_{r=0}^{t} C_t^r \lambda_1^r \lambda_2^{t-r} = (\lambda_1 + \lambda_2)^t \quad (15)$$

(2) If the corresponding bit of $h_i(x)$ is set by the left bits, then the problem can be divided into $t$ situations, where each situation has a probability of $\frac{1}{t}$. Note that the $l$-th ($l \in [0, t - 1]$) situation represents that the maximum number of bits in the current subgroup $\langle o_1(x), \ldots, o_t(x) \rangle$ which are set to 1 by the previous hash function group (the group causing the corresponding bit of $h_i(x)$ to be set to 1) is $l$. Since our hash function group adopts a partitioned Bloom filter, each hash function in the previous hash function group can cause at most 1 bit to be set to 1 in the current hash function group. When the corresponding bit of $h_i(x)$ in the previous hash function group is located in the first $\frac{\bar{w}-1}{t}$ bits, the previous hash function group will cause 0 bits to be set to 1 in the current subgroup $\langle o_1(x), \ldots, o_t(x) \rangle$, because there is only one possibility: the last hash function in the previous group causes the bit of the current $h_i(x)$ to be set to 1. This is the 0-th situation. Similarly, for the $l$-th situation, if the corresponding bit of $h_i(x)$ in the previous hash function group is located in $\left[\frac{\bar{w}-1}{t}\,l,\ \frac{\bar{w}-1}{t}\,(l+1)\right)$, there are at most $l$ bits in the current hash function group set by the previous group, and at least $t - l$ bits in the current hash function group not set by the previous group.
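Therefore, for the $l$-th situation, the probability that all the bits in the current hash function group are 1 is

$$f_l = \frac{1}{t}\left(\sum_{r=0}^{l} C_l^r \lambda_1^r \lambda_2^{l-r}\right)(1 - p')^{t-l} = \frac{1}{t}(\lambda_1 + \lambda_2)^l (1 - p')^{t-l} \quad (16)$$

The total probability over all the $t$ situations is

$$f_{II} = \sum_{l=0}^{t-1} f_l = \frac{1}{t}\sum_{l=0}^{t-1} (\lambda_1 + \lambda_2)^l (1 - p')^{t-l} = \frac{1}{t}(\lambda_1 + \lambda_2)^0 (1 - p')^{t} + \cdots + \frac{1}{t}(\lambda_1 + \lambda_2)^{t-1}(1 - p')^{t-(t-1)} \quad (17)$$

Assuming $x = \sum_{l=0}^{t-1} (\lambda_1 + \lambda_2)^l (1 - p')^{t-l}$, we get the following equation.

$$x \cdot \frac{\lambda_1 + \lambda_2}{1 - p'} = x - (1 - p')^t + (\lambda_1 + \lambda_2)^t \quad (18)$$

By solving Equation (18), we obtain the solution

$$x = \frac{(1 - p')^t - (\lambda_1 + \lambda_2)^t}{1 - \dfrac{\lambda_1 + \lambda_2}{1 - p'}} \quad (19)$$

The probability that the bit of $h_i(x)$ is set by the left bits, i.e., that case (2) happens, is $C_{\bar{w}-1}^{1}(1 - p')\frac{1}{\bar{w}-1} = 1 - p'$, so the probability that case (1) happens is $1 - (1 - p') = p'$. Combining (1) and (2), and using $\lambda_1 + \lambda_2 = 1 - \left(1 - \frac{t}{\bar{w}-1}\right)p'$, we know that when the bit of $h_i(x)$ is 1, the probability that all the bits of the current group $\langle o_1(x), \ldots, o_t(x) \rangle$ are 1 is

$$f_{group} = \frac{1 - p'}{t}\cdot\frac{(1 - p')^{t} - \left(1 - \left(1 - \frac{t}{\bar{w}-1}\right)p'\right)^{t}}{1 - \dfrac{1 - \left(1 - \frac{t}{\bar{w}-1}\right)p'}{1 - p'}} + p'\left(1 - \left(1 - \frac{t}{\bar{w}-1}\right)p'\right)^{t} \quad (20)$$

The probability that the corresponding bits of the first $\frac{k}{t+1}$ hash functions are 1 is $(1 - p')^{\frac{k}{t+1}}$. Therefore, the false positive rate of the generalized ShBF_M is

$$f = (1 - p')^{\frac{k}{t+1}}\,(f_{group})^{\frac{k}{t+1}} \quad (21)$$

In particular, when $t = 1$, the false positive rate simplifies to $f = (1 - p')^{k/2}\left(1 - p' + \frac{1}{\bar{w}-1}\,p'^2\right)^{k/2}$. Similarly, when $\bar{w}$ goes to infinity, the false positive rate is $f = (1 - p')^{k}$, which is the formula for a standard Bloom filter.

4. ASSOCIATION QUERIES

In this section, we first describe the construction and query phases of ShBF for association queries, which are also called membership test. We use ShBF_A to denote the ShBF scheme for association queries. Second, we describe the updating methods of ShBF_A. Third, we derive the FPR of ShBF_A. Last, we analytically compare the performance of ShBF_A with that of iBF.

4.1 ShBF_A Construction Phase

The construction phase of ShBF_A proceeds in the following steps. Let $h_1(.), \ldots, h_k(.)$ be $k$ independent hash functions with uniformly distributed outputs. Let $S_1$ and $S_2$ be the two given sets. First, ShBF_A constructs a hash table $T_1$ for set $S_1$ and a hash table $T_2$ for set $S_2$.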
Second, it constructs an array $B$ of $m$ bits, where each bit is initialized to 0. Third, for each element $e \in S_1$, to store its existence information, ShBF_A calculates $k$ hash functions $h_1(e)\%m, \ldots, h_k(e)\%m$ and searches for $e$ in $T_2$. If it does not find $e$ in $T_2$, to store its auxiliary information, it sets the offset $o(e) = 0$. However, if it does find $e$ in $T_2$, to store its auxiliary information, it calculates the offset $o(e)$ as $o(e) = o_1(e) = h_{k+1}(e)\%((\bar{w} - 1)/2) + 1$, where $h_{k+1}(.)$ is a hash function with uniformly distributed output and $\bar{w}$ is a function of the machine word size $w$, which we will discuss shortly. Fourth, it sets the $k$ bits $B[h_1(e)\%m + o(e)], \ldots, B[h_k(e)\%m + o(e)]$ to 1. Fifth, for each element $e \in S_2$, to store its existence information, ShBF_A calculates the $k$ hash functions and searches for $e$ in $T_1$. If it finds $e$ in $T_1$, it does not do anything because its existence as well as its auxiliary information have already been stored in the array $B$. However, if it does not find $e$ in $T_1$, to store its auxiliary information, it calculates the offset $o(e)$ as $o(e) = o_2(e) = o_1(e) + h_{k+2}(e)\%((\bar{w} - 1)/2) + 1$, where $h_{k+2}(.)$ is also a hash function with uniformly distributed output. Last, it sets the $k$ bits $B[h_1(e)\%m + o(e)], \ldots, B[h_k(e)\%m + o(e)]$ to 1.

To ensure that ShBF_A can read $B[h_i(e)\%m]$, $B[h_i(e)\%m + o_1(e)]$, and $B[h_i(e)\%m + o_2(e)]$ in a single memory access when querying, we let $\bar{w} \le w - 7$. We derived this condition $\bar{w} \le w - 7$ earlier at the end of Section 3.1. As the maximum value of $h_i(e)\%m + o_2(e)$ can be equal to $m + \bar{w} - 2$, we append the $m$-bit array $B$ with $\bar{w} - 2$ bits.

4.2 ShBF_A Query Phase

We assume, for convenience, that the incoming elements always belong to $S_1 \cup S_2$ in the load balance application¹. To query an element $e \in S_1 \cup S_2$, ShBF_A finds out which sets the element $e$ belongs to in the following three steps. First, it computes $o_1(e)$, $o_2(e)$, and the $k$ hash functions $h_i(e)\%m$ ($1 \le i \le k$). Second, for each $1 \le i \le k$, it reads the 3 bits $B[h_i(e)\%m]$, $B[h_i(e)\%m + o_1(e)]$, and $B[h_i(e)\%m + o_2(e)]$. Third, for these $3k$ bits, if all the $k$ bits $B[h_1(e)\%m], \ldots, B[h_k(e)\%m]$ are 1, $e$ may belong to $S_1 - S_2$. In this case, ShBF_A records (but does not yet declare) $e \in S_1 - S_2$. Similarly, if all the $k$ bits $B[h_1(e)\%m + o_1(e)], \ldots, B[h_k(e)\%m + o_1(e)]$ are 1, $e$ may belong to $S_1 \cap S_2$ and ShBF_A records $e \in S_1 \cap S_2$. Finally, if all the $k$ bits $B[h_1(e)\%m + o_2(e)], \ldots, B[h_k(e)\%m + o_2(e)]$ are 1, $e$ may belong to $S_2 - S_1$ and ShBF_A records $e \in S_2 - S_1$.

¹ The application is mentioned in the first paragraph of the Introduction.

Based on what ShBF_A recorded after analyzing the $3k$ bits, there are the following 7 outcomes. If ShBF_A records that:
1. only $e \in S_1 - S_2$, it declares that $e$ belongs to $S_1 - S_2$.
2. only $e \in S_1 \cap S_2$, it declares that $e$ belongs to $S_1 \cap S_2$.
3. only $e \in S_2 - S_1$, it declares that $e$ belongs to $S_2 - S_1$.
4. both $e \in S_1 - S_2$ and $e \in S_1 \cap S_2$, it declares that $e$ belongs to $S_1$ but is unsure whether or not it belongs to $S_2$.
5. both $e \in S_2 - S_1$ and $e \in S_1 \cap S_2$, it declares that $e$ belongs to $S_2$ but is unsure whether or not it belongs to $S_1$.
6. both $e \in S_1 - S_2$ and $e \in S_2 - S_1$, it declares that $e$ belongs to $(S_1 - S_2) \cup (S_2 - S_1)$.
7. all of $e \in S_1 - S_2$, $e \in S_1 \cap S_2$, and $e \in S_2 - S_1$, it declares that $e$ belongs to $S_1 \cup S_2$.

Note that for all these seven outcomes, the decisions of ShBF_A do not suffer from false positives or false negatives. However, decisions 4 through 6 provide slightly incomplete information, and decision 7 does not provide any information because it is already given that $e$ belongs to $S_1 \cup S_2$. We will shortly show that the probability that the decision of ShBF_A is one of the decisions 4 through 7 is very small, which means that with very high probability it gives a decision with clear meaning, which we call a clear answer.

4.3 ShBF_A Updating

Just as BF handles updates by replacing each bit with a counter, we can also extend ShBF_A to handle updates by replacing each bit with a counter. We use CShBF_A to denote this counting version of ShBF_A. Let $C$ denote the array of $m$ counters. To insert an element $e$, after querying $T_1$ and $T_2$ and determining whether $o(e) = 0$, $o_1(e)$, or $o_2(e)$, instead of setting $k$ bits to 1, we increment each of the corresponding $k$ counters by 1; that is, we increment the $k$ counters $C[h_1(e)\%m + o(e)], \ldots, C[h_k(e)\%m + o(e)]$ by 1. To delete an element $e$, after querying $T_1$ and $T_2$ and determining whether $o(e) = 0$, $o_1(e)$, or $o_2(e)$, we decrement $C[h_i(e)\%m + o(e)]$ by 1 for all $1 \le i \le k$. To have the benefits of both fast query processing and small memory consumption, we maintain both ShBF_A and CShBF_A, but store array $B$ in fast SRAM and array $C$ in slow DRAM. After each update, we synchronize array $C$ with array $B$.

4.4 ShBF_A Analysis

Recall from Section 4.2 that ShBF_A may report seven different outcomes. Next, we calculate the probability of each outcome. Let $P_i$ denote the probability of the $i$-th outcome. Before proceeding, we show that $h_i(.) + o(.)$ and $h_j(.) + o(.)$, when $i \ne j$, are independent of each other.
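Before the independence argument, it may help to see the construction phase of Section 4.1 and the query phase of Section 4.2 as code. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: `seeded_hash` (BLAKE2b with a seed-derived salt) is a hypothetical stand-in for $h_1, \ldots, h_{k+2}$, Python sets play the role of the hash tables $T_1$ and $T_2$, each bit is stored in a whole byte for readability, and `w_bar = 57` assumes a 64-bit machine word. The names `ShBFA`, `build` and `query` are illustrative only.

```python
import hashlib


def seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of `item` under `seed` (assumed stand-in for h_1 ... h_{k+2})."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class ShBFA:
    """Illustrative ShBF_A: the offset encodes S1-S2, S1&S2 or S2-S1."""

    def __init__(self, m: int, k: int, w_bar: int = 57):
        # w_bar = 57 satisfies w_bar <= w - 7 for an assumed 64-bit word.
        self.m, self.k = m, k
        self.half = (w_bar - 1) // 2
        # One byte per bit for readability. The largest index touched is
        # m + w_bar - 2, so m + w_bar - 1 zero-indexed cells are allocated.
        self.bits = bytearray(m + w_bar - 1)

    def _bases(self, e: str) -> list:
        return [seeded_hash(i, e) % self.m for i in range(1, self.k + 1)]

    def _o1(self, e: str) -> int:
        return seeded_hash(self.k + 1, e) % self.half + 1

    def _o2(self, e: str) -> int:
        return self._o1(e) + seeded_hash(self.k + 2, e) % self.half + 1

    def build(self, s1: set, s2: set) -> None:
        for e in s1 | s2:
            if e in s1 and e in s2:
                offset = self._o1(e)  # e in S1 & S2
            elif e in s1:
                offset = 0            # e in S1 - S2
            else:
                offset = self._o2(e)  # e in S2 - S1
            for base in self._bases(e):
                self.bits[base + offset] = 1

    def query(self, e: str) -> set:
        """Return the regions recorded for e; a single label is a clear answer."""
        bases = self._bases(e)
        offsets = {"S1-S2": 0, "S1&S2": self._o1(e), "S2-S1": self._o2(e)}
        return {
            label
            for label, off in offsets.items()
            if all(self.bits[b + off] for b in bases)
        }


s1 = {"a", "b", "c"}
s2 = {"b", "c", "d"}
f = ShBFA(m=1024, k=4)
f.build(s1, s2)
print(f.query("a"), f.query("b"), f.query("d"))
```

The returned set maps directly onto the seven outcomes: one label is a clear answer, two labels correspond to outcomes 4 through 6, and three labels correspond to outcome 7. For an element of $S_1 \cup S_2$ the true region is always among the recorded labels, since its bits were set during construction.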
For this we show that given two random variables $X$ and $Y$ and a number $z \in \mathbb{R}^+$, where $\mathbb{R}^+$ is the set of positive real numbers, if $X$ and $Y$ are independent, then $X + z$ and $Y + z$ are independent. As $X$ and $Y$ are independent, for any $x \in \mathbb{R}$ and $y \in \mathbb{R}$, we have

$$P(X \le x, Y \le y) = P(X \le x)\,P(Y \le y) \tag{22}$$

Adding $z$ to both sides of all inequality signs in $P(X \le x, Y \le y)$, we get

$$\begin{aligned} P(X + z \le x + z,\ Y + z \le y + z) &= P(X \le x, Y \le y) \\ &= P(X \le x)\,P(Y \le y) \\ &= P(X + z \le x + z)\,P(Y + z \le y + z) \end{aligned} \tag{23}$$

Therefore, $X + z$ and $Y + z$ are independent.

Let $n$ be the number of distinct elements in $S_1 \cup S_2$, and let $k$ be the number of hash functions. After inserting all $n$ elements into ShBF_A, the probability $p$ that any given bit is still 0 is given by the following equation.

$$p = \left(1 - \frac{1}{m}\right)^{kn} \tag{24}$$

This is similar to one minus the false positive probability of a standard BF. When $k = \frac{m}{n} \ln 2$, $p \approx 0.5$.

Note that the probabilities for outcomes 1, 2, and 3 are the same. Similarly, the probabilities for outcomes 4, 5, and 6 are also the same. The following equations state the expressions for these probabilities.

$$P_1 = P_2 = P_3 = (1 - 0.5^k)^2, \qquad P_4 = P_5 = P_6 = 0.5^k (1 - 0.5^k), \qquad P_7 = (0.5^k)^2 \tag{25}$$

When the incoming element $e$ actually belongs to one of the three sets $S_1 - S_2$, $S_1 \cap S_2$, and $S_2 - S_1$, exactly four outcomes are possible: the one clear outcome for that set, the two incomplete-information outcomes that involve that set, and outcome 7. Consequently, the total probability is $P_1 + 2P_4 + P_7$, which equals 1. This validates our derivation of the expressions in Equation 25.

As an example, let $k = \frac{m}{n} \ln 2 = 10$. Thus, $P_1 = P_2 = P_3 = (1 - 0.5^{10})^2 \approx 0.998$, $P_4 = P_5 = P_6 = 0.5^{10}(1 - 0.5^{10}) \approx 9.76 \times 10^{-4}$, and $P_7 = (0.5^{10})^2 \approx 9.54 \times 10^{-7}$. This example shows that with probability of about 0.998, ShBF_A gives a clear answer, and each answer with incomplete information occurs with probability of only about $9.76 \times 10^{-4}$. The probability with which it gives an answer with no information is just about $9.54 \times 10^{-7}$, which is negligibly small.

4.5 Comparison of ShBF_A with iBF

For association queries, a straightforward solution is to build one individual BF (iBF) for each set. Let $n_1$, $n_2$, and $n_3$ be the number of elements in $S_1$, $S_2$, and $S_1 \cap S_2$, respectively. For iBF, let $m_1$ and $m_2$ be the size of the Bloom filter for $S_1$ and $S_2$, respectively. Table 2 presents a comparison between ShBF_A and iBF. We observe from the table that ShBF_A needs less memory, fewer hash computations, and fewer memory accesses, and has no false positives. For iBF, as we use a traffic trace that hits the two sets with the same probability, iBF is optimal when the two BFs use identical values for the optimal system parameters and have the same number of hash functions. Specifically, for iBF, when $m_1 + m_2 = (n_1 + n_2)k / \ln 2$, the probability of a clear answer is $\frac{2}{3}(1 - 0.5^k)$. For ShBF_A, when $m = (n_1 + n_2 - n_3)k / \ln 2$, the probability of a clear answer is $(1 - 0.5^k)^2$.

5. MULTIPLICITY QUERIES

In this section, we first present the construction and query phases of ShBF for multiplicity queries. Multiplicity queries check how many times an element appears in a multi-set. We use ShBF_X to denote the ShBF scheme for multiplicity queries. Second, we describe the updating methods of ShBF_X. Last, we derive the FPR and correctness rate of ShBF_X.

5.1 ShBF_X Construction Phase

The construction phase of ShBF_X proceeds in three steps. Let $h_1(.), \ldots, h_k(.)$ be $k$ independent hash functions with uniformly distributed outputs. First, we construct an array $B$ of $m$ bits, where each bit is initialized to 0. Second, to store the existence information of an element $e$ of multi-set $S$, we calculate $k$ hash values $h_1(e)\%m, \ldots, h_k(e)\%m$. To calculate the auxiliary information of $e$, which in this case is the count $c(e)$ of element $e$ in $S$, we calculate the offset $o(e)$ as $o(e) = c(e) - 1$. Third, we set the $k$ bits $B[h_1(e)\%m + o(e)], \ldots, B[h_k(e)\%m + o(e)]$ to 1.
To determine the value of $c(e)$ for any element $e \in S$, we store the count of each element in a hash table and use the simplest collision handling method, called collision chain.

5.2 ShBF_X Query Phase

Given a query $e$, for each $1 \le i \le k$, we first read $c$ consecutive bits $B[h_i(e)\%m], B[h_i(e)\%m + 1], \ldots, B[h_i(e)\%m + c - 1]$ in $\lceil c/w \rceil$ memory accesses, where $c$ is the maximum value of $c(e)$ for any $e \in S$. In these $k$ arrays of $c$ consecutive bits, for each $1 \le j \le c$, if all the $k$ bits $B[h_1(e)\%m + j - 1], \ldots, B[h_k(e)\%m + j - 1]$ are 1, we list $j$ as a possible candidate of $c(e)$. As the largest candidate of $c(e)$ is always greater than or equal to the actual value of $c(e)$, we report the largest candidate as the multiplicity of $e$ to avoid false negatives. For the query phase, the number of memory accesses is $k \lceil c/w \rceil$.

5.3 ShBF_X Updating

5.3.1 ShBF_X Updating with False Negatives

To handle element insertion and deletion, ShBF_X maintains its counting version, denoted by CShBF_X, which is an array $C$ that consists of $m$ counters, in addition to an array $B$ of $m$ bits. During the construction phase, ShBF_X increments the counter $C[h_i(e)\%m + o(e)]$ ($1 \le i \le k$) by one every time it sets $B[h_i(e)\%m + o(e)]$ to 1. During the update, we need to guarantee that an element with multiple multiplicities is always stored in the filter exactly once. Specifically, for every new element $e$ to insert into the multi-set $S$, ShBF_X first obtains its multiplicity $z$ from $B$ as explained in Section 5.2. Second, it deletes the $z$-th multiplicity ($o(e) = z - 1$) and inserts the $(z + 1)$-th multiplicity ($o(e) = z$). For this, it calculates the $k$ hash functions $h_i(e)\%m$ and decrements the $k$ counters $C[h_i(e)\%m + z - 1]$ by 1 when the counters are $\ge 1$. Third, if any of the decremented counters becomes 0, it sets the corresponding bit in $B$ to 0. Note that maintaining the array $C$ of counters allows us to reset the right bits in $B$ to 0. Fourth, it increments the $k$ counters $C[h_i(e)\%m + z]$ by 1 and sets the bits $B[h_i(e)\%m + z]$ to 1.

For deleting element $e$, ShBF_X first obtains its multiplicity $z$ from $B$ as explained in Section 5.2.
Second, it calculates the $k$ hash functions and decrements the $k$ counters $C[h_i(e)\%m + z - 1]$ by 1. Third, if any of the decremented counters becomes 0, it sets the corresponding bit in $B$ to 0. Fourth, it increments the counters $C[h_i(e)\%m + z - 2]$ by 1 and sets the bits $B[h_i(e)\%m + z - 2]$ to 1 (this applies when $z \ge 2$; when $z = 1$ the element is removed entirely).

Note that ShBF_X may introduce false negatives because before updating the multiplicity of an element, we first query its current multiplicity from $B$. If the answer to that query is a false positive, i.e., the actual multiplicity of the element is less than the answer, ShBF_X will perform the second step and decrement some counters, which may cause a counter to decrement to 0. Thus, in the third step, it will set the corresponding bit in $B$ to 0, which will cause false negatives.

5.3.2 ShBF_X Updating without False Negatives

To eliminate false negatives, in addition to arrays $B$ and $C$, ShBF_X maintains a hash table to store the count of each element. In the hash table, each entry has two fields: the element and its count/multiplicity. When inserting or deleting element $e$, ShBF_X follows the four steps shown in Figure 5. First, we obtain $e$'s count/multiplicity $z$ from the hash table instead of ShBF_X, and update the hash table. Second, we delete $e$'s $z$-th multiplicity from CShBF_X. Third, if a counter in CShBF_X decreases to 0, we set the corresponding bit in ShBF_X to 0. Fourth, when inserting/deleting $e$, we insert the $(z + 1)$-th/$(z - 1)$-th multiplicity into ShBF_X.

[Figure 5: The update process of ShBF_X. The hash table (each bucket stores an element and its count/multiplicity) and the counting ShBF_X are off-chip; ShBF_X is on-chip.]
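The construction, query, and false-negative-free update steps of Sections 5.1 to 5.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: `seeded_hash` (BLAKE2b with a seed-derived salt) is a hypothetical stand-in for $h_1, \ldots, h_k$, a Python `dict` plays the role of the hash table, each bit of $B$ is stored in a whole byte, and the names `ShBFX`, `insert`, `delete` and `query` are illustrative. The array is extended by `c_max` cells so that every offset $0, \ldots, c - 1$ stays in range.

```python
import hashlib


def seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of `item` under `seed` (assumed stand-in for h_1 ... h_k)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class ShBFX:
    """Illustrative ShBF_X with counters C and an exact hash table of counts."""

    def __init__(self, m: int, k: int, c_max: int):
        self.m, self.k, self.c_max = m, k, c_max
        size = m + c_max                # offsets range over 0 .. c_max - 1
        self.bits = bytearray(size)     # array B, one byte per bit
        self.counters = [0] * size      # array C (counting version)
        self.table = {}                 # element -> exact multiplicity

    def _bases(self, e: str) -> list:
        return [seeded_hash(i, e) % self.m for i in range(1, self.k + 1)]

    def _store(self, e: str, z: int) -> None:
        """Insert the z-th multiplicity, i.e. offset o(e) = z - 1."""
        for b in self._bases(e):
            self.counters[b + z - 1] += 1
            self.bits[b + z - 1] = 1

    def _remove(self, e: str, z: int) -> None:
        """Delete the z-th multiplicity; clear a bit once its counter hits 0."""
        for b in self._bases(e):
            self.counters[b + z - 1] -= 1
            if self.counters[b + z - 1] == 0:
                self.bits[b + z - 1] = 0

    def insert(self, e: str) -> None:
        z = self.table.get(e, 0)        # step 1: count from the hash table
        if z >= self.c_max:
            raise ValueError("multiplicity would exceed c_max")
        if z > 0:
            self._remove(e, z)          # steps 2 and 3
        self._store(e, z + 1)           # step 4
        self.table[e] = z + 1

    def delete(self, e: str) -> None:
        z = self.table.get(e, 0)
        if z == 0:
            return                      # element not present
        self._remove(e, z)
        if z > 1:
            self._store(e, z - 1)
            self.table[e] = z - 1
        else:
            del self.table[e]

    def query(self, e: str) -> int:
        """Report the largest candidate multiplicity (0 if there is none)."""
        bases = self._bases(e)
        for j in range(self.c_max, 0, -1):
            if all(self.bits[b + j - 1] for b in bases):
                return j
        return 0


x = ShBFX(m=4096, k=4, c_max=8)
for _ in range(3):
    x.insert("flow-1")
print(x.query("flow-1"))  # never smaller than the true multiplicity, 3
```

Because the multiplicity used for updates comes from the exact table, no bit belonging to a stored multiplicity is ever cleared by mistake, so `query` can overestimate but never underestimate.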
Note that although the counter array $C$ and the hash table are much larger than the bit array $B$, we store $B$ in SRAM for processing multiplicity queries and store $C$ and the hash table in DRAM for handling updates.

Table 2: Comparison between ShBF_A and iBF.
Scheme | Optimal memory | #hash computations | #memory accesses | Probability of a clear answer | False positives
iBF | $m_1 + m_2 = (n_1 + n_2)k / \ln 2$ | $2k$ | $2k$ | $\frac{2}{3}(1 - 0.5^k)$ | YES
ShBF_A | $m = (n_1 + n_2 - n_3)k / \ln 2$ | $k + 2$ | $k$ | $(1 - 0.5^k)^2$ | NO

5.4 ShBF_X Analysis

For multiplicity queries, a false positive is defined as reporting a multiplicity for an element that is larger than its actual multiplicity. For any element $e$ belonging to multi-set $S$, ShBF_X only sets $k$ bits in $B$ to 1 regardless of how many times it appears in $S$. This is because every time information about $e$ is updated, ShBF_X removes the existing multiplicity information of the element before adding the new information. Let the total number of distinct elements in set $S$ be $n$. The probability that an element is reported to be present $j$ times is given by the following equation.

$$f_0 \approx \left(1 - e^{-kn/m}\right)^k \tag{26}$$

We define a metric called correctness rate, which is the probability that an element that is present $j$ times in a multi-set is correctly reported to be present $j$ times. When querying an element not belonging to the set, the correctness rate CR is given by the following equation.

$$CR = (1 - f_0)^c \tag{27}$$

When querying an element with multiplicity $j$ ($1 \le j \le c$) in the set, the correctness rate CR is given by the following equation.

$$CR = (1 - f_0)^{j - 1} \tag{28}$$

Note that the right-hand side of the expression for CR is not multiplied by $f_0$ because when $e$ has $j$ multiplicities, all positions $h_i(e) + j$, where $1 \le i \le k$, must be 1.

5.5 Shifting Count-min Sketch

Besides Spectral BF, the count-min sketch (CM sketch) can also be used to record and report the number of multiplicities of each element [9]. As shown in Figure 6(a), a CM sketch consists of $d$ vectors, and each vector has $r$ counters. Each vector $v_i$ ($1 \le i \le d$) corresponds to a hash function $h_i(.)$. When inserting an element $e$, the CM sketch increments the $d$ counters $v_1[h_1(e)], \ldots, v_d[h_d(e)]$ by 1. When querying an element $e$, the CM sketch reports the minimum value of $v_1[h_1(e)], \ldots, v_d[h_d(e)]$. The CM sketch is simple and easy to implement, but is not memory efficient, as the minimal unit is a counter instead of a bit.

[Figure 6(a): CM sketch.]
[Figure: FPR vs. w.]
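The CM sketch insertion and query rules described above can be sketched in Python. This is a minimal illustration, not code from the paper: `seeded_hash` (BLAKE2b with a seed-derived salt) is an assumed stand-in for $h_1, \ldots, h_d$, each hash value is reduced modulo $r$ to index a vector (an assumption, since the text writes $v_i[h_i(e)]$ directly), and the class and method names are hypothetical.

```python
import hashlib


def seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of `item` under `seed` (assumed stand-in for h_1 ... h_d)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class CountMinSketch:
    """d vectors of r counters; vector v_i is indexed by hash function h_i."""

    def __init__(self, d: int, r: int):
        self.d, self.r = d, r
        self.rows = [[0] * r for _ in range(d)]

    def _index(self, i: int, e: str) -> int:
        # Reduce h_i(e) modulo r so that it addresses one of the r counters.
        return seeded_hash(i + 1, e) % self.r

    def insert(self, e: str) -> None:
        """Increment one counter in each of the d vectors."""
        for i in range(self.d):
            self.rows[i][self._index(i, e)] += 1

    def query(self, e: str) -> int:
        """Report the minimum of the d counters that e maps to."""
        return min(self.rows[i][self._index(i, e)] for i in range(self.d))


cm = CountMinSketch(d=4, r=64)
for _ in range(5):
    cm.insert("flow-1")
cm.insert("flow-2")
print(cm.query("flow-1"))  # at least 5; collisions can only inflate it
```

Taking the minimum over the $d$ counters limits the effect of collisions: the estimate is exact unless every one of the $d$ counters for the element is shared with some other element.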
[Figure 6(b): SCM sketch.]

One query on a CM sketch needs $d$ hash computations and $d$ memory accesses when the length of all counters is smaller than a machine word. In such a case, we can use our shifting framework to halve the number of memory accesses and hash functions. Figure 6(b) shows our shifting version of the CM sketch, called the shifting count-min (SCM) sketch. The SCM sketch consists of $d/2$ vectors $v_i$ ($1 \le i \le d/2$), where each vector has $r$ counters. Each vector $v_i$ corresponds to a hash function $h_i(.)$. When inserting an element $e$, first, the SCM sketch increments the $d/2$ counters $v_1[h_1(e)], \ldots, v_{d/2}[h_{d/2}(e)]$ by 1. Second, it increments the $d/2$ counters $v_1[h_1(e) + o(e)], \ldots, v_{d/2}[h_{d/2}(e) + o(e)]$ by 1. When querying an element $e$, the SCM sketch reports the minimal value of $v_1[h_1(e)], \ldots, v_{d/2}[h_{d/2}(e)]$ and $v_1[h_1(e) + o(e)], \ldots, v_{d/2}[h_{d/2}(e) + o(e)]$. To read $v_i[h_i(e)]$ and $v_i[h_i(e) + o(e)]$ in one memory access, we set $o(e) = h_{d/2+1}(e)\%(\bar{w} - 1) + 1$, where $\bar{w} \le (w - 7)/r$ and $w$ is the number of bits in a machine word.

6. PERFORMANCE EVALUATION

In this section, we conduct experiments to evaluate our ShBF schemes and compare them side by side with state-of-the-art solutions for the three types of set queries.

6.1 Experimental Setup

We give a brief overview of the data we have used for evaluation and describe our experimental setup.

Data set: We evaluate the performance of ShBF and state-of-the-art solutions using real-world network traces. Specifically, we deployed our traffic capturing system on a 10Gbps link of a backbone router. To reduce the processing load, our traffic capturing system consists of two parallel sub-systems, each of which is equipped with a 10G network card and uses netmap to capture packets. Due to the high link speed, capturing the entire traffic was infeasible because our device could not access/write to memory at such a high speed. Thus, we only captured the 5-tuple flow ID of each packet, which consists of source IP, source port, destination IP, destination port, and protocol type. We stored each 5-tuple flow ID as a 13-byte string, which is used as an element of a set during evaluation. We collected a total of 10 million 5-tuple flow IDs, out of which 8 million flow IDs are distinct. To further evaluate the accuracy of our proposed schemes, we also generated and used synthetic data sets.

Hash functions: We collected several hash functions from an open source web site [1] and tested them for randomness. Our criterion for testing randomness is that the probability of seeing 1 at any bit location in the hashed value should be 0.5. To test the randomness of each hash function, we first used that hash function to compute the hash value of the 8 million unique elements in our data set. Then, for each bit location, we calculated the fraction of times 1 appeared in the hash values to empirically calculate the probability of seeing 1 at that bit location. Out of all hash functions, 18 hash functions passed our randomness test, which we used for the evaluation of ShBF and state-of-the-art solutions.

Implementation: We implemented our query processing schemes in C++ using the Visual C++ 2012 platform. To compute average query processing speeds, we repeat our experiments 1000 times and take the average. Furthermore, we conducted all our experiments for 20 different sets of
