b-bit Minwise Hashing


b-bit Minwise Hashing

Ping Li, Department of Statistical Science, Faculty of Computing and Information Science, Cornell University, Ithaca, NY 14853
Arnd Christian König, Microsoft Research, Microsoft Corporation, Redmond, WA 98052, chriso@microsoft.com

ABSTRACT
This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance), with applications in information retrieval, data management, computational advertising, etc. By only storing b bits of each hashed value (e.g., b = 1 or b = 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b = 1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to b = 64 (or b = 32), if one is interested in resemblance >= 0.5.

Categories and Subject Descriptors: H.2.8 [Database Applications]: Data Mining
General Terms: Algorithms, Performance, Theory

1. INTRODUCTION
Computing the size of set intersections is a fundamental problem in information retrieval, databases, and machine learning. Given two sets, S_1 and S_2, where S_1, S_2 ⊆ Ω = {0, 1, 2, ..., D − 1}, a basic task is to compute the joint size a = |S_1 ∩ S_2|, which measures the (un-normalized) similarity between S_1 and S_2. The resemblance, denoted by R, is a normalized similarity measure:

R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = a / (f_1 + f_2 − a),   where f_1 = |S_1|, f_2 = |S_2|.   (1)

In large datasets encountered in information retrieval and databases, efficiently computing the joint sizes is often highly challenging [3,8]. Detecting duplicate web pages is a classical example [4,6]. Typically, each Web document can be processed as a bag of shingles, where a shingle consists of w contiguous words in a document. Here w is a tuning parameter and was set to w = 5 in several studies [4, 6]. Clearly, the total number of possible shingles is huge. Considering merely 10^5 unique English words, the total number of possible 5-shingles is already D = (10^5)^5 = O(10^25). Prior studies used D = 2^64 [] and D = 2^40 [4, 6].

1.1 Minwise Hashing
In their seminal work, Broder and his colleagues developed minwise hashing and successfully applied the technique to duplicate Web page removal [4, 6]. Since then, there have been considerable theoretical and methodological developments [5, 8, 9, 3, 6]. As a general technique for estimating set similarity, minwise hashing has been applied to a wide range of applications, for example, content matching for online advertising [3], detection of large-scale redundancy in enterprise file systems [4], syntactic similarity algorithms for enterprise information management [7], compressing social networks [9], advertising diversification [7], community extraction and classification in the Web graph [], graph sampling [9], wireless sensor networks [5], Web spam [4,33], Web graph compression [7], and text reuse in the Web [].

Here, we give a brief introduction to this algorithm. Suppose a random permutation π is performed on Ω, i.e., π : Ω → Ω, where Ω = {0, 1, ..., D − 1}. An elementary probability argument shows that

Pr( min(π(S_1)) = min(π(S_2)) ) = |S_1 ∩ S_2| / |S_1 ∪ S_2| = R.   (2)

After k minwise independent permutations, π_1, π_2, ..., π_k, one can estimate R without bias, as a binomial:

R̂_M = (1/k) Σ_{j=1}^{k} 1{ min(π_j(S_1)) = min(π_j(S_2)) },   Var(R̂_M) = (1/k) R(1 − R).   (3)

(Footnote: Supported by Microsoft, NSF-DMS and ONR-YIP. Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2010, April 26-30, 2010, Raleigh, North Carolina, USA.)
Throughout the paper, we frequently use the terms "sample" and "sample size" (i.e., k). In minwise hashing, a sample is a hashed value, min(π_j(S_i)), which may require e.g. 64 bits to store [].

1.2 Our Main Contributions
In this paper, we establish a unified theoretical framework for b-bit minwise hashing. Instead of using 64 bits [] or 40 bits [4, 6], our theoretical results suggest that using as few as 1 or 2 bits can yield significant improvements.

In b-bit minwise hashing, a sample consists of b bits only, as opposed to e.g. 64 bits in the original minwise hashing. Intuitively, using fewer bits per sample will increase the estimation variance, compared to (3), at the same sample size k. Thus, we will have to increase k to maintain the same accuracy. Interestingly, our theoretical results will demonstrate that, when the resemblance is not too small (e.g., R >= 0.5, the threshold used in [4, 6]), we do not have to increase k much. This means our proposed b-bit minwise hashing can be used to improve estimation accuracy and significantly reduce storage requirements at the same time. For example, when b = 1 and R = 0.5, the estimation variance will increase at most by a factor of 3. In this case, in order not to lose accuracy, we have to increase the sample size by a factor of 3.

If we originally stored each hashed value using 64 bits [], the improvement by using b = 1 will be 64/3 = 21.3. Algorithm 1 illustrates the procedure of b-bit minwise hashing, based on the theoretical results in Sec. 2.

Algorithm 1: The b-bit minwise hashing algorithm, applied to estimating pairwise resemblances in a collection of N sets.
Input: Sets S_n ⊆ Ω = {0, 1, ..., D − 1}, n = 1 to N.
Pre-processing:
1: Generate k random permutations π_j : Ω → Ω, j = 1 to k.
2: For each set S_n and each permutation π_j, store the lowest b bits of min(π_j(S_n)), denoted by e_{n,i,π_j}, i = 1 to b.
Estimation: (Use two sets S_1 and S_2 as an example.)
1: Compute Ê_b = (1/k) Σ_{j=1}^{k} 1{ Π_{i=1}^{b} 1{e_{1,i,π_j} = e_{2,i,π_j}} = 1 }.
2: Estimate the resemblance by R̂_b = (Ê_b − C_{1,b}) / (1 − C_{2,b}), where C_{1,b} and C_{2,b} are from Theorem 1 in Sec. 2.
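As a concrete illustration of the pre-processing step in Algorithm 1, the following minimal Python sketch produces the k b-bit samples for one set. The function name, the parameter names, and the use of a 2-universal hash in place of a true random permutation are our own illustration, not from the paper.

```python
import random

def bbit_signature(S, D, k, b, seed=0):
    """k b-bit minwise samples of a set S (a subset of {0, ..., D-1}).

    Each 'permutation' pi_j is approximated by a 2-universal hash
    h_j(t) = ((a_j * t + c_j) mod P) mod D, a common practical stand-in
    for a truly random permutation.
    """
    P = (1 << 61) - 1                       # a prime much larger than typical D
    rng = random.Random(seed)
    mask = (1 << b) - 1                     # keeps only the lowest b bits
    sig = []
    for _ in range(k):
        a, c = rng.randrange(1, P), rng.randrange(P)
        z = min(((a * t + c) % P) % D for t in S)   # min(pi_j(S))
        sig.append(z & mask)                        # the b-bit sample
    return sig
```

Packing the resulting b-bit values into 32-bit or 64-bit words, as discussed in Sec. 5, is omitted here for clarity.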
1.3 Comparisons with LSH Algorithms
Locality Sensitive Hashing (LSH) [8,] is a set of techniques for performing approximate search in high dimensions. In the context of estimating set intersections, there exist LSH families for estimating the resemblance, the arccosine, and the Hamming distance []. In [8, 6], the authors describe LSH schemes that map objects to {0, 1} (i.e., 1-bit schemes). The algorithms for the construction, however, are problem specific. Two such 1-bit schemes are sign random projections (also known as simhash) [8] and the Hamming distance LSH algorithm proposed by []. Our b-bit minwise hashing proposes a new construction, which maps objects to {0, 1, ..., 2^b − 1} instead of just {0, 1}. While our major focus is to compare with the original minwise hashing, we also conduct comparisons with the other two known 1-bit schemes.

1.3.1 Sign Random Projections
The method of sign (1-bit) random projections estimates the arccosine, cos^{-1}( a / sqrt(f_1 f_2) ), using our notation for sets S_1 and S_2. A separate technical report is devoted to comparing b-bit minwise hashing with sign (1-bit) random projections; that report demonstrates that, unless the similarity level is very low, b-bit minwise hashing outperforms sign random projections. The method of sign random projections has received significant attention in the context of duplicate detection. According to [8], a great advantage of simhash over minwise hashing is the smaller size of the fingerprints required for duplicate detection. The space reduction of b-bit minwise hashing overcomes this issue.

1.3.2 The Hamming Distance LSH Algorithm
Sec. 4 will compare b-bit minwise hashing with the Hamming distance LSH algorithm developed in [] and surveyed in []: when the Hamming distance LSH algorithm is implemented naively, to achieve the same level of accuracy, its required storage space will be many orders of magnitude larger than that of b-bit minwise hashing in sparse data (i.e., when |S_i|/D is small). If we only store the non-zero locations in the Hamming distance LSH algorithm, then its required storage space will still be about one order of magnitude larger.

2. THE FUNDAMENTAL RESULTS
Consider two sets, S_1 and S_2, S_1, S_2 ⊆ Ω = {0, 1, 2, ..., D − 1}, with f_1 = |S_1|, f_2 = |S_2|, a = |S_1 ∩ S_2|. Apply a random permutation π on S_1 and S_2, π : Ω → Ω. Define the minimum values under π to be z_1 and z_2:
z_1 = min(π(S_1)),   z_2 = min(π(S_2)).
Define e_{1,i} = the i-th lowest bit of z_1, and e_{2,i} = the i-th lowest bit of z_2. Theorem 1 derives the main probability formula.

THEOREM 1. Assume D is large. Then
E_b = Pr( Π_{i=1}^{b} 1{e_{1,i} = e_{2,i}} = 1 ) = C_{1,b} + (1 − C_{2,b}) R,   (4)
where r_1 = f_1/D, r_2 = f_2/D, and
C_{1,b} = A_{1,b} r_2/(r_1 + r_2) + A_{2,b} r_1/(r_1 + r_2),   (5)
C_{2,b} = A_{1,b} r_1/(r_1 + r_2) + A_{2,b} r_2/(r_1 + r_2),   (6)
A_{1,b} = r_1 [1 − r_1]^{2^b − 1} / (1 − [1 − r_1]^{2^b}),   A_{2,b} = r_2 [1 − r_2]^{2^b − 1} / (1 − [1 − r_2]^{2^b}).   (7)
For a fixed r_j (j ∈ {1, 2}), A_{j,b} is a monotonically decreasing function of b = 1, 2, 3, .... For a fixed b, A_{j,b} is a monotonically decreasing function of r_j ∈ [0, 1], reaching its limit
lim_{r_j → 0} A_{j,b} = 1/2^b.   (8)
Proof: See Appendix A.

Theorem 1 says that, for a given b, the desired probability (4) is determined by R and the ratios r_1 = f_1/D and r_2 = f_2/D. The only assumption needed in the proof of Theorem 1 is that D should be large, which is always satisfied in practice. A_{j,b} (j ∈ {1, 2}) is a decreasing function of r_j with A_{j,b} <= 1/2^b. As b increases, A_{j,b} converges to zero very quickly. In fact, when b >= 32, one can essentially view A_{j,b} = 0.

2.1 An Intuitive (Heuristic) Explanation
A simple heuristic argument may provide a more intuitive explanation of Theorem 1. Consider b = 1. One might expect that
Pr(e_{1,1} = e_{2,1}) = Pr(e_{1,1} = e_{2,1} | z_1 = z_2) Pr(z_1 = z_2) + Pr(e_{1,1} = e_{2,1} | z_1 ≠ z_2) Pr(z_1 ≠ z_2)
≈ 1 × R + (1/2) × (1 − R) = (1 + R)/2,
because when z_1 and z_2 are not equal, the chance that their last bits are equal may be approximately 1/2. This heuristic argument is actually consistent with Theorem 1 when r_1, r_2 → 0. According to (8), as r_1, r_2 → 0, we have A_{1,1}, A_{2,1} → 1/2 and hence C_{1,1}, C_{2,1} → 1/2 also; the probability (4) then approaches (1 + R)/2. In practice, when a very accurate estimate is not necessary, one might actually use this approximate formula to simplify the estimator. The errors, however, could be quite noticeable when r_1, r_2 are not negligible.

2.2 The Unbiased Estimator
Theorem 1 suggests an unbiased estimator R̂_b for R:
R̂_b = (Ê_b − C_{1,b}) / (1 − C_{2,b}),   (9)
Ê_b = (1/k) Σ_{j=1}^{k} 1{ Π_{i=1}^{b} 1{e_{1,i,π_j} = e_{2,i,π_j}} = 1 },   (10)
where e_{1,i,π_j} (respectively e_{2,i,π_j}) denotes the i-th lowest bit of z_1 (respectively z_2) under the permutation π_j.
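To illustrate the estimation step, the short Python sketch below (our own naming; it assumes the signatures were produced as in the sketch following Algorithm 1) evaluates A_{j,b}, C_{1,b}, C_{2,b} from Theorem 1 and applies the unbiased estimator (9):

```python
def A(r, b):
    """A_{j,b} in Theorem 1, as a function of r_j and b."""
    return r * (1.0 - r) ** (2 ** b - 1) / (1.0 - (1.0 - r) ** (2 ** b))

def estimate_resemblance(sig1, sig2, r1, r2, b):
    """Unbiased b-bit estimator: R_hat_b = (E_hat_b - C1b) / (1 - C2b)."""
    C1 = A(r1, b) * r2 / (r1 + r2) + A(r2, b) * r1 / (r1 + r2)
    C2 = A(r1, b) * r1 / (r1 + r2) + A(r2, b) * r2 / (r1 + r2)
    # E_hat_b in (10): fraction of permutations whose lowest b bits agree
    E_hat = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
    return (E_hat - C1) / (1.0 - C2)
```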

From the property of the binomial distribution,
Var(R̂_b) = Var(Ê_b) / [1 − C_{2,b}]^2 = (1/k) E_b (1 − E_b) / [1 − C_{2,b}]^2 = (1/k) [C_{1,b} + (1 − C_{2,b})R][1 − C_{1,b} − (1 − C_{2,b})R] / [1 − C_{2,b}]^2.   (11)
For large b, Var(R̂_b) converges to the variance of R̂_M, the estimator for the original minwise hashing:
lim_{b → ∞} Var(R̂_b) = R(1 − R)/k = Var(R̂_M).
In fact, when b >= 32, Var(R̂_b) and Var(R̂_M) are numerically indistinguishable for practical purposes.

2.3 The Variance-Space Trade-off
As we decrease b, the space needed for storing each sample will be smaller; the estimation variance at the same sample size k, however, will increase. This variance-space trade-off can be precisely quantified by the storage factor B(b; R, r_1, r_2):
B(b; R, r_1, r_2) = b × k × Var(R̂_b) = b [C_{1,b} + (1 − C_{2,b})R][1 − C_{1,b} − (1 − C_{2,b})R] / [1 − C_{2,b}]^2.   (12)
Lower B(b) is better. The ratio B(b_1; R, r_1, r_2) / B(b_2; R, r_1, r_2) measures the improvement of using b = b_2 (e.g., b_2 = 1) over using b = b_1 (e.g., b_1 = 64). Some algebra yields the following theorem.

THEOREM 2.
(1) If r_1 = r_2 and b_1 > b_2, then
B(b_1; R, r_1, r_2) / B(b_2; R, r_1, r_2) = (b_1/b_2) × [(1 − A_{1,b_2}) / (1 − A_{1,b_1})] × [A_{1,b_1}(1 − R) + R] / [A_{1,b_2}(1 − R) + R],
which is a monotonically increasing function of R ∈ [0, 1].
(2) If R → 1 (which implies r_1 = r_2), then A_{1,b} = A_{2,b} and
B(b_1; R, r_1, r_2) / B(b_2; R, r_1, r_2) → (b_1/b_2) × (1 − A_{1,b_2}) / (1 − A_{1,b_1}).
(3) If r_1 = r_2 = r, b_2 = 1, and b_1 >= 32 (hence we treat A_{1,b_1} = 0), then
B(b_1; R, r_1, r_2) / B(1; R, r_1, r_2) = b_1 R / (1 − r + R).   (15)
Proof: We omit the proof due to its simplicity.

Suppose the original minwise hashing used 64 bits to store each sample; then the maximum improvement of b-bit minwise hashing would be 64-fold, attained when r_1 = r_2 = 1 and R = 1, according to (15). In the least favorable situation, i.e., r_1, r_2 → 0, the improvement will still be 64/3 = 21.3-fold when R = 0.5.

Fig. 1 plots B(64)/B(b) to directly visualize the relative improvements, which are consistent with what Theorem 2 predicts. The plots show that, when R is very large (which is the case in many practical applications), it is always good to use b = 1. However, when R is small, using a larger b may be better. The cut-off point depends on r_1, r_2, and R. For example, when r_1 = r_2 and both are small, it would be better to use b = 2 than b = 1 if R < 0.4, as shown in Fig. 1.

Figure 1: B(64)/B(b), the relative storage improvement of using b = 1, 2, 3, 4 bits, compared to using 64 bits. B(b) is defined in (12).

3. EXPERIMENTS
Experiment 1 is a sanity check, to verify: (A) our proposed estimator R̂_b in (9) is indeed unbiased, and (B) its variance follows the prediction by our formula (11). Experiment 2 is a duplicate detection task using a Microsoft proprietary collection of 1,000,000 news articles. Experiment 3 is another duplicate detection task using 300,000 UCI NYTimes news articles.

3.1 Experiment 1
The data, extracted from Microsoft Web crawls, consists of 10 pairs of sets (i.e., 20 total words). Each set consists of the document IDs which contain the word at least once. Thus, this experiment is for estimating word associations.

Table 1: Ten pairs of words used in Experiment 1, together with r_1, r_2, R, and the theoretical improvements B(32)/B(1) and B(64)/B(1). For example, KONG and HONG correspond to the two sets of document IDs which contained the word KONG and the word HONG, respectively. Word pairs: KONG-HONG, RIGHTS-RESERVED, OF-AND, GAMBIA-KIRIBATI, UNITED-STATES, SAN-FRANCISCO, CREDIT-CARD, TIME-JOB, LOW-PAY, A-TEST.

Table 1 summarizes the data and also provides the theoretical improvements B(32)/B(1) and B(64)/B(1). The words were selected to include highly frequent word pairs (e.g., OF-AND), highly rare word pairs (e.g., GAMBIA-KIRIBATI), highly unbalanced pairs (e.g., A-TEST), highly similar pairs (e.g., KONG-HONG), as well as word pairs that are not quite similar (e.g., LOW-PAY).
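The storage factor (12) is easy to evaluate numerically. The following Python snippet (ours, not part of the paper) computes B(b; R, r_1, r_2) and the improvement ratio B(64)/B(b); with r_1 = r_2 close to 0 and R = 0.5 it reproduces the 64/3 ≈ 21.3-fold figure quoted above, and the same routine can be used to produce improvement columns such as B(32)/B(1) and B(64)/B(1) in Table 1.

```python
def A(r, b):
    # A_{j,b} as in Theorem 1
    return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

def storage_factor(b, R, r1, r2):
    """B(b; R, r1, r2) = b * k * Var(R_hat_b), per (11) and (12)."""
    C1 = A(r1, b) * r2 / (r1 + r2) + A(r2, b) * r1 / (r1 + r2)
    C2 = A(r1, b) * r1 / (r1 + r2) + A(r2, b) * r2 / (r1 + r2)
    E = C1 + (1 - C2) * R
    return b * E * (1 - E) / (1 - C2) ** 2

r = 1e-6                      # nearly the least favorable case r1, r2 -> 0
for b in (1, 2, 4):
    print(b, storage_factor(64, 0.5, r, r) / storage_factor(b, 0.5, r, r))
# b = 1 prints roughly 21.3, matching 64/3 in the text
```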
We estimate the resemblance using the original minwise hashing estimator R̂_M and the b-bit estimator R̂_b (b = 1, 2, 3).

3.1.1 Validating the Unbiasedness
Figure 2 presents the estimation biases for the selected word pairs. Theoretically, both estimators, R̂_M and R̂_b, are unbiased (i.e., the y-axis in Figure 2 should be zero after an infinite number of repetitions). Figure 2 verifies this fact, because the empirical biases are all very small and no systematic biases can be observed.

Figure 2: Empirical biases of R̂_b (b = 1, 2, 3) and R̂_M from repeated simulations at each sample size k, for the word pairs KONG-HONG and A-TEST. M denotes the original minwise hashing.

3.1.2 Validating the Variance Formula
Figure 3 plots the empirical mean square errors (MSE = variance + bias^2) in solid lines, and the theoretical variances (11) in dashed lines, for 6 word pairs (instead of all 10 pairs, due to the space limit). All dashed lines are invisible because they overlap with the corresponding solid curves. Thus, this experiment validates that the variance formula (11) is accurate and that R̂_b is indeed unbiased (otherwise, the MSE would differ from the variance).

Figure 3: Mean square errors (MSEs) for KONG-HONG, RIGHTS-RESERVED, OF-AND, GAMBIA-KIRIBATI, LOW-PAY, and A-TEST. M denotes the original minwise hashing; "Theor." denotes the theoretical variances Var(R̂_b) in (11) and Var(R̂_M) in (3). The dashed curves are invisible because the empirical MSEs overlap the theoretical variances. At the same k, Var(R̂_1) > Var(R̂_2) > Var(R̂_3) > Var(R̂_M); however, R̂_1 (R̂_2) only requires 1 bit (2 bits) per sample, while R̂_M requires 32 or 64 bits.

3.2 Experiment 2: Microsoft News Data
To illustrate the improvements from b-bit minwise hashing on a real-life application, we conducted a duplicate detection experiment using a corpus of 10^6 news documents. The dataset was crawled as part of the BLEWS project at Microsoft [5]. We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R_0. We estimate the resemblances using R̂_b with b = 1, 2, 4 bits, and the original minwise hashing using 32 bits. Figure 4 presents the precision and recall curves. The recall values (bottom two panels in Figure 4) are all very high and do not differentiate the estimators.

Figure 4: Microsoft collection of news data, for thresholds R_0 ranging from 0.3 to 0.8. The task is to retrieve news article pairs with resemblance R >= R_0. The recall curves (bottom two panels) indicate that all estimators are equally good in recall.

The precision curves are more interesting for differentiating the estimators. For example, when R_0 = 0.4 (top right panel), in order to achieve a precision of 0.8, the estimators R̂_M, R̂_4, R̂_2, and R̂_1 require clearly different sample sizes k, indicating that R̂_4, R̂_2, and R̂_1 improve on R̂_M (32 bits per sample) roughly 8-fold or more in storage. The precision curves for R̂_4 (using 4 bits per sample) and R̂_M (using 32 bits per sample) are almost indistinguishable, suggesting an 8-fold improvement in space by using b = 4. When using b = 1 or b = 2, the space improvements are normally around 10-fold or more compared to R̂_M, especially for achieving high precision (e.g., >= 0.9). This experiment again confirms the significant improvement of b-bit minwise hashing using b = 1 or b = 2. Table 2 summarizes the relative improvements. In this experiment, R̂_M used only 32 bits per sample.
For even larger applications, however, 64 bits per sample may be necessary []; the improvements from R̂_b will then be even more significant. Note that in the context of Web document duplicate detection, in addition to shingling, a number of specialized hash-signatures have been proposed, which leverage properties of natural-language text, such as the placement of stopwords [3].

However, our approach is not aimed at any specific type of data, but is a general, domain-independent technique. Also, to the extent that other approaches rely on minwise hashing for signature computation, they may be combined with our techniques.

Table 2: Relative improvement in space of R̂_b (using b bits per sample) over R̂_M (32 bits per sample). For each precision level (e.g., 0.9), we find the required sample sizes from Figure 4 for R̂_M and R̂_b and use them to estimate the required storage in bits. The values in the table are the ratios of the storage costs. The improvements are consistent with the theoretical predictions in Figure 1.

3.3 Experiment 3: UCI NYTimes Data
We conducted another duplicate detection experiment on a public UCI collection of 300,000 NYTimes articles. The purpose is to ensure that our experiment can be repeated by those who cannot access the proprietary data in Experiment 2. Figure 5 presents the precision curves for representative thresholds R_0. The recall curves are not shown because they do not differentiate the estimators, just as in Experiment 2. The curves confirm again that, using b = 1 or 2 bits, R̂_b improves on the original minwise hashing (using 32 bits per sample) by about an order of magnitude. The curves for R̂_b with b = 4 almost always overlap the curves for R̂_M, verifying the expected 8-fold improvement.

Figure 5: UCI collection of NYTimes data. The task is to retrieve news article pairs with resemblance R >= R_0.

4. COMPARISONS WITH THE HAMMING DISTANCE LSH ALGORITHM
The Hamming distance LSH algorithm proposed in [] is an influential 1-bit LSH scheme. In this algorithm, a set S_i is mapped into a D-dimensional binary vector y_i: y_{i,t} = 1 if t ∈ S_i, and y_{i,t} = 0 otherwise. Then k coordinates are randomly sampled from Ω = {0, 1, ..., D − 1}. We denote the samples of y_i by h_i, where h_i = {h_{i,j}, j = 1 to k} is a k-dimensional vector. These samples are used to estimate the Hamming distance H (using S_1, S_2 as an example):
H = Σ_{t=0}^{D−1} 1[y_{1,t} ≠ y_{2,t}] = |S_1 ∪ S_2| − |S_1 ∩ S_2| = f_1 + f_2 − 2a.
Using the samples h_1 and h_2, an unbiased estimator of H is simply
Ĥ = (D/k) Σ_{j=1}^{k} 1[h_{1,j} ≠ h_{2,j}],   (16)
whose variance is
Var(Ĥ) = (D^2/k) [ E(1[h_{1,j} ≠ h_{2,j}]) − E^2(1[h_{1,j} ≠ h_{2,j}]) ] = (D^2/k) (H/D)(1 − H/D).   (17)
The above analysis assumes k ≪ D, which is satisfied in practice; otherwise one should multiply Var(Ĥ) in (17) by (D − k)/(D − 1), the finite-sample correction factor.

It is interesting to compare Ĥ with b-bit minwise hashing. In order to estimate H, we need to convert the resemblance estimator R̂_b (9) into Ĥ_b (note that H = (f_1 + f_2)(1 − R)/(1 + R)):
Ĥ_b = (f_1 + f_2) (1 − R̂_b) / (1 + R̂_b).   (18)
The variance of Ĥ_b can be computed from Var(R̂_b) using the delta method in statistics (note that [d((1 − x)/(1 + x))/dx]^2 = 4/(1 + x)^4):
Var(Ĥ_b) = Var(R̂_b) (f_1 + f_2)^2 × 4/(1 + R)^4 + O(1/k^2) = Var(R̂_b) × 4(r_1 + r_2)^2 D^2/(1 + R)^4 + O(1/k^2).   (19)
Recall r_i = f_i/D. To verify the variances (17) and (19), we conduct experiments using the same data as in Experiment 1. This time, we estimate H instead of R, using both Ĥ (16) and Ĥ_b (18). Figure 6 reports the mean square errors, together with the theoretical variances (17) and (19). We can see that the theoretical variance formulas are accurate. When the data is not dense, the estimator Ĥ_b (18) given by b-bit minwise hashing is much more accurate than the estimator Ĥ (16). However, when the data is dense (e.g., OF-AND), Ĥ can still outperform Ĥ_b.
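For reference, a minimal Python sketch of the Hamming distance LSH estimator Ĥ in (16) (the helper names and the way the shared coordinates are drawn are our own illustration): sample k coordinates from Ω once, store one bit per coordinate per set, and scale the number of mismatches by D/k.

```python
import random

def hamming_lsh_samples(S, coords):
    """1-bit samples of the binary vector y_i at the shared sampled coordinates."""
    members = set(S)
    return [int(t in members) for t in coords]

def estimate_hamming(h1, h2, D):
    """H_hat = (D / k) * #{j : h1[j] != h2[j]}, the estimator in (16)."""
    k = len(h1)
    return D / k * sum(a != b for a, b in zip(h1, h2))

# usage sketch: coords = [random.randrange(D) for _ in range(k)], shared by all sets
```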
We now compare the actual storage needed by Ĥ_b and Ĥ. We define the following two ratios to make fair comparisons:
W(b) = Var(Ĥ) / (b × Var(Ĥ_b)),   G(b) = Var(Ĥ) × (r_1 + r_2)/2 × 64 / (b × Var(Ĥ_b)).   (20)
W(b) and G(b) are defined in the same spirit as the ratio of storage factors introduced in Sec. 2.3. Recall that each sample of b-bit minwise hashing requires b bits (i.e., b × k bits per set). If we assume each sample in the Hamming distance LSH requires 1 bit, then W(b) in (20) is a fair indicator, and W(b) > 1 means Ĥ_b outperforms Ĥ. However, as can be verified in Fig. 6 and Fig. 7, when r_1 and r_2 are small (which is usually the case in practice), W(b) tends to be very large, indicating a highly significant improvement of b-bit minwise hashing over the Hamming distance LSH algorithm in [].

Figure 6: MSEs (normalized by H^2) for comparing Ĥ (16) with Ĥ_b (18), on KONG-HONG, OF-AND, UNITED-STATES, and LOW-PAY. In each panel, three solid curves stand for Ĥ (labeled H), Ĥ_1 (labeled b=1), and Ĥ_2 (labeled b=2), respectively. The dashed lines are the corresponding theoretical variances (17) and (19), which are largely overlapped by the solid lines. When the sample size k is not large, the empirical MSEs of Ĥ_b deviate from the theoretical variances, due to the bias caused by the nonlinear transformation from R̂_b to Ĥ_b in (18).

We consider that, in practice, one will most likely implement the Hamming distance LSH algorithm by only storing the non-zero locations. In other words, for set S_i, only r_i × k locations need to be stored (each is assumed to use 64 bits). Thus, the total storage will on average be (r_1 + r_2)/2 × 64k bits per set. In fact, we have the following theorem for G(b) when r_1, r_2 → 0.

THEOREM 3. Consider r_1, r_2 → 0, and G(b) as defined in (20).
(a) If R → 0, then G(b) → 8(2^b − 1)/b.
(b) If R → 1, then G(b) → 64(2^b − 1)/(2^b b).
Proof: We omit the proof due to its simplicity.

Figure 7 plots W(1) and G(1) for r_1 = r_2 = 10^{-6}, 10^{-4}, 0.01, 0.1 (values which are probably reasonable in practice), as well as r_1 = r_2 = 0.9 as a sanity check. Note that not all combinations of (r_1, r_2, R) are possible. For example, when r_1 = r_2 = 1, then R has to be 1.

Figure 7: W(1) and G(1) as defined in (20), for r_1 = r_2 = 10^{-6}, 10^{-4}, 0.01, 0.1, and 0.9. Note that not all combinations of (r_1, r_2, R) are possible. The plot for G(1) also verifies the theoretical limits proved in Theorem 3.

Figure 7 confirms our theoretical results. W(1) will be extremely large when r_1, r_2 are small. However, when r is very large (e.g., 0.9), it is possible that W(1) < 1, meaning that the Hamming distance LSH could still outperform b-bit minwise hashing in dense data. Even when only the non-zero locations are stored, Figure 7 illustrates that b-bit minwise hashing will outperform the Hamming distance LSH algorithm, usually by a factor of roughly 8 (for small R) to 32 (for large R) when r_1 ≈ r_2.

5. DISCUSSIONS
5.1 Computational Overhead
The previous results establish the significant reduction in storage requirements possible using b-bit minwise hashing. This section demonstrates that these also translate into significant improvements in computational overhead in the estimation phase. The computational cost in the preprocessing phase, however, will increase.

5.1.1 Preprocessing Phase
In the preprocessing phase, we need to generate k minwise hashing functions and apply them to all the sets for creating fingerprints. This phase is actually fairly fast [4] and is usually done off-line, incurring a one-time cost. Also, sets can be individually processed, meaning that this step is easy to parallelize. The computation required for b-bit minwise hashes differs from the computation of traditional minwise hashes in two respects: (A) we require a larger number of smaller-sized samples, in turn requiring more hashing, and (B) the packing of b-bit samples into 64-bit or 32-bit words requires additional bit-manipulation. It turns out the overhead for (B) is small and the overall computation time scales nearly linearly with k; see Fig. 8. As we have analyzed, b-bit minwise hashing only requires increasing k by a small factor such as 3. Therefore, we consider the overhead in the preprocessing stage not to be a major issue.
Also, it is important to note that b-bit minwise hashing provides the flexibility of trading storage for preprocessing time by using b > 1.

Figure 8: Running time in the preprocessing phase on the BLEWS news articles, as a function of the sample size k (number of hashings). Three hashing functions were used: 2-universal hashing (labeled 2-U), 4-universal hashing (labeled 4-U), and full permutations (labeled FP). Experiments with 1-bit hashing are reported in dashed lines, which are only slightly higher (due to additional bit-packing) than the corresponding solid lines (the original minwise hashing using 32 bits).

The experiment in Fig. 8 was conducted on articles from the BLEWS project [5]. We considered 3 hashing functions: first, 2-universal hash functions computed using the fast universal hashing scheme described in []; second, 4-universal hash functions computed using the CW-trick algorithm of [3]; and finally full random permutations computed using the Fisher-Yates shuffle [3].

5.1.2 Estimation Phase
We have shown earlier that, when R >= 0.5 and b = 1, we expect a storage reduction of at least a factor of 21.3, compared to using 64 bits.

In the following, we analyze how this impacts the computational overhead of the estimation. Here, the key operation is the computation of the number of identical b-bit samples. While standard hash signatures that are multiples of 16 bits can easily be compared using a single machine instruction, efficiently computing the overlap between b-bit samples for small b is less straightforward. In the following, we describe techniques for computing the number of identical b-bit samples when these are stored in a compact manner, meaning that the individual b-bit samples e_{1,i,j} and e_{2,i,j}, i = 1, ..., b, j = 1, ..., k, are packed into arrays A_1[0, ..., ⌈k·b/w⌉ − 1] and A_2[0, ..., ⌈k·b/w⌉ − 1] of w-bit words.

To compute the number of identical b-bit samples, we iterate through the arrays; for each offset h, we first compute v = A_1[h] XOR A_2[h]. A bit of v is set if and only if the corresponding bits in A_1[h] and A_2[h] are different. Hence, to compute the number of overlapping b-bit samples encoded in A_1[h] and A_2[h], we need to compute the number of b-bit blocks (ending at offsets divisible by b) that contain only 0s. The case b = 1 corresponds to the problem of counting the number of 0-bits in a word. We tested different methods suggested in [34] and found the fastest approach to be pre-computing an array bits[0, ..., 2^16 − 1], such that bits[t] corresponds to the number of such all-zero blocks in the binary representation of t (for b = 1, simply the number of 0-bits). Then we can compute the number of matching samples encoded in v (in the case w = 32) as
c = bits[v & 0xFFFF] + bits[(v >> 16) & 0xFFFF].
Interestingly, we can use the same method for the cases where b > 1, as we only need to modify the values stored in bits, setting bits[t] to the number of b-bit blocks that contain only 0-bits in the binary representation of t.

We evaluated this approach using a loop computing the number of identical samples in two signatures covering a total of 1.8 billion 32-bit words, on a 64-bit Intel processor. Here, the 1-bit hashing requires 1.67x the time that the 32-bit minwise hashing requires; the results were essentially identical for b = 2. Combined with the reduction in overall storage for a given accuracy level, this means a significant speed improvement in the estimation phase: suppose that in the original minwise hashing each sample is stored using 64 bits. If we use 1-bit minwise hashing and consider R > 0.5, our previous analysis has shown that we gain a storage reduction of at least a factor of 64/3 = 21.3. The improvement in computational efficiency would then be 21.3/1.67 ≈ 12.8-fold, which is still significant.
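The table-lookup scheme just described can be sketched in a few lines of Python (a sketch under stated assumptions: w = 32, b divides 16, and the two signatures are already packed into lists of 32-bit words; the names are ours). In practice the last word should be padded so that unused positions do not count as matches.

```python
def build_table(b):
    """table[t] = number of all-zero b-bit blocks in the 16-bit value t.

    A b-bit block of the XOR word is all zeros exactly when the two
    b-bit samples stored at that position are identical.
    """
    mask = (1 << b) - 1
    blocks = 16 // b
    table = []
    for t in range(1 << 16):
        table.append(sum(((t >> (b * i)) & mask) == 0 for i in range(blocks)))
    return table

def count_matches(A1, A2, table):
    """Number of identical b-bit samples in two packed 32-bit-word arrays."""
    c = 0
    for w1, w2 in zip(A1, A2):
        v = w1 ^ w2                      # a bit of v is set iff the two bits differ
        c += table[v & 0xFFFF] + table[(v >> 16) & 0xFFFF]
    return c
```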
5.2 Reducing Storage Overhead for r_1 and r_2
The unbiased estimator R̂_b (9) requires knowing r_1 = f_1/D and r_2 = f_2/D. The storage cost could be a concern if r_1 (r_2) had to be represented with high accuracy (e.g., 64 bits). This section illustrates that we only need to quantize r_1 and r_2 into 2^Q levels, where Q = 4 is probably good enough and Q = 8 is more than sufficient. In other words, for each set, we only need to increase the total storage by 4 bits (or 8 bits), which is negligible.

For simplicity, we carry out the analysis for b = 1 and r_1 = r_2 = r. In this case, A_{1,1} = A_{2,1} = C_{1,1} = C_{2,1} = (1 − r)/(2 − r), and the correct estimator, denoted by R̂_{1,r}, would be
R̂_{1,r} = (2 − r) Ê_1 − (1 − r),
Bias(R̂_{1,r}) = E(R̂_{1,r}) − R = 0,   Var(R̂_{1,r}) = (1 − r + R)(1 − R)/k.
(See the definition of Ê_1 in (10).) Now, suppose we only store an approximate value of r, denoted by r̃. The corresponding approximate estimator is denoted by R̂_{1,r̃}:
R̂_{1,r̃} = (2 − r̃) Ê_1 − (1 − r̃),
Bias(R̂_{1,r̃}) = E(R̂_{1,r̃}) − R = (r − r̃)(1 − R)/(2 − r),   Var(R̂_{1,r̃}) = [(2 − r̃)^2/(2 − r)^2] (1 − r + R)(1 − R)/k.
Thus, the absolute bias is upper bounded by |r − r̃|, attained in the worst case, i.e., R = 0 and r → 1.

Using Q = 4 (i.e., 2^4 = 16 quantization levels), the bias is bounded by 1/16 = 0.0625. In a reasonable situation, e.g., R >= 0.5, the bias will be much smaller than 0.0625. Of course, if we increase the quantization to Q = 8 bits, the bias (< 1/256 = 0.0039) will be negligible, even in the worst case. Similarly, by examining the difference of the variances,
Var(R̂_{1,r̃}) − Var(R̂_{1,r}) = [(2 − r̃)^2 − (2 − r)^2]/(2 − r)^2 × (1 − r + R)(1 − R)/k,
we can see that Q = 8 would be more than sufficient.

5.3 Combining Bits for Enhancing Performance
Our theoretical and empirical results have confirmed that, when the resemblance R is reasonably high, each bit per sample may contain strong information for estimating the similarity. This naturally leads to the conjecture that, when R is close to 1, one might further improve the performance by looking at a combination of multiple bits (i.e., "b < 1"). One simple approach is to combine two bits from two permutations using an XOR operation.

Recall that e_{1,1,π} denotes the lowest bit of the hashed value under π. Theorem 1 has proved that
E_1 = Pr(e_{1,1,π} = e_{2,1,π}) = C_{1,1} + (1 − C_{2,1}) R.
Consider two permutations π_1 and π_2. We store
x_1 = e_{1,1,π_1} XOR e_{1,1,π_2},   x_2 = e_{2,1,π_1} XOR e_{2,1,π_2}.
Then x_1 = x_2 either when e_{1,1,π_1} = e_{2,1,π_1} and e_{1,1,π_2} = e_{2,1,π_2}, or when e_{1,1,π_1} ≠ e_{2,1,π_1} and e_{1,1,π_2} ≠ e_{2,1,π_2}. Thus
T = Pr(x_1 = x_2) = E_1^2 + (1 − E_1)^2,   (23)
which is a quadratic equation with solution
R = (1/2 + sqrt(2T − 1)/2 − C_{1,1}) / (1 − C_{2,1}).   (24)
We can estimate T without bias, as a binomial. The resulting estimator for R will be biased at small sample sizes, due to the nonlinearity. We recommend the following estimator:
R̂_{1/2} = (1/2 + sqrt(max{2T̂ − 1, 0})/2 − C_{1,1}) / (1 − C_{2,1}).   (25)
The truncation max{., 0} will introduce further bias, but it is necessary and is usually a good bias-variance trade-off. We use R̂_{1/2} to indicate that two bits are combined into one. The asymptotic variance of R̂_{1/2} can be derived using the delta method:
Var(R̂_{1/2}) = (1/k) T(1 − T) / [4 (1 − C_{2,1})^2 (2T − 1)] + O(1/k^2).   (26)
Note that each sample is still stored using 1 bit, despite the fact that we use "1/2" to denote this estimator.
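A small Python sketch of the combined-bits estimator (25), for the simplified case r_1 = r_2 = r so that C_{1,1} = C_{2,1} = (1 − r)/(2 − r) as in Sec. 5.2; the general case would plug in C_{1,1} and C_{2,1} from Theorem 1, and the names below are ours.

```python
import math

def estimate_r_half(bits1, bits2, r):
    """R_hat_{1/2} from (25): XOR pairs of 1-bit samples, estimate T, invert."""
    C = (1.0 - r) / (2.0 - r)              # C_{1,1} = C_{2,1} when r1 = r2 = r
    # combine samples from permutations (2j, 2j+1) into one XOR-ed bit
    x1 = [a ^ b for a, b in zip(bits1[0::2], bits1[1::2])]
    x2 = [a ^ b for a, b in zip(bits2[0::2], bits2[1::2])]
    T_hat = sum(u == v for u, v in zip(x1, x2)) / len(x1)
    E_hat = 0.5 + 0.5 * math.sqrt(max(2.0 * T_hat - 1.0, 0.0))  # truncation as in (25)
    return (E_hat - C) / (1.0 - C)
```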

Interestingly, as R → 1, R̂_{1/2} does twice as well as R̂_1:
lim_{R → 1} Var(R̂_1)/Var(R̂_{1/2}) = 2.   (27)
(Recall that if R = 1, then r_1 = r_2, C_{1,1} = C_{2,1}, and E_1 = C_{1,1} + (1 − C_{2,1}) = 1.) On the other hand, R̂_{1/2} may not be good when R is not too large. For example, one can numerically show that Var(R̂_1) < Var(R̂_{1/2}) if R < 0.5774 and r_1, r_2 → 0.

Figure 9 plots the empirical MSEs for four word pairs in Experiment 1, for R̂_{1/2}, R̂_1, and R̂_2. For the highly similar pair, KONG-HONG, R̂_{1/2} exhibits superior performance compared to R̂_1. For the fairly similar pair, OF-AND, R̂_{1/2} is still considerably better. For UNITED-STATES, whose R = 0.59, R̂_{1/2} performs similarly to R̂_1. For LOW-PAY, whose R is much smaller, the theoretical variance of R̂_{1/2} is very large. However, owing to the truncation in (25) (i.e., the variance-bias trade-off), the empirical performance of R̂_{1/2} is not too bad.

Figure 9: MSEs for comparing R̂_{1/2} (25) with R̂_1 and R̂_2, on KONG-HONG, OF-AND, UNITED-STATES, and LOW-PAY. Due to the bias of R̂_{1/2}, the theoretical variances Var(R̂_{1/2}) in (26) deviate from the empirical MSEs when k is small.

In summary, for applications which care about very high similarities, combining bits can reduce storage even further.

6. CONCLUSION
The minwise hashing technique has been widely used as a standard duplicate detection approach in the context of information retrieval, for efficiently computing set similarity in massive data sets. Prior studies commonly used 64 bits to store each hashed value. This study proposes b-bit minwise hashing, which stores only the lowest b bits of each hashed value. We theoretically prove that, when the similarity is reasonably high (e.g., resemblance >= 0.5), using b = 1 bit per hashed value can, even in the worst case, gain a 21.3-fold improvement in storage space, compared to storing each hashed value using 64 bits. We also discussed the idea of combining 2 bits from different hashed values, to further enhance the improvement when the target similarity is very high. Our proposed method is simple and requires only minimal modification to the original minwise hashing algorithm. We expect our method will be adopted in practice.

7. REFERENCES
[1] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 2008.
[2] Michael Bendersky and W. Bruce Croft. Finding text reuse on the web. In WSDM, 2009.
[3] Sergey Brin, James Davis, and Hector Garcia-Molina. Copy detection mechanisms for digital documents. In SIGMOD, 1995.
[4] Andrei Z. Broder. On the resemblance and containment of documents. In Sequences, 1997.
[5] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 2000.
[6] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In WWW, 1997.
[7] Gregory Buehrer and Kumar Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, 2008.
[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[9] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In KDD, 2009.
[10] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. Journal of Algorithms, 1997.
[11] Yon Dourisboure, Filippo Geraci, and Marco Pellegrini. Extraction and classification of dense implicit communities in the web graph. ACM Trans. Web, 2009.
[12] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of web pages. In WWW, 2003.
[13] R. A. Fisher and F. Yates. Statistical Tables for Biological, Agricultural and Medical Research. Oliver & Boyd.
[14] George Forman, Kave Eshghi, and Jaap Suermondt. Efficient detection of large-scale redundancy in enterprise file systems. SIGOPS Oper. Syst. Rev., 2009.
[15] Michael Gamon, Sumit Basu, Dmitriy Belenko, Danyel Fisher, Matthew Hurst, and Arnd Christian König. BLEWS: Using blogs to provide context for news articles. In AAAI, 2008.
[16] Aristides Gionis, Dimitrios Gunopulos, and Nick Koudas. Efficient and tunable similar set retrieval. In SIGMOD, 2001.
[17] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In WWW, 2009.
[18] Monika R. Henzinger. Algorithmic challenges in web search engines. Internet Mathematics, 2004.
[19] Piotr Indyk. A small approximately min-wise independent family of hash functions. Journal of Algorithms, 2001.
[20] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
[21] Toshiya Itoh, Yoshinori Takei, and Jun Tarui. On the sample size of k-restricted min-wise independent permutations and other k-wise distributions. In STOC, 2003.
[22] P. Li and K. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics, 2007. Preliminary results appeared in HLT/EMNLP 2005.
[23] P. Li, K. Church, and T. Hastie. One sketch for all: Theory and applications of conditional random sampling. In NIPS, 2008.
[24] Nitin Jindal and Bing Liu. Opinion spam and analysis. In WSDM, 2008.
[25] Konstantinos Kalpakis and Shilang Tang. Collaborative data gathering in wireless sensor networks using measurement co-occurrence. Computer Communications, 2008.
[26] Eyal Kaplan, Moni Naor, and Omer Reingold. Derandomized constructions of k-wise (almost) independent permutations. Algorithmica, 2009.
[27] Ludmila Cherkasova, Kave Eshghi, Charles B. Morrey III, Joseph Tucek, and Alistair Veitch. Applying syntactic similarity algorithms for enterprise information management. In KDD, 2009.

9 [8] Gurmeet Singh anu, Arvind Jain, and Anish Das Sarma. Detecting Near-Duplicates for Web-rawling. In WWW, 7. [9] arc Najor, Sreenivas Gollapudi, and Rina Panigrahy. Less is more: sampling the neighborhood graph maes salsa better and faster. In WSD, pages 4 5, 9. [3] Sandeep Pandey, Andrei Broder, Flavio hierichetti, Vanja Josifovsi, Ravi Kumar, and Sergei Vassilvitsii. Nearest-neighbor caching for content-match applications. In WWW, 44 45, 9. [3] artin Theobald, Jonathan Siddharth, and Andreas Paepce. Spotsigs: robust and efficient near duplicate detection in large web collections. In SIGIR, pages , 8. [3] iel Thorup and Yin Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In SODA, 4. [33] Tanguy Urvoy, Emmanuel hauveau, Pascal Filoche, and Thomas Lavergne. Tracing web spam with html style similarities. A Trans. Web, : 8, 8. [34] Henry S. Warren. Hacer s Delight. Addison-Wesley,. APPENDIX A. PROOF OF THEORE onsider two sets, S, S Ω = {,,,..., D }. Denote f = S, f = S, and a = S S. Apply a random permutation π on S and S : π : Ω Ω. Define the minimum values under π to be z and z : z = min π S, z = min π S. Define e,i = ith lowest bit of z, and e,i = ith lowest bit of z. b The tas is to derive Pr i= {e,i = e,i } =, which can be decomposed to be b Pr {e,i = e,i } =, z = z i= b +Pr {e,i = e,i } =, z z i= b =Pr z = z + Pr {e,i = e,i } =, z z i= b =R + Pr {e,i = e,i } =, z z. i= where R = S S S S = Pr z = z is the resemblance. When, the tas boils down to estimating Pr e, = e,, z z = Pr z = i, z = j i=,,4,... j i,j=,,4,... + Pr z = i, z = j. i=,3,5,... j i,j=,3,5,... Therefore, we need the following basic probability formula: We start with Pr z = i, z = j, i j. Pr z = i, z = j, i < j = P + P, where P 3 D D a D f P 3 =, a f a f a D j D j a D i f P =, a f a f a D j D j a D i f P =. a f a f a The expressions for P, P, and P 3 can be understood by the experiment of randomly throwing f +f a balls into D locations, labeled,,,..., D. Those f + f a balls belong to three disjoint sets: S S S, S S S, and S S. Without any restriction, the total number of combinations should be P 3. To understand P and P, we need to consider two cases:. The jth element is not in S S : = P. We first allocate the a = S S overlapping elements randomly in [j +, D ], resulting in D j a combinations. Then we allocate the remaining f a elements in S also randomly in the unoccupied locations in [j +, D ], resulting in D j a f a combinations. Finally, we allocate the remaining elements in S randomly in the unoccupied locations in [i +, D ], which has D i f f a combinations.. The jth element is in S S : = P. After conducing expansions and cancelations, we obtain Pr z = i, z = j, i < j = P + P P 3 a + D j!d i f! f a a!f a!f a!d j f!d i f f +a! = D! a!f a!f a!d f f +a! = f f ad j!d f i!d f f + a! D!D f j!d f f + a i! = f f a j i t= D f i t i t= D f f + a t j t= D t = f j i f a D f i t D D D t t= i t= D f f + a t D + i j t For convenience, we introduce the following notation: r = f D, r = f D, s = a D. Also, we assume D is large which is always satisfied in practice. Thus, we can obtain a reasonable approximation: Pr z = i, z = j, i < j =r r s [ r ] j i [ r + r s] i Similarly, we obtain, for large D, Pr z = i, z = j, i > j =r r s [ r ] i j [ r + r s] j Now we have the tool to calculate the probability Pr e, = e,, z z = Pr z = i, z = j i=,,4,... j i,j=,,4,... + Pr z = i, z = j i=,3,5,... j i,j=,3,5,... For example, again, assuming D is large Pr z =, z =, 4, 6,... 
=r r s [ r ] + [ r ] 3 + [ r ] r =r r s [ r ]

10 Pr z =, z = 3, 5, 7,... = r r s[ r + r s] [ r ] + [ r ] 3 + [ r ] r =r r s[ r + r s] [ r. ] Therefore, i=,,4,... + i=,3,5,... { { i<j,j=,,4,... i<j,j=,3,5,... Pr z = i, z = j Pr z = i, z = j r =r r s [ r ] + [ r + r s] + [ r + r s] +... r =r r s [ r ] r + r s. By symmetry, we now { j=,,4,... + j=,3,5,... { i>j,i=,,4,... i>j,i=,3,5,... } } Pr z = i, z = j Pr z = i, z = j r =r r s [ r ] r + r s. ombining the probabilities, we obtain Pr e, = e,, z z r r r s r r r s = + [ r ] r + r s [ r ] r + r s r s =A, r + r s + r s A, r + r s, where A,b = b r [ r] [ r ], A r [ r] b b,b = [ r ]. b Therefore, we can obtain the desired probability, for, b= Pr {e,i = e,i } = where i= r s =R + A, r + r s + r s A, r + r s f a =R + A, f + f a + f a A, f + f a f R =R + A f +R + f, f + f R +R f + f + f a A, f + f a =R + A, f Rf f + f + A, f Rf f + f =, +, R,b = A,b r r + r + A,b r r + r,b = A,b r r + r + A,b r r + r. To this end, we have proved the main result for. } } Next, we consider b >. Due to the space limit, we only provide a setch of the proof. When, we need Pr e, = e,, e, = e,, z z = Pr z = i, z = j i=,4,8,... j i,j=,4,8,... + Pr z = i, z = j i=,5,9,... j i,j=,5,9,... + Pr z = i, z = j i=,6,,... j i,j=,6,,... + Pr z = i, z = j i=3,7,,... j i,j=3,7,,... We again use the basic probability formula Pr z = i, z = j, i < j and the sum of different geometric series, for example, [ r ] 3 + [ r ] 7 + [ r ] +... = [ r ] [ r. ] Similarly, for general b, we will need [ r ] b + [ r ] b + [ r ] 3 b +... = [ r ] b [ r. ] b After more algebra, we prove the general case: b Pr {e,i = e,i } = i= r s =R + A,b r + r s + A r s,b r + r s =,b +,b R, It remains to show some useful properties of A,b same for A,b. The first derivative of A,b with respect to b is A r [ r ] b log r log [ r ] b,b = b [ r ] b [ r ] b log r log r [ r ] b [ r ] b Note that log r Thus, A,b is a monotonically decreasing function of b. Also, lim A [ r ] b r b [ r ] b,b = lim = r r b [ r ] b, b b A,b [ r] r b [ r ] b = r [ r ] b b [ r ] b r [ r ] b [ r ] b = [ r ] b [ r] b b r [ r ] b. Note that x c cx, for c and x. Therefore A,b is a monotonically decreasing function of r.

11 b-bit inwise Hashing for Estimating Three-Way Similarities Ping Li Dept. of Statistical Science ornell University Arnd hristian König icrosoft Research icrosoft orporation Wenhao Gui Dept. of Statistical Science ornell University Abstract omputing two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance Jaccard similarity using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits where b for 3-way. The extension to 3-way similarity from the prior wor on -way similarity is technically non-trivial. We develop the precise estimator which is accurate and very complicated; and we recommend a much simplified estimator suitable for sparse data. Our analysis shows that b-bit minwise hashing can normally achieve a to 5-fold improvement in the storage space required for a given estimator accuracy of the 3-way resemblance. Introduction The efficient computation of the similarity or overlap between sets is a central operation in a variety of applications, such as word associations e.g., [3], data cleaning e.g., [4, 9], data mining e.g., [4], selectivity estimation e.g., [3] or duplicate document detection [3, 4]. In machine learning applications, binary / vectors can be naturally viewed as sets. For scenarios where the underlying data size is sufficiently large to mae storing them in main memory or processing them in their entirety impractical, probabilistic techniques have been proposed for this tas. Word associations collocations, co-occurrences If one inputs a query NIPS machine learning, all major search engines will report the number of pagehits e.g., one reports 89,3, in addition to the top raned URLs. Although no search engines have revealed how they estimate the numbers of pagehits, one natural approach is to treat this as a set intersection estimation problem. Each word can be represented as a set of document IDs; and each set belongs to a very large space Ω. It is expected that Ω >. Word associations have many other applications in omputational Linguistics [3, 38], and were recently used for Web search query reformulation and query suggestions [4, ]. Here is another example. ommercial search engines display various form of vertical content e.g., images, news, products as part of Web search. In order to determine from which vertical to display information, there exist various techniques to select verticals. Some of these e.g., [9, 5] use the number of documents the words in a search query occur in for different text corpora representing various verticals as features. Because this selection is invoed for all search queries and the tight latency bounds for search, the computation of these features has to be very fast. oreover, the accuracy of vertical selection depends on the number/size of document corpora that can be processed within the allotted time [9], i.e., the processing speed can directly impact quality. Now, because of the large number of word-combinations in even medium-sized text corpora e.g., the Wiipedia corpus contains > 7 distinct terms, it is impossible to pre-compute and store the associations for all possible multi-term combinations e.g., > 4 for -way and > for 3-way; instead the techniques described in this paper can be used for fast estimates of the co-occurrences. 
Database query optimization Set intersection is a routine operation in databases, employed for example during the evaluation of conjunctive selection conditions in the presence of single-column indexes. Before conducting intersections, a critical tas is to quicly estimate the sizes of the intermediate results to plan the optimal intersection order [, 8, 5]. For example, consider the tas of intersecting four sets of record identifiers: A B D. Even though the final outcome will be the same, the order of the join operations, e.g., A B D or A B D, can significantly affect the performance, in particular if the intermediate results, e.g., A B, become too large for main memory and need to be spilled to dis. A good query plan aims to minimize This wor is supported by NSF DS-4, ONR YIP-N499 and icrosoft.

12 the total size of intermediate results. Thus, it is highly desirable to have a mechanism which can estimate join sizes very efficiently, especially for the lower-order -way and 3-way intersections, which could potentially result in much larger intermediate results than higher-order intersections. Duplicate Detection in Data leaning: A common tas in data cleaning is the identification of duplicates e.g., duplicate names, organizations, etc. among a set of items. Now, despite the fact that there is considerable evidence e.g., [] that reliable duplicate-detection should be based on local properties of groups of duplicates, most current approaches base their decisions on pairwise similarities between items only. This is in part due to the computational overhead associated with more complex interactions, which our approach may help to overcome. lustering ost clustering techniques are based on pair-wise distances between the items to be clustered. However, there are a number of natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher e.g. [, 43]. Again, our approach may improve the performance in these scenarios if the distance measures can be expressed in the form of set-overlap. Data mining A lot of wor in data mining has focused on efficient candidate pruning in the context of pairwise associations e.g., [4], a number of such pruning techniques leverage minwise hashing to prune pairs of items, but in many contexts e.g., association rules with more than items multi-way associations are relevant; here, pruning based on pairwise interactions may perform much less well than multi-way pruning.. Ultra-high dimensional data are often binary For duplicate detection in the context of Web crawling/search, each document can be represented as a set of w-shingles w contiguous words; w = 5 or 7 in several studies [3, 4, 7]. Normally only the abscence/presence / information is used, as a w-shingle rarely occurs more than once in a page if w 5. The total number of shingles is commonly set to be Ω = 64 ; and thus the set intersection corresponds to computing the inner product in binary data vectors of 64 dimensions. Interestingly, even when the data are not too high-dimensional e.g., only thousands, empirical studies [6, 3, 6] achieved good performance using SV with binary-quantized text or image data.. inwise Hashing and SimHash Two of the most widely adopted approaches for estimating set intersections are minwise hashing [3, 4] and sign -bit random projections also nown as simhash [7, 34], which are both special instances of the general techniques proposed in the context of locality-sensitive hashing [7, 4]. These techniques have been successfully applied to many tass in machine learning, databases, data mining, and information retrieval [8, 36,,, 6, 39, 8, 4, 7, 5,, 37, 7, 4, ]. Limitations of random projections The method of random projections including simhash is limited to estimating pairwise similarities. Random projections convert any data distributions to zero-mean multivariate normals, whose density functions are determined by the covariance matrix which contains only the pairwise information of the original data. This is a serious limitation..3 Prior wor on b-bit inwise Hashing Instead of storing each hashed value using 64 bits as in prior studies, e.g., [7], [35] suggested to store only the lowest b bits. 
[35] demonstrated that using b = 1 reduces the storage space at least by a factor of 21.3 (for a given accuracy, compared to 64 bits), if one is interested in resemblance >= 0.5, the threshold used in prior studies [3, 4]. Moreover, by choosing the number b of bits to be retained, it becomes possible to systematically adjust the degree to which the estimator is tuned towards higher similarities, as well as the amount of hashing (random permutations) required. [35] concerned only the pairwise resemblance. To extend it to the multi-way case, we have to solve new and challenging probability problems. Compared to the pairwise case, our new estimator is significantly different. In fact, as we will show later, estimating 3-way resemblance requires b >= 2.

1.4 Notation

Figure 1: Notation for 2-way and 3-way set intersections.

Efficient Data Reduction and Summarization

Efficient Data Reduction and Summarization Ping Li Efficient Data Reduction and Summarization December 8, 2011 FODAVA Review 1 Efficient Data Reduction and Summarization Ping Li Department of Statistical Science Faculty of omputing and Information

More information

Probabilistic Hashing for Efficient Search and Learning

Probabilistic Hashing for Efficient Search and Learning Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 1 Probabilistic Hashing for Efficient Search and Learning Ping Li Department of Statistical Science Faculty of omputing

More information

One Permutation Hashing for Efficient Search and Learning

One Permutation Hashing for Efficient Search and Learning One Permutation Hashing for Efficient Search and Learning Ping Li Dept. of Statistical Science ornell University Ithaca, NY 43 pingli@cornell.edu Art Owen Dept. of Statistics Stanford University Stanford,

More information

Hashing Algorithms for Large-Scale Learning

Hashing Algorithms for Large-Scale Learning Hashing Algorithms for Large-Scale Learning Ping Li ornell University pingli@cornell.edu Anshumali Shrivastava ornell University anshu@cs.cornell.edu Joshua Moore ornell University jlmo@cs.cornell.edu

More information

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of

More information

Nearest Neighbor Classification Using Bottom-k Sketches

Nearest Neighbor Classification Using Bottom-k Sketches Nearest Neighbor Classification Using Bottom- Setches Søren Dahlgaard, Christian Igel, Miel Thorup Department of Computer Science, University of Copenhagen AT&T Labs Abstract Bottom- setches are an alternative

More information

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment Anshumali Shrivastava and Ping Li Cornell University and Rutgers University WWW 25 Florence, Italy May 2st 25 Will Join

More information

arxiv: v1 [stat.ml] 6 Jun 2011

arxiv: v1 [stat.ml] 6 Jun 2011 Hashing Algorithms for Large-Scale Learning arxiv:6.967v [stat.ml] 6 Jun Ping Li Department of Statistical Science ornell University Ithaca, NY 43 pingli@cornell.edu Joshua Moore Department of omputer

More information

arxiv: v1 [cs.it] 16 Aug 2017

arxiv: v1 [cs.it] 16 Aug 2017 Efficient Compression Technique for Sparse Sets Rameshwar Pratap rameshwar.pratap@gmail.com Ishan Sohony PICT, Pune ishangalbatorix@gmail.com Raghav Kulkarni Chennai Mathematical Institute kulraghav@gmail.com

More information

Part 1: Hashing and Its Many Applications

Part 1: Hashing and Its Many Applications 1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random

More information

arxiv: v1 [stat.ml] 5 Mar 2015

arxiv: v1 [stat.ml] 5 Mar 2015 Min-Max Kernels Ping Li Department of Statistics and Biostatistics Department of omputer Science Piscataway, NJ 84, USA pingli@stat.rutgers.edu arxiv:13.1737v1 [stat.ml] 5 Mar 15 ABSTRAT The min-max ernel

More information

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data CS6931 Database Seminar Lecture 6: Set Operations on Massive Data Set Resemblance and MinWise Hashing/Independent Permutation Basics Consider two sets S1, S2 U = {0, 1, 2,...,D 1} (e.g., D = 2 64 ) f1

More information

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard

More information

Algorithms for Data Science: Lecture on Finding Similar Items

Algorithms for Data Science: Lecture on Finding Similar Items Algorithms for Data Science: Lecture on Finding Similar Items Barna Saha 1 Finding Similar Items Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar

More information

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Optimal Data-Dependent Hashing for Approximate Near Neighbors Optimal Data-Dependent Hashing for Approximate Near Neighbors Alexandr Andoni 1 Ilya Razenshteyn 2 1 Simons Institute 2 MIT, CSAIL April 20, 2015 1 / 30 Nearest Neighbor Search (NNS) Let P be an n-point

More information

V.4 MapReduce: 1. System Architecture, 2. Programming Model, 3. Hadoop (based on MRS Chapter 4 and RU Chapter 2).
A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation. Vu Malbasa and Slobodan Vucetic.
Data Mining Lecture 6: Similarity and Distance; Sketching; Locality Sensitive Hashing.
Probabilistic Near-Duplicate Detection Using Simhash. Sadhan Sood and Dmitri Loguinov, Internet Research Lab, Texas A&M University.
On (ε, k)-min-wise independent permutations. Noga Alon, Toshiya Itoh, Tatsuya Nagatani.
Maximum Likelihood Diffusive Source Localization Based on Binary Observations. Yoav Levinbook and Tan F. Wong, Wireless Information Networking Group, University of Florida.
CSE 417T: Introduction to Machine Learning, Final Review. Henry Chai, 12/4/18.
Prediction of Citations for Academic Papers.
A New Space for Comparing Graphs. Anshumali Shrivastava and Ping Li; Cornell University and Rutgers University. ASONAM 2014.
Linear Spectral Hashing. Zalán Bodó and Lehel Csató, Faculty of Mathematics and Computer Science, Babeş-Bolyai University, Cluj-Napoca, Romania.
Statistical Data Analysis (DS-GA lecture notes, Fall 2016: descriptive statistics).
Efficient Document Clustering via Online Nonnegative Matrix Factorizations. Fei Wang and Chenhao Tan, Cornell University.
Ad Placement Strategies. Case Study 1: Estimating Click Probabilities; Tackling an Unknown Number of Features with Sketching. Machine Learning for Big Data (CSE547/STAT548, University of Washington), Emily Fox, 2014.
Randomized Algorithms. Prof. Tapio Elomaa.
Lecture 14, October 22: LT codes and the Ideal Soliton Distribution. Coding for Digital Communication & Beyond (Fall 2013), Prof. Anant Sahai.
Introduction to Randomized Algorithms III: probabilistic counters. Joaquim Madeira, U. Aveiro, November 2017.
ECE521 week 3 (23/26 January 2017): probabilistic interpretation of linear regression (MLE, MAP), bias-variance trade-off.
Hash-based Indexing: Application, Impact, and Realization Alternatives. Benno Stein and Martin Potthast, Bauhaus University Weimar.
Lecture 4: Probability and Discrete Random Variables. Error Correcting Codes: Combinatorics, Algorithms and Applications (Fall 2007), Lecturer: Atri Rudra.
Composite Quantization for Approximate Nearest Neighbor Search. Jingdong Wang, Microsoft Research. ICML 2014.
Slides credits: Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University.
Dimension Reduction in Kernel Spaces from Locality-Sensitive Hashing. Alexandr Andoni and Piotr Indyk, April 11, 2009.
Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari, University at Buffalo, The State University of New York.
EE381V: Large Scale Learning (Spring 2013), The University of Texas at Austin, Department of Electrical and Computer Engineering. Assignment 1 (Caramanis/Sanghavi).
Bias-Variance Error Bounds for Temporal Difference Updates. Michael Kearns and Satinder Singh, AT&T Labs.
Generative Techniques: Bayes Rule and the Axioms of Probability. James L. Crowley, Intelligent Systems: Reasoning and Recognition, ENSIMAG 2 / MoSIG M1 (2016/2017).
STA 4273H: Statistical Machine Learning. Russ Salakhutdinov, Department of Statistics, University of Toronto.
Algorithm Independent Topics, Lecture 6. Jason Corso, SUNY at Buffalo, February 23, 2009.
Gaussian Mixture Distance for Information Retrieval. X.Q. Li and I. King, Department of Computer Science & Engineering, The Chinese University of Hong Kong.
2.6 Complexity Theory for Map-Reduce: Star Joins.
Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities. Peter M. Aronow and Cyrus Samii. Forthcoming in Survey Methodology.
Lecture 14, October 16, 2014. CS 224: Advanced Algorithms (Fall 2014), Prof. Jelani Nelson.
Introduction: The Perceptron. Haim Sompolinsky, MIT, October 4, 2013.
Asymptotically optimal induced universal graphs. Noga Alon.
Spectral Hashing: Learning to Leverage 80 Million Images. Yair Weiss, Antonio Torralba, Rob Fergus (Hebrew University, MIT, NYU).
CS6375: Machine Learning. Gautam Kunapuli. Decision Trees.
Robust Network Codes for Unicast Connections: A Case Study. Salim Y. El Rouayheb, Alex Sprintson, and Costas Georghiades, Department of Electrical and Computer Engineering, Texas A&M University.
Support Vector Machines. Le Song, Machine Learning I (CSE 6740, Fall 2013).
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering (EE6540 final project). Devin Cornell and Sushruth Sastry, May 2015.
Introduction to Logistic Regression. Guy Lebanon.
Move from Perturbed Scheme to Exponential Weighting Average. Chunyang Xiao.
Theory and Applications of Complex Networks, Class Three: The Beginning of Graph Theory (Eulerian paths). College of the Atlantic.
ECE521 Lecture 7/8: Logistic Regression (single neuron, learning neural networks, multi-class classification).
Communication Engineering, Lecture 41: Pulse Code Modulation (PCM). Prof. Surendra Prasad, Department of Electrical Engineering, Indian Institute of Technology, Delhi.
Chapter 2: Rectangular Systems and Echelon Forms (2.1 Row Echelon Form and Rank).
Introduction to Information Retrieval, IIR 19: Size Estimation and Duplicate Detection. Hinrich Schütze, Institute for Natural Language Processing, Universität Stuttgart, 2008.
Introduction (Chapter 1): supervised learning, i.e., learning input-output mappings from empirical data.
The Hash Function JH. Hongjun Wu, 16 January 2011 (tweaked design; round number changed from 35.5 to 42).
B490 Mining the Big Data: Finding Similar Items. Qin Zhang.
CS246 Final Exam, March 16, 2016, 8:30AM - 11:30AM.
Pattern Recognition and Machine Learning: Learning and Evaluation of Pattern Recognition Processes. James L. Crowley, ENSIMAG 3 - MMIS (Fall 2016).
Information Retrieval and Organisation, Chapter 13: Text Classification and Naïve Bayes. Dell Zhang, Birkbeck, University of London.
CS 378 Introduction to Data Mining (Spring 2009), Lecture 2: vector space model, Latent Semantic Indexing (LSI), SVD. Lecturer: Inderjit Dhillon.
18.9 Support Vector Machines (from Chapter 18, Learning from Examples).
Stephen Scott, lecture slides (adapted from Ethem Alpaydin and Tom Mitchell).
A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier. Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov.
CMSC 422 Introduction to Machine Learning, Lecture 4: Geometry and Nearest Neighbors. Furong Huang.
Algorithms for Nearest Neighbors: Background and Two Challenges. Yury Lifshits, Steklov Institute of Mathematics at St. Petersburg. McGill University, July 2007.
Data Mining Recitation Notes, Week 3. Jack Rae, January 28, 2013 (information retrieval: finding the k most similar documents to a query).
Indirect Sampling in Case of Asymmetrical Link Structures. Torsten Harms.
Chapter 6: Frequent Pattern Mining: Concepts and Apriori. Meng Jiang, CSE 40647/60647 Data Science (Fall 2017), Introduction to Data Mining.
Algorithm-Independent Learning Issues. Selim Aksoy, Department of Computer Engineering, Bilkent University (CS 551, Spring 2007).
Multimedia Databases (14: Indexes for Multimedia Data). Wolf-Tilo Balke and Silviu Homoceanu, Institut für Informationssysteme, Technische Universität Braunschweig.
Text Mining. Dr. Yanjun Li, Associate Professor, Department of Computer and Information Sciences, Fordham University.
Algebraic Gossip: A Network Coding Approach to Optimal Multiple Rumor Mongering. Supratim Deb, Muriel Médard, and Clifford Choute, Laboratory for Information & Decision Systems, Massachusetts Institute of Technology.
Chapter 1: Overview of Statistical Learning (HTF 2.1-2.6, 2.9). Yongdai Kim, Seoul National University.
Introduction to Pattern Recognition, Lectures 4 and 5: Bayesian Decision Theory. Jussi Tohka, Institute of Signal Processing, Tampere University of Technology.
Pleiades: Subspace Clustering and Evaluation. Ira Assent, Emmanuel Müller, Ralph Krieger, Timm Jansen, and Thomas Seidl, RWTH Aachen University.
Similarity Search. Stony Brook University, CSE545 (Fall 2016): finding similar items (mirrored web pages, plagiarism, recommendations).
Privacy Preserving Distance Computation Using Somewhat-Trusted Third Parties. Abelino Jimenez and Bhiksha Raj, Carnegie Mellon University, Pittsburgh, PA.
2 Generating Functions: algebraic methods for counting and proving combinatorial identities.
Lecture 17, 03/21/2017. CS 224: Advanced Algorithms (Spring 2017), Prof. Piotr Indyk.
CME323 Distributed Algorithms and Optimization: GloVe on Spark. Alex Adamson, June 6, 2016.
CS 124 Section #8: Hashing, Skip Lists (3/20/17), with a probability review (expectation as a weighted average).
Chapter 7: Probability (7.1 What is it and why should we care?).
Comparative Document Analysis for Large Text Corpora. Xiang Ren, Yuanhua Lv, Kuansan Wang, Jiawei Han; University of Illinois at Urbana-Champaign and Microsoft Research, Redmond.
Notes on Discrete Probability. W4231: Analysis of Algorithms, Columbia University, Handout 3 (September 21, 1999), Professor Luca Trevisan.
Lecture 4: Linear Discriminant Analysis. Fredrik Lindsten, Division of Systems and Control, Department of Information Technology, Uppsala University.
Practical Polar Code Construction Using Generalised Generator Matrices. Berksan Serbetci and Ali E. Pusane, Department of Electrical and Electronics Engineering, Bogazici University, Istanbul.
Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications. Pengtao Xie, Machine Learning Department, Carnegie Mellon University.
Behavioral Data Mining, Lecture 7: Linear and Logistic Regression (regularization, stochastic gradient).
5.2.2 Application: Bucket Sort (breaking the comparison-sort lower bound under certain assumptions on the input).
Lecture A: q-analogues of n and n!; the falling factorial (x)_k = x(x-1)(x-2)...(x-k+1); partition numbers, Stirling numbers, Bell numbers.