LOCALITY SENSITIVE HASHING FOR BIG DATA

Slide 1: Locality Sensitive Hashing for Big Data
Wei Wang, University of New South Wales

Slide 2: Outline
- NN and c-ANN queries
- Existing LSH methods for large data
- Our approach: SRS [VLDB 2015]
- Conclusions

Slide 3: NN and c-ANN Queries
- A set of points D = {O_1, ..., O_n} in d-dimensional Euclidean space (d is large, e.g., hundreds)
- Given a query point q, find the closest point O* in D
- Relaxed version (c-ANN, aka (1+ε)-ANN): return a point whose distance to q is at most c · dist(O*, q)
- May return a c-ANN point with at least constant probability
- [Figure: query q with radii r and cr in d-dimensional space; D = {a, b, x}, NN = a, and x is a valid c-ANN answer]
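
To make the definitions concrete, here is a minimal brute-force sketch (ours, not code from the talk; requires numpy) that finds the exact NN and checks the c-ANN acceptance criterion:

```python
import numpy as np

def nearest_neighbor(D, q):
    """Exact NN by linear scan: O(dn) time."""
    dists = np.linalg.norm(D - q, axis=1)
    i = int(np.argmin(dists))
    return i, dists[i]

def is_c_ann(D, q, candidate, c):
    """candidate is a valid c-ANN answer iff dist(q, candidate) <= c * dist(q, O*)."""
    _, d_star = nearest_neighbor(D, q)
    return np.linalg.norm(D[candidate] - q) <= c * d_star

rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 256))   # n = 1000 points in d = 256 dimensions
q = rng.normal(size=256)
idx, dist = nearest_neighbor(D, q)
print(idx, dist, is_c_ann(D, q, idx, c=4))
```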

Slide 4: Motivations
- Theoretical interest
  - Fundamental geometric problem: the post-office problem
  - Known to be hard due to high dimensionality
- Many applications
  - Feature vectors: data mining, multimedia databases
  - Applications based on similarity queries, e.g., quantization in coding/compression, recommendation systems, bioinformatics/chemoinformatics

Slide 5: Applications
- SIFT feature matching
- Image retrieval, understanding, scene rendering
- Video retrieval

Slide 6: Challenges
- Curse of dimensionality / concentration of measure
  - Hard to find algorithms sub-linear in n and polynomial in d
  - The approximate version (c-ANN) is not much easier; more details on the next slide
- Large data size
  - 1 KB for a single point with 256 dimensions; 1B points = 1 TB
  - ImageNet (Stanford & Princeton): 14 million images, ~100 feature vectors per image, i.e., 1.4B feature vectors
- High dimensionality: e.g., finding similar words in D = 198 million tweets [JMLR 2013]

Slide 7: Existing Solutions
- NN: linear scan is (practically) the best approach using linear space and time
  - O(d^5 · log n) query time, O(n^{2d+δ}) space
  - O(d · n^{1-ε(d)}) query time, O(dn) space
  - Linear scan: O(dn/B) I/Os, O(dn) space
- (1+ε)-ANN: LSH is the best approach using sub-quadratic space
  - O(log n + 1/ε^{(d-1)/2}) query time, O(n · log(1/ε)) space
  - Probabilistic tests remove the exponential dependency on d:
    - Fast JLT: O(d · log d + ε^{-3} · log² n) query time, O(n^{max(2, ε^{-2})}) space
    - LSH-based: Õ(d · n^{ρ+o(1)}) query time, Õ(n^{1+ρ+o(1)} + nd) space, where ρ = 1/(1+ε) + o_c(1)

Slide 8: Approximate NN for Multimedia Retrieval
- Cover-tree, spill-tree
- Reduction to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)
- These are heuristic methods (i.e., no quality guarantee) or rely on hidden characteristics of the data

Slide 9: Locality Sensitive Hashing (LSH)
- LSH is the best approach using sub-quadratic space
- Analogy with equality search (hashing):
  - Index: store o in the bucket h(o)
  - Query: retrieve every o in the bucket h(q) and verify whether o = q
- LSH: for h drawn from an LSH family, Pr[h(q) = h(o)] is inversely related to Dist(q, o); h :: R^d → Z (technically, the family depends on r)
- Nearby points have a higher chance of colliding with q than faraway points
- [Figure: query q with radii r and cr in d-dimensional space; nearby point a (blue) collides with q more often than faraway point b (red)]
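
As a concrete illustration (our sketch, not code from the talk), one member of the classic 2-stable LSH family for Euclidean distance hashes a point by projecting it onto a random Gaussian vector and quantizing with bucket width r:

```python
import numpy as np

class E2LSH:
    """One hash function from the 2-stable (Euclidean) LSH family:
    h(o) = floor((a . o + b) / r), with a ~ N(0, I) and b ~ U[0, r)."""
    def __init__(self, d, r, rng):
        self.a = rng.normal(size=d)   # 2-stable projection vector
        self.b = rng.uniform(0, r)    # random offset
        self.r = r                    # bucket width; this is the "r" the family depends on

    def __call__(self, o):
        return int(np.floor((self.a @ o + self.b) / self.r))

rng = np.random.default_rng(0)
h = E2LSH(d=256, r=4.0, rng=rng)
q = rng.normal(size=256)
near = q + 0.01 * rng.normal(size=256)   # close to q
far = q + 10.0 * rng.normal(size=256)    # far from q
print(h(q) == h(near), h(q) == h(far))   # near points collide far more often
```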

Slide 10: LSH: Indexing & Query Processing (constant success probability)
- Indexing, for a fixed r:
  - sig(o) = ⟨h_1(o), h_2(o), ..., h_k(o)⟩; store o in the bucket sig(o) (this controls the false-positive size)
  - Iteratively increase r
- Query processing:
  - Search with a fixed r: retrieve and verify the points in the bucket sig(q); this solves the (r, c·r)-NN problem
  - Repeat this L times (boosting)
  - Galloping search to find the first good r: incurs additional cost, and gives only a c² quality guarantee
- [Figure: query q with radii r and cr in d-dimensional space]
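
A minimal sketch of this table layout (ours; it reuses the hypothetical E2LSH class from the previous snippet, and k, L, r are the usual LSH parameters):

```python
from collections import defaultdict

class LSHIndex:
    """L hash tables, each keyed by a k-wise signature sig(o) = (h_1(o), ..., h_k(o))."""
    def __init__(self, d, k, L, r, rng):
        self.hashes = [[E2LSH(d, r, rng) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def sig(self, t, o):
        return tuple(h(o) for h in self.hashes[t])

    def insert(self, oid, o):
        for t in range(len(self.tables)):
            self.tables[t][self.sig(t, o)].append(oid)

    def query(self, q):
        # Union of candidates over the L tables (boosting);
        # the caller verifies the true distances of the candidates.
        cands = set()
        for t in range(len(self.tables)):
            cands.update(self.tables[t].get(self.sig(t, q), []))
        return cands
```

Larger k makes each bucket more selective (fewer false positives), while larger L boosts the probability that a truly near point collides with q in at least one table.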

Slide 11: Locality Sensitive Hashing (LSH), continued
- Standard LSH: c²-ANN → many R(c^i, c^{i+1})-NN problems
- Superior empirical query performance plus theoretical guarantees, without assumptions on data distributions
- Main problem: huge super-linear index size, Õ(log(dist_NN) · n^{1+ρ+o(1)}), typically hundreds of times larger than the data size

Slide 12: Dealing with Large Data (1)
- Approach 1: a single hash table, but probing more buckets
- Multi-probe LSH [VLDB 07, Princeton]:
  - Probe the ±1, ±2, ... buckets around sig(q)
  - Heuristic method; sensitive to parameter settings
- Entropy LSH [SODA 06, Stanford]:
  - Create multiple perturbed queries q_i from q and probe the buckets sig(q_i)
  - O(n^{2/c}) perturbed queries with O(1) hash tables
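
To illustrate the multi-probe idea with a deliberately simplified sketch (the actual algorithm orders perturbations by estimated success probability; this just enumerates ±1 shifts of single signature coordinates):

```python
def multi_probe_sigs(sig, deltas=(1, -1)):
    """Yield sig(q) followed by its single-coordinate perturbations."""
    yield tuple(sig)
    for i in range(len(sig)):
        for delta in deltas:
            probe = list(sig)
            probe[i] += delta
            yield tuple(probe)

# Example: the probing sequence around a k = 3 signature.
for s in multi_probe_sigs((4, -1, 7)):
    print(s)
```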

Slide 13: Dealing with Large Data (2)
- Approach 2: distributed LSH
- Simple distributed LSH:
  - Use a (key, value) store to implement (distributed) hash tables; key → machine ID
  - Network traffic: O(n^{Ω(1)})
- Layered LSH [CIKM 12, Stanford]:
  - Use (h(key), key + value), where h() is an LSH
  - Network traffic: O((log n)^{0.5}) w.h.p.
  - 13 nodes for 64-dim color histograms of the 80 million Tiny Images

Slide 14: Dealing with Large Data (3)
- Approach 3: parallel LSH
- PLSH [VLDB 13, MIT + Intel]:
  - Partition the data across nodes; broadcast the query and let each node process its data locally
  - Meticulous hardware-oriented optimizations:
    - Fit the indexes into main memory at each node
    - SIMD, cache-friendly bitmaps, 2-level hashing for TLB optimization, and other optimizations for streaming data
  - Build predictive performance models for parameter tuning
  - 100 servers (64 GB memory each) for 1B streaming tweets and queries

Slide 15: Dealing with Large Data (4)
- Approach 4: LSH on external memory with index-size reduction
- LSB-forest [SIGMOD 09, TODS 10]: a different reduction from c²-ANN to an R(c^i, c^{i+1})-NN problem
  - O((dn/B)^{0.5}) query, O((dn/B)^{1.5}) space
- C2LSH [SIGMOD 12]: no composite hash keys; fine-granular counting of the number of collisions in m LSH projections
  - O(n · log(n)/B) query, O(n · log(n)/B) space, with small constants in the O(·)
- SRS (ours): O(n/B) query, O(n/B) space

Slide 16: Weakness of External-Memory LSH Methods
- Existing methods use super-linear space: thousands (or more) of hash tables are needed if done rigorously
- People resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
  - Can only handle c where c = x², for integer x ≥ 2, to enable reusing the hash tables (merging buckets)
  - Valuable information is lost (due to quantization)
  - Updates? (changes to n and c)
- Index sizes:
  Dataset       Size     LSB-forest   C2LSH    SRS (ours)
  Audio         40 MB    1500 MB      127 MB   2 MB

Slide 17: SRS: Our Proposed Method
- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability
  - The constants hidden in the O(·) are very small
  - The early-termination condition is provably effective
- Advantages: small index, rich functionality, simple
- Central idea: reduce a c-ANN query in d dims to a kNN query in m dims, with filtering
  - Model the distribution of m 2-stable random projections

Slide 18: 2-stable Random Projection
- Let D be the 2-stable distribution, i.e., the standard normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D and B ~ D: x·A + y·B ~ (x² + y²)^{1/2} · D
- Illustration: for a vector v and projection vectors r_1, r_2 with i.i.d. N(0, 1) entries, ⟨v, r_1⟩ ~ ‖v‖ · N(0, 1)
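
A quick numerical check of the 2-stability property (our sketch; requires numpy): the projection ⟨v, r⟩ is distributed as ‖v‖ · N(0, 1), so its sample standard deviation over many independent projection vectors approaches ‖v‖:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
v = rng.normal(size=d)

R = rng.normal(size=(100_000, d))   # each row is one 2-stable projection vector r_i
z = R @ v                           # z_i = <v, r_i> ~ ||v|| * N(0, 1)

print(np.std(z))           # sample std of the projections...
print(np.linalg.norm(v))   # ...closely matches ||v||
```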

Slide 19: Dist(O) and ProjDist(O) and Their Relationship
- Apply m 2-stable random projections r_1, ..., r_m: a point O in d dims maps to (z_1, ..., z_m) in m dims, where z_i = ⟨O - Q, r_i⟩ ~ ‖O - Q‖ · N(0, 1); think of the z_i as samples
- Hence ProjDist²(O) = z_1² + ... + z_m² ~ Dist²(O) · χ²_m, i.e., a scaled chi-squared distribution with m degrees of freedom
- Ψ_m(x): the cdf of the standard χ²_m distribution
- [Figure: O and Q with distance Dist(O) in d dims; Proj(O) and Proj(Q) with distance ProjDist(O) in m dims]
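
This distributional fact is easy to verify numerically (our sketch; requires numpy and scipy): squared projected distance divided by the true squared distance should follow the standard χ²_m distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, m, trials = 256, 6, 10_000
o, q = rng.normal(size=d), rng.normal(size=d)
dist2 = np.sum((o - q) ** 2)                      # Dist^2(O)

R = rng.normal(size=(trials, m, d))               # m projection vectors per trial
proj_dist2 = np.sum((R @ (o - q)) ** 2, axis=1)   # ProjDist^2(O), one per trial

# ProjDist^2(O) / Dist^2(O) should be chi-squared with m degrees of freedom.
print(stats.kstest(proj_dist2 / dist2, stats.chi2(df=m).cdf))
```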

Slide 20: LSH-like Property
- Intuitive idea: if Dist(O_1) ≪ Dist(O_2), then ProjDist(O_1) < ProjDist(O_2) with high probability
- But the converse is NOT true: with few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-NN objects
- We can, however, bound the expected number of such cases (say T')
- Solution: perform incremental k-NN search in the projected space until T' objects have been accessed, plus an early-termination test

Slide 21: Indexing
- Finding the minimum m:
  - Input: n, c, and T' (the max # of points the algorithm may access)
  - Output: m (the # of 2-stable random projections) and T ≤ T' (a better bound on T')
  - m = O(n/T'); we use T' = O(n), so m = O(1), to achieve a linear-size index
- Generate the m 2-stable random projections, giving n projected points in an m-dimensional space
- Index these projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m · n) = O(n)
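
A toy version of this indexing step (ours; it substitutes an in-memory k-d tree for the external-memory R-tree, since any structure supporting kNN search in the projected space will do; SRSIndex and its fields are our names):

```python
import numpy as np
from scipy.spatial import cKDTree

class SRSIndex:
    """Project the data with m 2-stable projections and index the projections."""
    def __init__(self, data, m=6, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(data.shape[1], m))  # the m projection vectors
        self.data = data
        self.proj = data @ self.W                     # n points in m dims
        self.tree = cKDTree(self.proj)                # supports kNN in m dims

rng = np.random.default_rng(1)
data = rng.normal(size=(10_000, 128))
index = SRSIndex(data, m=6)
print(index.proj.shape)   # (10000, 6): the index is roughly m/d the size of the data
```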

Slide 22: SRS-12(T', c, p_τ)
- Compute proj(q)
- Do incremental kNN search from proj(q); for k = 1 to T':
  - Compute Dist(O_k); maintain O_min = argmin_{1≤i≤k} Dist(O_i)
  - If early-termination test(c, p_τ) = TRUE, BREAK  // stopping condition
- Return O_min
- Early-termination test: Ψ_m(c² · ProjDist²(O_k) / Dist²(O_min)) > p_τ  // stopping condition
- Main theorem: SRS-12 returns a c-NN point with probability p_τ - f(m, c), at O(n) I/O cost
- Example: c = 4, d = 256, m = 6, T' proportional to n, B = 1024, p_τ = 0.18; index size and query cost are both linear in n, with success probability 0.13
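
A sketch of the query loop under these definitions (ours; it reuses the hypothetical SRSIndex above, uses scipy's χ² cdf as Ψ_m, and fetches the T' projected neighbors in one batch instead of truly incrementally):

```python
import numpy as np
from scipy.stats import chi2

def srs_query(index, q, T_prime, c=4.0, p_tau=0.18):
    """Scan candidates in order of projected distance; keep the best true
    distance seen; stop early once Psi_m(c^2 * ProjDist^2 / Dist^2(O_min)) > p_tau."""
    psi_m = chi2(df=index.W.shape[1]).cdf             # cdf of the standard chi^2_m
    proj_dists, ids = index.tree.query(q @ index.W, k=T_prime)
    best_id, best_dist = None, np.inf
    for pd, oid in zip(proj_dists, ids):
        dist = np.linalg.norm(index.data[oid] - q)    # true distance in d dims
        if dist < best_dist:
            best_id, best_dist = int(oid), dist
        if psi_m((c * pd) ** 2 / best_dist ** 2) > p_tau:   # early-termination test
            break
    return best_id, best_dist

# Usage with the SRSIndex sketch above:
# oid, dist = srs_query(index, rng.normal(size=128), T_prime=500)
```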

Slide 23: Variations of SRS-12(T', c, p_τ)
(Same skeleton as the previous slide: compute proj(q); incremental kNN search for k = 1 to T'; maintain O_min; break when the early-termination test fires; return O_min.)
1. SRS-1: better quality; query cost is O(T')
2. SRS-2: best quality; query cost bounded by O(n); handles c = 1
3. SRS-12+(T', c, p_τ): better quality; query cost bounded by O(T')
All with success probability at least p_τ.

Slide 24: Other Results
- Uniformly optimal method for L_2
- Easily extended to support top-k c-ANN queries (k > 1)
  - No previously known guarantee on the correctness of the returned results
  - We guarantee correctness with probability at least p_τ if the algorithm stops due to the early-termination condition: 100% in practice (97% in theory)

Slide 25: Experiment Setup
- Algorithms: LSB-forest [SIGMOD 09, TODS 10], C2LSH [SIGMOD 12], SRS-* [VLDB 15]
- Data: see the next slide
- Measures: index size, query cost, result quality, success probability

Slide 26: Datasets
[Table largely lost in transcription; the surviving figures are GB-scale dataset sizes and index sizes of 12.5 PB, 820 GB, and 37 GB.]

Slide 27: Tiny Image Dataset (8M points, 384 dims)
- Fastest: an SRS variant; slowest: C2LSH; quality is the other way around
- One SRS variant has quality comparable to C2LSH, yet at a much lower cost
- SRS-* dominates LSB-forest

Slide 28: Approximate Nearest Neighbor
- Empirically better than the theoretical guarantee
- With 15% of the I/Os of a linear scan, returns the NN with probability 71%
- With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%

Slide 29: Large Dataset (1 billion points, 128 dims)
- Works on a PC (4 GB memory, 500 GB hard disk)

Slide 30: Theoretical Guarantees Are Important
- Success probability in a hard setting

Slide 31: Summary
- Index size is the bottleneck for traditional LSH-based methods
  - Limits the data size one node can support
  - Needs many nodes with large memory to scale to billions of data points
- SRS reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in a low-dimensional space
  - Our index size is approximately m/d of the size of the data file
  - Easily generalized to a parallel implementation
- Ongoing work:
  - Scaling to even larger datasets using data partitioning
  - Accommodating rapid updates
  - Other theoretical work on LSH

Slide 32: Q&A
Similarity Query Processing Project Homepage:

Slide 34: Backup Slides: Analysis

Slide 35: Stopping Condition
- Near point: the NN point; its distance is r. Far points: points whose distance is > c·r
- Then for any β > 0 and any o:
  - Pr[ProjDist(o) ≤ β·r | o is a near point] = Ψ_m(β²)
  - Pr[ProjDist(o) ≤ β·r | o is a far point] ≤ Ψ_m(β²/c²)
  - Both because ProjDist²(o)/Dist²(o) ~ χ²_m
- P1 = Pr[the NN point is projected before β·r] = Ψ_m(β²)
- P2 = Pr[# of bad points projected before β·r < T'] > 1 - Ψ_m(β²/c²) · (n/T')  (by Markov's inequality)
- Choose β such that P1 + P2 - 1 > 0; feasible due to the good concentration bound for χ²_m
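
Both P1 and P2 are directly computable from the χ²_m cdf, so the feasibility of choosing β can be checked numerically (our sketch; requires numpy and scipy; β is simply scanned over a grid):

```python
import numpy as np
from scipy.stats import chi2

def success_lower_bound(m, c, n, T_prime):
    """Scan beta for the best lower bound P1 + P2 - 1 on the success probability."""
    psi_m = chi2(df=m).cdf
    betas = np.linspace(0.1, 10.0, 1000)
    p1 = psi_m(betas ** 2)                                # NN projected before beta*r
    p2 = 1 - psi_m(betas ** 2 / c ** 2) * (n / T_prime)   # < T' far points before beta*r
    bound = p1 + p2 - 1
    i = int(np.argmax(bound))
    return betas[i], bound[i]

beta, prob = success_lower_bound(m=6, c=4, n=1_000_000, T_prime=10_000)
print(f"beta = {beta:.2f}, success probability >= {prob:.3f}")
```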

Slide 36: Choosing β
[Figure: probability (y-axis) of seeing an object at a given projected distance (x-axis); blue curve: object at distance 1, red curve: object at distance 4, with c = 4; the χ²_m mode is at m - 2, the red curve is scaled by 4·c² = 64 relative to 4, and the threshold β·r is marked on the distance axis]

Slide 37: ProjDist(O_T'): Case I
- Consider cases where both conditions hold (re. near and far points); this happens with probability ≥ P1 + P2 - 1
- In d dims: the NN point is within distance r of q, and fewer than n "far" points lie beyond c·r. In m dims: fewer than T' "far" points are projected before β·r
- Case I: ProjDist(O_T') ≥ β·r. Every point projected before β·r, including the NN point, has then been accessed, so O_min = the NN point

Slide 38: ProjDist(O_T'): Case II
- Same setting as the previous slide: both conditions hold, with probability ≥ P1 + P2 - 1
- Case II: ProjDist(O_T') < β·r. All T' accessed points are then projected before β·r, while fewer than T' "far" points project before β·r, so at least one accessed point is within c·r; hence O_min = a c-NN point

Slide 39: Early-termination Condition
- Proof omitted here
- Also relies on the fact that the sum of the m squared projected distances follows a scaled χ²_m distribution
- Key to handling the case where c = 1: returns the NN point with guaranteed probability, which is impossible for LSH-based methods
- Guarantees the correctness of the top-k c-ANN points returned when stopped by this condition; no such guarantee exists in any previous method

More information

- CSE 417T: Introduction to Machine Learning: Final Review. Henry Chai, 12/4/18
- Locality-sensitive hashing using stable distributions (chapter 4)
- CS246: Mining Massive Datasets. Jure Leskovec, Stanford University
- Optimal Data-Dependent Hashing for Approximate Near Neighbors. Alexandr Andoni, Ilya Razenshteyn
- CS 224: Advanced Algorithms, Lecture 14 (October 16, 2014). Jelani Nelson
- CS246: Mining Massive Datasets (CS341 info session edition). Jure Leskovec, Stanford University
- Randomized Algorithms (EECS-6898, Columbia University). Sanjiv Kumar, Google Research
- More algorithms for streams (4/26/2017)
- Composite Quantization for Approximate Nearest Neighbor Search. Jingdong Wang, Microsoft Research
- Multimedia Databases: Indexes for Multimedia Data. Wolf-Tilo Balke, Silviu Homoceanu, TU Braunschweig
- Lecture 9: Nearest Neighbor Search: Locality Sensitive Hashing (COMS 4995-3). Alex Andoni
- Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics. Chengjie Qin, Florin Rusu
- Tailored Bregman Ball Trees for Effective Nearest Neighbors. Frank Nielsen, Paolo Piro, Michel Barlaud
- CS 224: Advanced Algorithms, Lecture 17 (03/21/2017). Piotr Indyk
- CS246 Final Exam, Winter 2011
- Algorithms for Data Science: Lecture on Finding Similar Items. Barna Saha
- Bayesian Classifiers and Probability Estimation. Vassilis Athitsos, University of Texas at Arlington
- Carnegie Mellon Univ., Database Applications, Lecture #26: Spatial Databases
- Knowledge Discovery and Data Mining 1 (VO): Map-Reduce. Denis Helic, TU Graz
- Nearest Neighbor Search with Keywords. Yufei Tao, KAIST
- Instance-based Learning (CE-717: Machine Learning). M. Soleymani, Sharif University of Technology
- Locality Sensitive Hashing (lecture notes, February 2016)
- Behavioral Simulations in MapReduce. Guozhang Wang et al., Cornell University
- High-Dimensional Indexing by Distributed Aggregation. Yufei Tao, University of Queensland
- Locality sensitive hashing and PageRank slides. Spyros Kontogiannis, Christos Zaroliagis (based on http://www.mmds.org)
- Ch. 10 Vector Quantization: Advantages & Design
- Configuring Spatial Grids for Efficient Main Memory Joins. Farhan Tauheed, Thomas Heinis, Anastasia Ailamaki
- Metric Embedding of Task-Specific Similarity. Greg Shakhnarovich, Brown University
- Spatial Database. Ahmad Alhilal, Dimitris Tsaras
- Multi-Approximate-Keyword Routing Query. Bin Yao, Mingwang Tang, Feifei Li
- Spectral Hashing: Learning to Leverage 80 Million Images. Yair Weiss, Antonio Torralba, Rob Fergus
- Machine learning lecture slides. Jeffrey D. Ullman, Stanford University
- Querying (Instituto Superior Técnico, 1st semester 2008/2009)
- Large-Scale Behavioral Targeting. Ye Chen, Dmitry Pavlov, John Canny
- Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University
- Similarity Search in High Dimensions II. Piotr Indyk, MIT
- Introduction to Randomized Algorithms III. Joaquim Madeira, U. Aveiro
- Ad Placement Strategies (CSE547/STAT548, University of Washington). Emily Fox
- 6.854 Advanced Algorithms: Homework Solutions, Hashing
- Star-Structured High-Order Heterogeneous Data Co-clustering based on Consistent Information Theory. Bin Gao, Tie-Yan Liu, Wei-Ying Ma
- Faster Algorithms for Sparse Fourier Transform. Haitham Hassanieh, Piotr Indyk, Dina Katabi, Eric Price, MIT
- Similarity Search (CSE545, Stony Brook University, Fall 2016)
- CS246: Mining Massive Datasets (web structure lecture). Jure Leskovec, Stanford University
- Introduction to Machine Learning (CMU-10701): Decision Trees. Barnabás Póczos
- An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets. Fatih Akdag, Christoph F. Eick
- CS168: The Modern Algorithmic Toolbox, Lecture #4: Dimensionality Reduction. Tim Roughgarden, Gregory Valiant
- A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window. Costas Busch, Srikanta Tirthapura
- L11: Pattern recognition principles
- Hashing dictionaries (compared to comparison- and pointer-based sorting, binary trees)
- Advanced Implementations of Tables: Balanced Search Trees and Hashing
- Divide-and-Conquer (Chapter 5, slides by Kevin Wayne)
- Clustering with k-means and Gaussian mixture distributions. Jakob Verbeek
- Mining of Massive Datasets slides. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University
- Efficient and Effective Similarity Search over Probabilistic Data based on Earth Mover's Distance. Jia Xu, Zhenjie Zhang, Anthony K. H. Tung, et al.
- Machine Learning: Instance-Based Learning (nonparametric methods). Daniel Weld, CSE 446
- Computing Similarity Between Documents (or Items) (based on http://www.mmds.org)
- ECE 592 Topics in Data Science: Final, Fall 2017
- Algorithms for NLP: Language Modeling III. Taylor Berg-Kirkpatrick, CMU (slides: Dan Klein, UC Berkeley)
- EE381V: Large Scale Learning, Assignment 1 (UT Austin). Caramanis/Sanghavi
- Analysis of Algorithms, Unit 5: Intractable Problems
- CS 124 Section #8: Hashing, Skip Lists (probability review)
- CSCI-567: Machine Learning (Spring 2019). Victor Adamchik, University of Southern California
- CSE332: Data Abstractions, Lecture 2: Math Review; Algorithm Analysis. Dan Grossman
- CS246: Mining Massive Datasets (PageRank lecture). Jure Leskovec, Stanford University
- Faster Algorithms for Sparse Fourier Transform (second copy)
- DM63 Heuristics for Combinatorial Optimization, Lecture 6: Lin-Kernighan Heuristic; Simulated Annealing. Marco Chiarandini
- Bloom Filters: general theory and variants. G. Caravagna
- CS547: Machine Learning for Big Data (recitation announcement). Tim Althoff, University of Washington
- B490 Mining the Big Data: Finding Similar Items. Qin Zhang
- High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction (Chapter 11)
- Graph Analysis Using Map/Reduce. Laurent Linden, Saarland University
- CSE373: Data Structures and Algorithms, Lecture 2: Math Review; Algorithm Analysis. Hunter Zahn
- 15-451/651: Design & Analysis of Algorithms, Lecture #6: Streaming Algorithms
- Chernoff Bounds via Negative Dependence (from MU Ex 5.15)
- Nearest Neighbor Search in Spatial and Spatiotemporal Databases
- Decision Trees (Machine Learning CSEP546). Carlos Guestrin, University of Washington
- Chapter 11: Approximation Algorithms (slides by Kevin Wayne)
- P, NP, NP-Complete, and NP-hard Problems. Zhenjiang Li
- High Dimensional Search: Min-Hashing, Locality Sensitive Hashing. Debapriyo Majumdar, Indian Statistical Institute Kolkata
- Succinct Approximate Counting of Skewed Data. David Talbot, Google Inc.
- Lecture 8: Hashing
- An LSH Index for Computing Kendall's Tau over Top-k Lists. Koninika Pal, Sebastian Michel (arXiv cs.DB, 2 Sep 2014)
- Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment. Anshumali Shrivastava, Ping Li
- 6.047/6.878 Computational Biology: Genomes, Networks, Evolution (MIT OpenCourseWare, Fall 2008)
- Binary Decision Diagrams and Symbolic Model Checking. Randy Bryant, Ed Clarke, Ken McMillan, Allen Emerson
- Similarity Join Size Estimation using Locality Sensitive Hashing. Hongrae Lee, Raymond Ng, Kyuseok Shim
- A Tight Lower Bound for Dynamic Membership in the External Memory Model. Elad Verbin, Qin Zhang
- Proximity problems in high dimensions. Ioannis Psarros
- Data Mining Lecture 6: Similarity and Distance; Sketching, Locality Sensitive Hashing
- CS246 Final Exam (March 16, 2016)
- Clustering with k-means and Gaussian mixture distributions (2012-2013 edition). Jakob Verbeek
- Solving Corrupted Quadratic Equations, Provably. Yuejie Chi
- Mining Data Streams: The Stream Model, Sliding Windows, Counting 1s
- Privacy-Preserving Data Mining (CS 380S). Vitaly Shmatikov
- Spatial Indexing. Vaibhav Bajpai
- Bloom Filters and Locality-Sensitive Hashing. Thomas Kesselheim, Kurt Mehlhorn
- Algorithms for Nearest Neighbor Search, Lecture 4: Restrictions on Input. Yury Lifshits
- TASM: Top-k Approximate Subtree Matching. Nikolaus Augsten, Denilson Barbosa, Michael Böhlen, Themis Palpanas
- CS425: Algorithms for Web Scale Data
- Analysis of Algorithms I: Perfect Hashing. Xi Chen, Columbia University