LOCALITY SENSITIVE HASHING FOR BIG DATA
1 LOCALITY SENSITIVE HASHING FOR BIG DATA
Wei Wang, University of New South Wales
2 Outline
- NN and c-ANN
- Existing LSH methods for large data
- Our approach: SRS [VLDB 2015]
- Conclusions
3 NN and c-ANN Queries
[Figure: query point q in d-dimensional space with balls of radius r and cr; D = {a, b, x}, NN = a, c-ANN = x]
- c-ANN is also known as $(1+\epsilon)$-ANN
- A set of points, $D = \bigcup_{i=1}^{n} \{O_i\}$, in d-dimensional Euclidean space (d is large, e.g., hundreds)
- Given a query point q, find the closest point, O*, in D
- Relaxed version: return a c-ANN point, i.e., a point whose distance to q is at most c * dist(O*, q)
- May return a c-ANN point with at least constant probability
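The two query types above can be made concrete with a small sketch. It is only an illustration of the definitions: the function names and the toy data are hypothetical and do not come from the slides.

```python
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor(data, q):
    """Exact NN by linear scan over all n points (O(d * n) time)."""
    return min(data, key=lambda o: dist(o, q))

def is_c_ann(data, q, candidate, c):
    """True if candidate is at most c times farther from q than the true NN."""
    o_star = nearest_neighbor(data, q)
    return dist(candidate, q) <= c * dist(o_star, q)

# Hypothetical toy data (not from the slides).
D = [(1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
q = (0.0, 0.0)
nn = nearest_neighbor(D, q)           # the point at distance 1
ok = is_c_ann(D, q, (0.0, 3.0), 4)    # distance 3 <= 4 * 1
bad = is_c_ann(D, q, (5.0, 5.0), 4)   # distance ~7.07 > 4 * 1
```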
4 Motivations
- Theoretical interest
  - Fundamental geometric problem: the "post-office problem"
  - Known to be hard due to high dimensionality
- Many applications
  - Feature vectors: data mining, multimedia DB
  - Applications based on similarity queries, e.g., quantization in coding/compression, recommendation systems, bioinformatics/chemoinformatics
5 Applications
- SIFT feature matching
- Image retrieval, understanding, scene rendering
- Video retrieval
6 Challenges
- Curse of dimensionality / concentration of measure
  - Hard to find algorithms sub-linear in n and polynomial in d
  - Approximate version (c-ANN) is not much easier (more details on the next slide)
- Large data size
  - 1KB for a single point with 256 dims, so 1B pts = 1TB
  - image-net (Stanford & Princeton): 14 million images, ~100 feature vectors per image, so 1.4B feature vectors
- High dimensionality
  - Finding similar words in D = 198 million tweets [JMLR 2013]
7 Existing Solutions
- NN: linear scan is (practically) the best approach using linear space & time
  - $O(d^5 \log n)$ query time, $O(n^{2d+\delta})$ space
  - $O(d n^{1-\epsilon(d)})$ query time, $O(dn)$ space
  - Linear scan: $O(dn/B)$ I/Os, $O(dn)$ space
- $(1+\epsilon)$-ANN: LSH is the best approach using sub-quadratic space
  - $O(\log n + 1/\epsilon^{(d-1)/2})$ query time, $O(n \log(1/\epsilon))$ space
  - Probabilistic test: removes the exponential dependency on d
  - Fast JLT: $O(d \log d + \epsilon^{-3} \log^2 n)$ query time, $O(n^{\max(2, \epsilon^{-2})})$ space
  - LSH-based: $\tilde{O}(d n^{\rho+o(1)})$ query time, $\tilde{O}(n^{1+\rho+o(1)} + nd)$ space, where $\rho = 1/(1+\epsilon) + o_c(1)$
8 Approximate NN for Multimedia Retrieval
- Cover-tree
- Spill-tree
- Reduce to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)
- They are heuristic methods (i.e., no quality guarantee) or rely on some hidden characteristics of the data
9 Locality Sensitive Hashing (LSH)
- LSH is the best approach using sub-quadratic space
- Equality search
  - Index: store o into bucket h(o)
  - Query: retrieve every o in the bucket h(q), verify if o = q
- LSH
  - $h \in$ LSH-family, $\Pr[h(q) = h(o)] \propto 1/\mathrm{Dist}(q, o)$
  - $h :: \mathbb{R}^d \to \mathbb{Z}$ (technically, dependent on r)
  - Nearby points (blue) have a higher chance of colliding with q than faraway points (red)
[Figure: query point q in d-dimensional space with balls of radius r and cr, points a and b]
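As a rough illustration of the collision property, the sketch below uses the well-known hash for Euclidean distance, h(o) = floor((a . o + b) / w) with Gaussian a and uniform b. The slide does not say which family it has in mind, so this choice, the bucket width w, and all names are assumptions.

```python
import math
import random

def make_hash(d, w, rng):
    """Draw one hash h(o) = floor((a . o + b) / w), a ~ N(0, 1)^d, b ~ U[0, w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    def h(o):
        return math.floor((sum(ai * oi for ai, oi in zip(a, o)) + b) / w)
    return h

def collision_rate(q, o, d, w, trials, rng):
    """Fraction of freshly drawn hash functions under which q and o collide."""
    hits = 0
    for _ in range(trials):
        h = make_hash(d, w, rng)
        hits += h(q) == h(o)
    return hits / trials

rng = random.Random(0)
d, w = 16, 4.0          # assumed dimension and bucket width
q = [0.0] * d
near = [0.25] * d       # Euclidean distance 1 from q
far = [2.0] * d         # Euclidean distance 8 from q
near_rate = collision_rate(q, near, d, w, 2000, rng)
far_rate = collision_rate(q, far, d, w, 2000, rng)
```

With these settings the nearby point should collide with q far more often than the faraway one, which is the behaviour the slide describes.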
10 LSH: Indexing & Query Processing
- Indexing
  - For a fixed r: $sig(o) = \langle h_1(o), h_2(o), \ldots, h_k(o) \rangle$; store o into bucket sig(o) (controls the false positive size)
  - Iteratively increase r
- Query processing
  - Search with a fixed r: retrieve and verify points in the bucket sig(q); this is the (r, c*r)-NN problem
  - Repeat this L times (boosting) to reach constant success probability
  - Galloping search to find the first good r; incurs additional cost and gives only a $c^2$ quality guarantee
[Figure: query point q in d-dimensional space with balls of radius r and cr, points a and b]
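The indexing and query steps for a single fixed r might be sketched as follows. This is a minimal in-memory illustration, not the implementation behind the slides: the class name, the Gaussian hash family, and the parameters k, L and w are assumptions.

```python
import math
import random
from collections import defaultdict

class LSHIndex:
    """L hash tables, each keyed by a composite signature of k hash values."""

    def __init__(self, d, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        # For each table: k pairs (a, b) defining h(o) = floor((a . o + b) / w).
        self.params = [
            [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _sig(self, t, o):
        """Composite key sig(o) = (h_1(o), ..., h_k(o)) for table t."""
        return tuple(
            math.floor((sum(ai * oi for ai, oi in zip(a, o)) + b) / self.w)
            for a, b in self.params[t]
        )

    def add(self, o):
        idx = len(self.points)
        self.points.append(o)
        for t, table in enumerate(self.tables):
            table[self._sig(t, o)].append(idx)

    def query(self, q):
        """Collect candidates from bucket sig(q) in every table, then verify."""
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._sig(t, q), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: math.dist(self.points[i], q))

# Hypothetical toy data.
pts = [[0, 0, 0, 0], [1, 0, 0, 0], [10, 10, 10, 10], [-5, 3, 2, 8]]
index = LSHIndex(d=4, k=2, L=3, w=4.0, seed=0)
for p in pts:
    index.add(p)
hit = index.query(pts[2])   # a query equal to an indexed point finds it
```

Larger k shrinks the buckets (fewer false positives), while larger L raises the chance that a true near neighbour shares at least one bucket with q.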
11 Locality Sensitive Hashing (LSH)
- Standard LSH: $c^2$-ANN -> many $R(c^i, c^{i+1})$-NN problems
- Superior empirical query performance + theoretical guarantees w/o assumptions on data distributions
- Main problem: huge super-linear index size, $\tilde{O}(\log(dist_{NN}) \cdot n^{1+\rho+o(1)})$
  - Typically hundreds of times larger than the data size
12 Dealing with Large Data /1
- Approach 1: a single hash table but probing more buckets
- Multi-probe [VLDB 07]: Princeton
  - Probe ±1, ±2, ... buckets from sig(q)
  - Heuristic method; sensitive to parameter settings
- Entropy LSH [SODA 06]: Stanford
  - Create multiple perturbed queries $q_i$ from q
  - Probe buckets of $sig(q_i)$
  - $O(n^{2/c})$ perturbed queries with O(1) hash tables
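The idea of probing neighbouring buckets can be sketched as below. This is a deliberately simplified illustration: it perturbs one coordinate of sig(q) at a time and ignores the score-based probe ordering of the actual Multi-probe method. The function name and the max_shift parameter are hypothetical.

```python
def probe_sequence(sig, max_shift=1):
    """Yield sig itself, then signatures with one coordinate shifted by ±1..±max_shift."""
    yield tuple(sig)
    for pos in range(len(sig)):
        for shift in range(1, max_shift + 1):
            for delta in (-shift, shift):
                probe = list(sig)
                probe[pos] += delta
                yield tuple(probe)

probes = list(probe_sequence((3, 7)))
```

Each yielded signature names one extra bucket to inspect in the same hash table, trading more bucket lookups for fewer tables.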
13 Dealing with Large Data /2
- Approach 2: distributed LSH
- Simple distributed LSH
  - Use a (key, value) store to implement (distributed) hash tables
  - Key -> machine ID
  - Network traffic: $O(n^{\Theta(1)})$
- Layered LSH [CIKM 12]: Stanford
  - Use (h(key), key + value), where h() is an LSH
  - Network traffic: $O((\log n)^{0.5})$ w.h.p.
- 13 nodes for 64-dim color histograms of 80 million Tiny Images
14 Dealing with Large Data /3
- Approach 3: parallel LSH
- PLSH [VLDB 13]: MIT + Intel
  - Partition the data across nodes; broadcast the query and let each node process its data locally
  - Meticulous hardware-oriented optimizations
    - Fit the indexes into main memory at each node
    - SIMD, cache-friendly bitmaps, 2-level hashing for TLB optimization, other optimizations for streaming data
  - Build predictive performance models for parameter tuning
  - 100 servers (64 GB memory) for 1B streaming tweets and queries
15 Dealing with Large Data /4
- Approach 4: LSH on external memory with index size reduction
- LSB-forest [SIGMOD 09, TODS 10]
  - A different reduction from $c^2$-ANN to an $R(c^i, c^{i+1})$-NN problem
  - $O((dn/B)^{0.5})$ query, $O((dn/B)^{1.5})$ space
- C2LSH [SIGMOD 12]
  - Do not use composite hash keys
  - Perform fine-granular counting of the number of collisions in m LSH projections
  - $O(n \log(n)/B)$ query, $O(n \log(n)/B)$ space
- Small constants in O()
- SRS (Ours)
  - $O(n/B)$ query, $O(n/B)$ space
16 Weakness of Ext-Memory LSH Methods
- Existing methods use super-linear space
  - Thousands (or more) of hash tables are needed if rigorous
  - People resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
- Can only handle c, where $c = x^2$, for integer $x \ge 2$
  - To enable reusing the hash table (merging buckets)
- Valuable information lost (due to quantization)
- Update? (changes to n, and c)

Dataset size   LSB-forest   C2LSH+   SRS (Ours)
Audio, 40MB    1500 MB      127 MB   2 MB
17 SRS: Our Proposed Method
- Solving c-ANN queries with O(n) query time and O(n) space with constant probability
  - Constants hidden in O() are very small
  - Early-termination condition is provably effective
- Advantages: small index, rich functionality, simple
- Central idea: c-ANN query in d dims -> kNN query in m dims with filtering
- Model the distribution of m stable random projections
18 2-stable Random Projection
- Let D be the 2-stable random projection = standard Normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D, B ~ D: $x A + y B \sim (x^2 + y^2)^{1/2} \cdot D$
- Illustration: $v \cdot r_1 \sim N(0, \|v\|)$
[Figure: vector v and random directions r_1, r_2]
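The 2-stable property can be checked empirically. The sketch below projects a fixed vector v onto many random Gaussian directions and compares the spread of the results with the norm of v; the vector, sample size, and seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(1)
v = [3.0, 4.0]                       # ||v|| = 5
# Each sample is v . r for a fresh r with i.i.d. N(0, 1) coordinates.
samples = [sum(vi * rng.gauss(0.0, 1.0) for vi in v) for _ in range(20000)]
mean = statistics.fmean(samples)     # should be close to 0
std = statistics.pstdev(samples)     # should be close to ||v|| = 5
```

Here N(0, ||v||) is read as a normal distribution with standard deviation ||v||, which is what the empirical standard deviation approximates.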
19 Dist(O) and ProjDist(O) and Their Relationship
[Figure: points O and Q in d dims with Dist(O) between them; their projections Proj(O) and Proj(Q) onto r_1 and r_2 with ProjDist(O) between them]
- $z_1 := \langle v, r_1 \rangle \sim N(0, \|v\|)$
- $z_2 := \langle v, r_2 \rangle \sim N(0, \|v\|)$
- Think of the $z_i$ as samples
- O in d dims -> m 2-stable random projections -> $(z_1, \ldots, z_m)$ in m dims
- $z_1^2 + \cdots + z_m^2 \sim \|v\|^2 \cdot \chi^2_m$, i.e., a scaled chi-squared distribution with m degrees of freedom
- $\Psi_m(x)$: cdf of the standard $\chi^2_m$ distribution
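A quick simulation can illustrate the relationship between ProjDist and Dist. Under the chi-squared claim above, the ratio ProjDist^2 / Dist^2 should average m, the mean of a chi-squared variable with m degrees of freedom. The difference vector v, the value m = 6, and the helper name are assumptions made for this sketch.

```python
import random

def proj_dist_sq_ratio(v, m, rng):
    """ProjDist^2 / Dist^2 for one draw of m Gaussian projections of v."""
    norm_sq = sum(x * x for x in v)
    # z_i = <v, r_i>, where every r_i has fresh i.i.d. N(0, 1) coordinates.
    z = [sum(x * rng.gauss(0.0, 1.0) for x in v) for _ in range(m)]
    return sum(zi * zi for zi in z) / norm_sq

rng = random.Random(7)
v = [1.0, -2.0, 0.5, 3.0]   # stands for O - Q in d = 4 dims
m = 6
ratios = [proj_dist_sq_ratio(v, m, rng) for _ in range(5000)]
mean_ratio = sum(ratios) / len(ratios)   # should be close to m
```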
20 LSH-like Property
- Intuitive idea: if $Dist(O_1) \ll Dist(O_2)$ then $ProjDist(O_1) < ProjDist(O_2)$ with high probability
- But the inverse is NOT true
  - With few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/cNN objects
  - But we can bound the expected number of such cases (say T)
- Solution: perform incremental kNN search on the projected space until T objects have been accessed, plus an early-termination test
21 Indexing
- Finding the minimum m
  - Input: n, c, T := max # of points to access by the algorithm
  - Output: m := # of 2-stable random projections; $T' \le T$: a better bound on T
  - m = O(n/T). We use T = O(n), so m = O(1), to achieve a linear-space index
- Generate m 2-stable random projections -> n projected points in an m-dimensional space
- Index these projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m * n) = O(n)
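The projection step of the indexing procedure might look like the sketch below. It only generates the m Gaussian projections and the n projected points; building an R-tree or another incremental-kNN index over them is left out. The function names and the toy data are hypothetical.

```python
import random

def project(o, projections):
    """Map a d-dimensional point to its m projected coordinates."""
    return [sum(r_j * o_j for r_j, o_j in zip(r, o)) for r in projections]

def build_index(data, m, seed=0):
    """Draw m projection vectors with N(0, 1) entries and project every point."""
    rng = random.Random(seed)
    d = len(data[0])
    projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    projected = [project(o, projections) for o in data]
    return projections, projected

# Hypothetical toy data: n = 3 points in d = 5 dims, reduced to m = 2 dims.
data = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 0.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0, 2.0]]
projections, projected = build_index(data, m=2)
```

The stored index holds n * m numbers instead of n * d, which is where the O(m * n) = O(n) space cost comes from when m is a constant.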
22 SRS-(T, c, $p_\tau$)
    Compute proj(q)
    Do incremental kNN search from proj(q) for k = 1 to T      // stopping condition
        Compute Dist(O_k)
        Maintain O_min = argmin_{1 <= i <= k} Dist(O_i)
        If early-termination test (c, p_tau) = TRUE             // stopping condition
            BREAK
    Return O_min
- Early-termination test: $\Psi_m\left( c^2 \cdot ProjDist^2(o_k) / Dist^2(o_{min}) \right) > p_\tau$
- Main Theorem: SRS- returns a c-NN point with probability $p_\tau - f(m, c)$ with O(n) I/O cost
- Example: c = 4, d = 256, m = 6, T = n, B = 1024, $p_\tau$ = 0.18: Index = n, Query = n, succ prob = 0.13
(Slide footer: The 3rd China-Australia Workshop)
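The query procedure above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the authors' implementation: incremental kNN search over an R-tree is replaced by sorting all projected distances, the chi-squared cdf uses the closed form that holds for even m only, and all function names, defaults, and the random test data are hypothetical.

```python
import math
import random

def chi2_cdf_even(x, m):
    """cdf of the chi-squared distribution for even m (closed form)."""
    half = x / 2.0
    term = total = 1.0
    for j in range(1, m // 2):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total

def srs_query(data, q, m=6, c=4.0, p_tau=0.18, T=None, seed=0):
    """Return the index of the best point found; m must be even here."""
    rng = random.Random(seed)
    T = len(data) if T is None else T
    R = [[rng.gauss(0.0, 1.0) for _ in q] for _ in range(m)]

    def proj(o):
        return [sum(r_j * o_j for r_j, o_j in zip(r, o)) for r in R]

    proj_q = proj(q)
    proj_dists = [math.dist(proj(o), proj_q) for o in data]
    # Stand-in for incremental kNN search: visit points by projected distance.
    order = sorted(range(len(data)), key=proj_dists.__getitem__)

    best, best_dist = None, math.inf
    for i in order[:T]:                      # stop after at most T points
        d_i = math.dist(data[i], q)
        if d_i < best_dist:
            best, best_dist = i, d_i
        if best_dist == 0.0:                 # exact match, nothing can be closer
            break
        # Early-termination test from the slide.
        if chi2_cdf_even(c * c * proj_dists[i] ** 2 / best_dist ** 2, m) > p_tau:
            break
    return best

demo_rng = random.Random(42)
data = [[demo_rng.uniform(-1, 1) for _ in range(16)] for _ in range(200)]
q = [demo_rng.uniform(-1, 1) for _ in range(16)]
answer = srs_query(data, q)
exact = min(range(len(data)), key=lambda i: math.dist(data[i], q))
```

Setting p_tau = 1.0 disables the early-termination test in this sketch, so with T = n the loop degenerates to a full scan and returns the exact NN.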
23 Variations of SRS-(T, c, $p_\tau$)
    Compute proj(q)
    Do incremental kNN search from proj(q) for k = 1 to T      // stopping condition
        Compute Dist(O_k)
        Maintain O_min = argmin_{1 <= i <= k} Dist(O_i)
        If early-termination test (c, p_tau) = TRUE             // stopping condition
            BREAK
    Return O_min
1. SRS-: better quality; query cost is O(T)
2. SRS-: best quality; query cost bounded by O(n); handles c = 1
3. SRS-(T, c, $p_\tau$): better quality; query cost bounded by O(T)
All with success probability at least $p_\tau$
24 Other Results
- Uniformly optimal method for $L_2$
- Can be easily extended to support top-k c-ANN queries (k > 1)
  - No previously known guarantee on the correctness of returned results
  - We guarantee the correctness with probability at least $p_\tau$, if SRS- stops due to the early-termination condition
  - 100% in practice (97% in theory)
25 Experiment Setup
- Algorithms
  - LSB-forest [SIGMOD 09, TODS 10]
  - C2LSH [SIGMOD 12]
  - SRS-* [VLDB 15]
- Data
- Measures
  - Index size, query cost, result quality, success probability
26 Datasets
[Table residue: dataset sizes; the values 12.5PB, 820GB and 37GB survive, one further GB value is lost]
27 Tiny Image Dataset (8M pts, 384 dims)
- Fastest: SRS-, slowest: C2LSH; quality is the other way around
- SRS- has comparable quality with C2LSH yet has much lower cost
- SRS-* dominates LSB-forest
28 Approximate Nearest Neighbor
- Empirically better than the theoretical guarantee
- With 15% of the I/Os of a linear scan, returns the NN with probability 71%
- With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%
29 Large Dataset (1 Billion, 128 dims)
- Works on a PC (4G mem, 500G hard disk)
30 Theoretical Guarantees Are Important
- Success probability in a hard setting
31 Summary
- Index size is the bottleneck for traditional LSH-based methods
  - Limits the data size one node can support
  - Needs many nodes with large memory to scale to billions of data points
- SRS reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in low-dimensional space
  - Our index size is approximately m/d of the size of the data file
  - Can be easily generalized to a parallel implementation
- Ongoing work
  - Scaling to even larger datasets using data partitioning
  - Accommodating rapid updates
  - Other theoretical work for LSH
32 Q&A
Similarity Query Processing Project Homepage:
34 Backup Slides: Analysis
35 Stopping Condition
- "near" point: the NN point, at distance r
- "far" points: points whose distance > c * r
- Then for any $\epsilon > 0$ and any o:
  - $\Pr[ProjDist(o) \le \epsilon r \mid o \text{ is a near point}] \ge \Psi_m(\epsilon^2)$
  - $\Pr[ProjDist(o) \le \epsilon r \mid o \text{ is a far point}] \le \Psi_m(\epsilon^2 / c^2)$
  - Both because $ProjDist^2(o) / Dist^2(o) \sim \chi^2_m$
- P1 = $\Pr[\text{the NN point projected before } \epsilon r] \ge \Psi_m(\epsilon^2)$
- P2 = $\Pr[\text{\# of bad points projected before } \epsilon r < T] > 1 - \Psi_m(\epsilon^2 / c^2) \cdot (n/T)$
- Choose $\epsilon$ such that P1 + P2 - 1 > 0
  - Feasible due to the good concentration bound for $\chi^2_m$
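The two probability bounds can be evaluated numerically to see that a suitable threshold exists. The sketch below assumes an even m, so that the chi-squared cdf has a closed form, and reads the bound on P2 as a Markov-inequality bound, 1 - (n/T) times the far-point probability. That reading, the function names, and the sample values (m = 6, c = 4, T = n, squared threshold 4) are assumptions rather than values confirmed by the slide.

```python
import math

def chi2_cdf_even(x, m):
    """cdf of the chi-squared distribution for even m (closed form)."""
    half = x / 2.0
    term = total = 1.0
    for j in range(1, m // 2):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total

def success_lower_bound(eps, m, c, n_over_T):
    """Return (P1 bound, P2 bound, P1 + P2 - 1) for a given threshold eps."""
    p1 = chi2_cdf_even(eps ** 2, m)
    # Markov-style bound on seeing fewer than T far points (assumed reading).
    p2 = 1.0 - n_over_T * chi2_cdf_even(eps ** 2 / c ** 2, m)
    return p1, p2, p1 + p2 - 1.0

p1, p2, bound = success_lower_bound(eps=2.0, m=6, c=4.0, n_over_T=1.0)
```

For these sample values the NN point falls within the threshold with probability about 0.32, while a far point does so with probability below 0.001, so P1 + P2 - 1 stays positive.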
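36 Choosing $\epsilon$
[Figure: probability (y-axis) of seeing an object at a given projected distance (x-axis), with one curve for an object with distance 1 and one for an object with distance 4]
- Let c = 4
- Mode = m - 2
- Blue (object with distance 1): 4
- Red (object with distance 4): 4 * c^2 = 64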
37 ProjDist(O_T): Case I
- Consider cases where both conditions hold (re. near and far points), which happens with probability at least P1 + P2 - 1
[Figure: upper axis, distance to q in d dims, with the NN point at r and fewer than n "far" points beyond cr; lower axis, distance to proj(q) in m dims, with fewer than T "far" points before $\epsilon r$ and ProjDist(O_T) marked]
- O_min = the NN point
38 ProjDist(O_T): Case II
- Consider cases where both conditions hold (re. near and far points), which happens with probability at least P1 + P2 - 1
[Figure: upper axis, distance to q in d dims, with the NN point at r and fewer than n "far" points beyond cr; lower axis, distance to proj(q) in m dims, with fewer than T "far" points before $\epsilon r$ and ProjDist(O_T) marked]
- O_min = a cNN point
39 Early-termination Condition
- Proof omitted here
  - It also relies on the fact that the squared sum of m projected distances follows a scaled $\chi^2_m$ distribution
- Key to
  - Handling the case where c = 1
    - Returns the NN point with guaranteed probability
    - Impossible to handle by LSH-based methods
  - Guaranteeing the correctness of the top-k cANN points returned when stopped by this condition
    - No such guarantee by any previous method
More informationP, NP, NP-Complete, and NPhard
P, NP, NP-Complete, and NPhard Problems Zhenjiang Li 21/09/2011 Outline Algorithm time complicity P and NP problems NP-Complete and NP-Hard problems Algorithm time complicity Outline What is this course
More informationHigh Dimensional Search Min- Hashing Locality Sensi6ve Hashing
High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of
More informationSuccinct Approximate Counting of Skewed Data
Succinct Approximate Counting of Skewed Data David Talbot Google Inc., Mountain View, CA, USA talbot@google.com Abstract Practical data analysis relies on the ability to count observations of objects succinctly
More informationLecture 8 HASHING!!!!!
Lecture 8 HASHING!!!!! Announcements HW3 due Friday! HW4 posted Friday! Q: Where can I see examples of proofs? Lecture Notes CLRS HW Solutions Office hours: lines are long L Solutions: We will be (more)
More informationarxiv: v1 [cs.db] 2 Sep 2014
An LSH Index for Computing Kendall s Tau over Top-k Lists Koninika Pal Saarland University Saarbrücken, Germany kpal@mmci.uni-saarland.de Sebastian Michel Saarland University Saarbrücken, Germany smichel@mmci.uni-saarland.de
More informationAsymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment
Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment Anshumali Shrivastava and Ping Li Cornell University and Rutgers University WWW 25 Florence, Italy May 2st 25 Will Join
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationBinary Decision Diagrams and Symbolic Model Checking
Binary Decision Diagrams and Symbolic Model Checking Randy Bryant Ed Clarke Ken McMillan Allen Emerson CMU CMU Cadence U Texas http://www.cs.cmu.edu/~bryant Binary Decision Diagrams Restricted Form of
More informationSimilarity Join Size Estimation using Locality Sensitive Hashing
Similarity Join Size Estimation using Locality Sensitive Hashing Hongrae Lee, Google Inc Raymond Ng, University of British Columbia Kyuseok Shim, Seoul National University Highly Similar, but not Identical,
More informationA Tight Lower Bound for Dynamic Membership in the External Memory Model
A Tight Lower Bound for Dynamic Membership in the External Memory Model Elad Verbin ITCS, Tsinghua University Qin Zhang Hong Kong University of Science & Technology April 2010 1-1 The computational model
More informationProximity problems in high dimensions
Proximity problems in high dimensions Ioannis Psarros National & Kapodistrian University of Athens March 31, 2017 Ioannis Psarros Proximity problems in high dimensions March 31, 2017 1 / 43 Problem definition
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationCS246 Final Exam. March 16, :30AM - 11:30AM
CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions
More informationClustering with k-means and Gaussian mixture distributions
Clustering with k-means and Gaussian mixture distributions Machine Learning and Category Representation 2012-2013 Jakob Verbeek, ovember 23, 2012 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.12.13
More informationSolving Corrupted Quadratic Equations, Provably
Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin
More informationMining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s
Mining Data Streams The Stream Model Sliding Windows Counting 1 s 1 The Stream Model Data enters at a rapid rate from one or more input ports. The system cannot store the entire stream. How do you make
More informationPrivacy-Preserving Data Mining
CS 380S Privacy-Preserving Data Mining Vitaly Shmatikov slide 1 Reading Assignment Evfimievski, Gehrke, Srikant. Limiting Privacy Breaches in Privacy-Preserving Data Mining (PODS 2003). Blum, Dwork, McSherry,
More informationSPATIAL INDEXING. Vaibhav Bajpai
SPATIAL INDEXING Vaibhav Bajpai Contents Overview Problem with B+ Trees in Spatial Domain Requirements from a Spatial Indexing Structure Approaches SQL/MM Standard Current Issues Overview What is a Spatial
More informationBloom Filters and Locality-Sensitive Hashing
Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,
More informationMaking Nearest Neighbors Easier. Restrictions on Input Algorithms for Nearest Neighbor Search: Lecture 4. Outline. Chapter XI
Restrictions on Input Algorithms for Nearest Neighbor Search: Lecture 4 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology Making Nearest
More informationTASM: Top-k Approximate Subtree Matching
TASM: Top-k Approximate Subtree Matching Nikolaus Augsten 1 Denilson Barbosa 2 Michael Böhlen 3 Themis Palpanas 4 1 Free University of Bozen-Bolzano, Italy augsten@inf.unibz.it 2 University of Alberta,
More informationCS425: Algorithms for Web Scale Data
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges
More informationAnalysis of Algorithms I: Perfect Hashing
Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the
More information