Efficient Data Reduction and Summarization

Similar documents
Probabilistic Hashing for Efficient Search and Learning

One Permutation Hashing for Efficient Search and Learning

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Hashing Algorithms for Large-Scale Learning

Nystrom Method for Approximating the GMM Kernel

Ad Placement Strategies

arxiv: v1 [stat.ml] 6 Jun 2011

Introduction to Logistic Regression

Nearest Neighbor Classification Using Bottom-k Sketches

PROBABILISTIC HASHING TECHNIQUES FOR BIG DATA

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

ABC-LogitBoost for Multi-Class Classification

Randomized Algorithms

High Dimensional Search: Min-Hashing, Locality Sensitive Hashing

Minwise hashing for large-scale regression and classification with sparse data

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

A New Space for Comparing Graphs

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Bayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington

Online Advertising is Big Business

CS60021: Scalable Data Mining. Large Scale Machine Learning

Estimation of the Click Volume by Large Scale Regression Analysis May 15, / 50

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

CSC321 Lecture 7: Optimization

COMS 4771 Regression. Nakul Verma

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

Cryptographic Hash Functions

CSC321 Lecture 8: Optimization

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Training Support Vector Machines: Status and Challenges

Decoupled Collaborative Ranking

Introduction to Randomized Algorithms III

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

CS 231A Section 1: Linear Algebra & Probability Review

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

CS6220: DATA MINING TECHNIQUES

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Matrix Assembly in FEA

Comparison of Modern Stochastic Optimization Algorithms

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

Probabilistic Near-Duplicate. Detection Using Simhash

Applied Machine Learning Annalisa Marsico

Click Prediction and Preference Ranking of RSS Feeds

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

Sign Cauchy Projections and Chi-Square Kernel

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Singular Value Decomposition

Sample questions for Fundamentals of Machine Learning 2018

Lecture 7 Artificial neural networks: Supervised learning

Large-Scale Behavioral Targeting

Hash-based Indexing: Application, Impact, and Realization Alternatives

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

CS6220: DATA MINING TECHNIQUES

Introduction to Support Vector Machines

Lecture 4: Perceptrons and Multilayer Perceptrons

Machine Learning Basics

Linear classifiers: Overfitting and regularization

Random Features for Large Scale Kernel Machines

Large-scale machine learning Stochastic gradient descent

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge

18.9 SUPPORT VECTOR MACHINES

Functions and Data Fitting

Algorithms for Data Science: Lecture on Finding Similar Items

Slides based on those in:

Beyond the Point Cloud: From Transductive to Semi-Supervised Learning

Introduction to Machine Learning

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

Models, Data, Learning Problems

Matrix Factorization Techniques for Recommender Systems

Topic 3: Neural Networks

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates

Face detection and recognition. Detection Recognition Sally

Machine Learning Linear Classification. Prof. Matteo Matteucci

Sparse Kernel Machines - SVM

Fast Hierarchical Clustering from the Baire Distance

Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets)

ECE521 Lectures 9 Fully Connected Neural Networks

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth

b-bit Minwise Hashing

ECE521 week 3: 23/26 January 2017

Online Advertising is Big Business

Composite Quantization for Approximate Nearest Neighbor Search

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Learning theory. Ensemble methods. Boosting. Boosting: history

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

CS221 / Autumn 2017 / Liang & Ermon. Lecture 2: Machine learning I

Transcription:

Efficient Data Reduction and Summarization. Ping Li, Department of Statistical Science, Faculty of Computing and Information Science, Cornell University. December 8, 2011.

Hashing Algorithms for Large-Scale Learning. Ping Li, Department of Statistical Science, Faculty of Computing and Information Science, Cornell University.
References:
1. Ping Li, Anshumali Shrivastava, and Christian König. Training Logistic Regression and SVM on 200GB Data Using b-bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW). Technical Reports, 2011.
2. Ping Li, Anshumali Shrivastava, Joshua Moore, and Christian König. Hashing Algorithms for Large-Scale Learning. NIPS 2011.
3. Ping Li and Christian König. Theory and Applications of b-bit Minwise Hashing. Research Highlight Article in Communications of the ACM, August 2011.
4. Ping Li, Christian König, and Wenhao Gui. b-bit Minwise Hashing for Estimating Three-Way Similarities. NIPS 2010.
5. Ping Li and Christian König. b-bit Minwise Hashing. WWW 2010.

How Large Are the Datasets? Conceptually, consider the data as a matrix of size n × D. In modern applications, the number of examples n = 10^6 is common and n = 10^9 is not rare, for example in images, documents, spam, and social and search click-through data. In a sense, click-through data can have an unlimited number of examples. Very high-dimensional (image, text, biological) data are common, e.g., D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), D = 2^64, or even higher. In a sense, D can be made arbitrarily higher by taking into account pairwise, 3-way, or higher-order interactions.

Massive, High-dimensional, Sparse, and Binary Data. Binary sparse data are very common in the real world: For many applications (such as text), binary sparse data are very natural. Many datasets can be quantized/thresholded to be binary without hurting the prediction accuracy. In some cases, even when the original data are not too sparse, they often become sparse when considering pairwise and higher-order interactions.

An Example of Binary (0/1) Sparse Massive Data. A set S ⊆ Ω = {0, 1, ..., D-1} can be viewed as a 0/1 vector in D dimensions. Shingling: each document (Web page) can be viewed as a set of w-shingles. For example, after parsing, the sentence "today is a nice day" becomes:
w = 1: {today, is, a, nice, day}
w = 2: {today is, is a, a nice, nice day}
w = 3: {today is a, is a nice, a nice day}
Previous studies used w >= 5, as the single-word (unigram) model is not sufficient. Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w. (10^5)^5 = 10^25 ≈ 2^83, although in current practice it seems D = 2^64 suffices.
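A minimal sketch of the shingling step above (the function name and the toy sentence are only for illustration); in practice each shingle would additionally be mapped to an index in Ω = {0, ..., D-1}, e.g., via a 64-bit hash:

```python
def shingles(text, w):
    """Return the set of w-shingles (contiguous word w-grams) of a document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

sentence = "today is a nice day"
print(shingles(sentence, 1))  # {'today', 'is', 'a', 'nice', 'day'}
print(shingles(sentence, 2))  # {'today is', 'is a', 'a nice', 'nice day'}
print(shingles(sentence, 3))  # {'today is a', 'is a nice', 'a nice day'}
```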

Challenges of Learning with Massive High-dimensional Data. The data often cannot fit in memory (even when the data are sparse). Data loading takes too long; for example, online algorithms require less memory but often need to iterate over the data for several (or many) passes to achieve sufficient accuracy, and data loading in general dominates the cost. Training can be very expensive, even for simple linear models such as logistic regression and linear SVM. Testing may be too slow to meet the demand; this is especially important for applications in search engines or interactive data visual analytics. The model itself can be too large to store; for example, we cannot really store a vector of weights for fitting logistic regression on data of 2^64 dimensions.

Another Example of Binary Massive Sparse Data. [Figure: example 28x28 digit images.] The zipcode recognition task can be formulated as a 10-class classification problem. The well-known MNIST8m dataset contains 8 million 28x28 small images. We try to use linear SVM (perhaps the fastest classification algorithm). Features for linear SVM:
1. Original pixel intensities: 784 (= 28x28) dimensions.
2. All overlapping 2x2 (or higher) patches (absent/present): binary in 12,448 dimensions; (28-1)^2 × 2^(2x2) + 784 = 12,448 (see the sketch below).
3. All pairwise combinations on top of the 2x2 patches: binary in 77,482,576 dimensions; 12448 × (12448-1)/2 + 12448 = 77,482,576.
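One plausible reading of feature set 2 above is sketched below: binarize each pixel (absent/present), then for each of the (28-1)^2 overlapping 2x2 windows record which of the 2^4 = 16 binary patterns it shows, plus the 784 pixel indicators, giving 12,448 binary features. The threshold and the index layout are assumptions for illustration, not necessarily the exact recipe used in the experiments.

```python
import numpy as np

def patch_features_2x2(img, threshold=0):
    """Binary features: 784 pixel indicators plus, for each of the (28-1)^2
    overlapping 2x2 windows, a one-hot code of its 4-bit on/off pattern."""
    bits = (img.reshape(28, 28) > threshold).astype(np.int64)   # absent/present
    feat = np.zeros(784 + 27 * 27 * 16, dtype=np.int8)          # 12448 dims
    feat[:784] = bits.ravel()
    for i in range(27):
        for j in range(27):
            p = bits[i:i + 2, j:j + 2].ravel()
            code = 8 * p[0] + 4 * p[1] + 2 * p[2] + p[3]        # pattern in [0, 16)
            feat[784 + (i * 27 + j) * 16 + code] = 1
    return feat

print(patch_features_2x2(np.random.rand(784)).shape)            # (12448,)
```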

Linear SVM experiments on the first 20,000 examples (MNIST20k). The first 10,000 are used for training and the remaining 10,000 for testing. [Figure: test accuracy (%) vs. C on MNIST20k for the original pixels, the 2x2 patch features, and all pairwise combinations of 2x2 patches.] C is the usual regularization parameter in linear SVM (the only parameter).

3x3: all overlapping 3x3 patches + all 2x2 patches. 4x4: all overlapping 4x4 patches + ... [Figure: test accuracy (%) vs. C on MNIST20k for the 2x2, 3x3, and 4x4 patch features.] Increasing the number of features does not help much, possibly because the number of examples is small.

Linear SVM experiments on the first 1,000,000 examples (MNIST1m). The first 500,000 are used for training and the remaining 500,000 for testing. [Figure: test accuracy (%) vs. C on MNIST1m for the original, 2x2, 3x3, and 4x4 features.] All pairwise combinations on top of the 2x2 patches would require > 1TB of storage on MNIST1m. Remember, these are merely tiny images to start with.

A Popular Solution Based on Normal Random Projections. Random projections: replace the original data matrix A by B = A × R. R ∈ R^{D x k}: a random matrix with i.i.d. entries sampled from N(0, 1). B ∈ R^{n x k}: the projected matrix, also random. B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(B B^T) = k × A A^T, i.e., (1/k) B B^T is an unbiased estimate of A A^T.
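A minimal sketch of the projection step (the sizes are illustrative). Note the 1/sqrt(k) scaling, so that B B^T is an unbiased estimate of A A^T; equivalently, one can draw the entries of R from N(0, 1/k).

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 100, 10_000, 256

A = (rng.random((n, D)) < 0.01).astype(float)   # a toy sparse binary data matrix
R = rng.standard_normal((D, k))                 # i.i.d. N(0, 1) entries
B = A @ R / np.sqrt(k)                          # projected data, n x k

# B approximately preserves inner products: E[B B^T] = A A^T
print(np.abs(B @ B.T - A @ A.T).mean())
```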

Using Random Projections for Linear Learning. The task is straightforward: A × R = B. Since the new (projected) data B preserve the inner products of the original data A, we can simply feed B into (e.g.) SVM or logistic regression solvers. These days, linear algorithms are extremely fast (if the data can fit in memory). However, random projections may not be good for inner products, because the variance can be much larger than that of b-bit minwise hashing.

Notation for b-bit Minwise Hashing. A binary (0/1) vector can be equivalently viewed as a set (the locations of its nonzeros). Consider two sets S_1, S_2 ⊆ Ω = {0, 1, 2, ..., D-1} (e.g., D = 2^64), with f_1 = |S_1|, f_2 = |S_2|, a = |S_1 ∩ S_2|. The resemblance R is a popular measure of set similarity: R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = a / (f_1 + f_2 - a).

Minwise Hashing. Suppose a random permutation π is performed on Ω, i.e., π: Ω → Ω, where Ω = {0, 1, ..., D-1}. An elementary probability argument shows that Pr( min(π(S_1)) = min(π(S_2)) ) = |S_1 ∩ S_2| / |S_1 ∪ S_2| = R.

An Example. D = 5. S_1 = {0, 3, 4}, S_2 = {1, 2, 3}, R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = 1/5. One realization of the permutation π can be: π(0) = 3, π(1) = 2, π(2) = 0, π(3) = 4, π(4) = 1. Then π(S_1) = {3, 4, 1} = {1, 3, 4} and π(S_2) = {2, 0, 4} = {0, 2, 4}. In this realization, min(π(S_1)) ≠ min(π(S_2)).

Minwise Hashing Estimator. After k independent permutations, π_1, π_2, ..., π_k, one can estimate R without bias, as a binomial:
\hat{R}_M = (1/k) Σ_{j=1}^{k} 1{ min(π_j(S_1)) = min(π_j(S_2)) },
Var(\hat{R}_M) = (1/k) R(1 - R).
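A small simulation of this estimator (the set sizes and k are arbitrary): apply k independent random permutations of Ω and count how often the two minima collide.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 10_000, 500

pool = rng.choice(D, size=400, replace=False)
S1 = set(pool[:300].tolist())              # f1 = 300
S2 = set(pool[100:].tolist())              # f2 = 300, overlap a = 200
R = len(S1 & S2) / len(S1 | S2)            # true resemblance = 0.5

collisions = 0
for _ in range(k):
    pi = rng.permutation(D)                          # one random permutation of Omega
    if min(pi[list(S1)]) == min(pi[list(S2)]):       # min(pi(S1)) == min(pi(S2)) ?
        collisions += 1

R_hat = collisions / k
print(R, R_hat, R * (1 - R) / k)           # truth, unbiased estimate, its variance
```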

Problems with Minwise Hashing. Similarity search: high storage and computational cost. In industry, each hashed value, e.g., min(π(S_1)), is usually stored using 64 bits, and often k = 200 samples (hashed values) are needed. The total storage can be prohibitive: n × 64 × 200 bits = 16 TB if n = 10^10 (web pages). ⇒ computational problems, transmission problems, etc. Statistical learning: high dimensionality and model size. We recently realized that minwise hashing can be used in statistical learning; however, 64-bit hashing leads to an impractically high-dimensional model. b-bit minwise hashing naturally solves both problems!

The Basic Probability Result for b-bit Minwise Hashing. Consider two sets S_1, S_2 ⊆ Ω = {0, 1, 2, ..., D-1}, with f_1 = |S_1|, f_2 = |S_2|, a = |S_1 ∩ S_2|. Define the minimum values under π: Ω → Ω to be z_1 and z_2: z_1 = min(π(S_1)), z_2 = min(π(S_2)), and their b-bit versions z_1^{(b)} = the lowest b bits of z_1, z_2^{(b)} = the lowest b bits of z_2. Example: if z_1 = 7 (= 111 in binary), then z_1^{(1)} = 1 and z_1^{(2)} = 3.
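Taking the lowest b bits is just a bit mask; this reproduces the example above:

```python
def lowest_bits(z, b):
    """The lowest b bits of a hashed value, as an integer in [0, 2^b)."""
    return z & ((1 << b) - 1)

z1 = 7                     # 111 in binary
print(lowest_bits(z1, 1))  # 1
print(lowest_bits(z1, 2))  # 3
```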

Theorem. Assume D is large (which is virtually always true). Then
P_b = Pr( z_1^{(b)} = z_2^{(b)} ) = C_{1,b} + (1 - C_{2,b}) R.
Recall (assuming infinite precision, i.e., as many digits as needed) that Pr(z_1 = z_2) = R.

Theorem. Assume D is large (which is virtually always true). Then
P_b = Pr( z_1^{(b)} = z_2^{(b)} ) = C_{1,b} + (1 - C_{2,b}) R,
where r_1 = f_1/D, r_2 = f_2/D, f_1 = |S_1|, f_2 = |S_2|, and
C_{1,b} = A_{1,b} r_2/(r_1 + r_2) + A_{2,b} r_1/(r_1 + r_2),
C_{2,b} = A_{1,b} r_1/(r_1 + r_2) + A_{2,b} r_2/(r_1 + r_2),
A_{1,b} = r_1 [1 - r_1]^{2^b - 1} / (1 - [1 - r_1]^{2^b}),
A_{2,b} = r_2 [1 - r_2]^{2^b - 1} / (1 - [1 - r_2]^{2^b}).
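The formula can be checked numerically. A rough simulation sketch follows (D, the set sizes, b, and the number of trials are arbitrary; the formula is an approximation that assumes D is large):

```python
import numpy as np

rng = np.random.default_rng(1)
D, b, trials = 5_000, 2, 20_000

pool = rng.choice(D, size=900, replace=False)
S1, S2 = list(pool[:600]), list(pool[300:])     # f1 = f2 = 600, a = 300
f1, f2, a = 600, 600, 300
R = a / (f1 + f2 - a)
r1, r2 = f1 / D, f2 / D

# Theoretical P_b = C_{1,b} + (1 - C_{2,b}) R
A1 = r1 * (1 - r1) ** (2**b - 1) / (1 - (1 - r1) ** (2**b))
A2 = r2 * (1 - r2) ** (2**b - 1) / (1 - (1 - r2) ** (2**b))
C1 = (A1 * r2 + A2 * r1) / (r1 + r2)
C2 = (A1 * r1 + A2 * r2) / (r1 + r2)
P_b = C1 + (1 - C2) * R

# Empirical collision rate of the lowest b bits of the two minima
mask = (1 << b) - 1
hits = 0
for _ in range(trials):
    pi = rng.permutation(D)
    if (min(pi[S1]) & mask) == (min(pi[S2]) & mask):
        hits += 1

print(P_b, hits / trials)
```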

The Variance-Space Trade-off. A smaller b means less space for each sample; however, a smaller b also means a larger variance. B(b; R, r_1, r_2) characterizes the variance-space trade-off: B(b; R, r_1, r_2) = b × Var(\hat{R}_b). Lower B(b) is better.

The ratio B(b_1; R, r_1, r_2) / B(b_2; R, r_1, r_2) measures the improvement of using b = b_2 (e.g., b_2 = 1) over using b = b_1 (e.g., b_1 = 64). B(64)/B(b) = 20 means that, to achieve the same accuracy (variance), using 64 bits per hashed value will require 20 times more space (in bits) than using b bits.

B(64)/B(b): relative improvement of using b = 1, 2, 3, 4 bits, compared to 64 bits. [Figure: improvement B(64)/B(b) vs. resemblance R, for r_1 = r_2 = 10^{-10}, 0.1, 0.5, and 0.9; curves for b = 1, 2, 3, 4.]

Relative Improvements in the Least Favorable Situation. In the least favorable condition, the improvement would be B(64)/B(1) = 64 R / (R + 1). If R = 0.5 (the commonly used similarity threshold), then the improvement will be 64/3 ≈ 21.3-fold, which is large enough for industry. It turns out that, in terms of dimension reduction, the improvement is like 2^64 / 2^8-fold.

Experiments: Duplicate detection on Microsoft news articles. The dataset was crawled as part of the BLEWS project at Microsoft. We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R_0. [Figure: precision vs. sample size k for thresholds R_0 = 0.3 and R_0 = 0.4; curves for b = 1, 2, 4 and M (original minwise hashing).]

Using b = 1 or 2 is almost as good as b = 64, especially for higher thresholds R_0. [Figure: precision vs. sample size k for R_0 = 0.5, 0.6, 0.7, and 0.8; curves for b = 1, 2, 4 and M (original minwise hashing).]

b-bit Minwise Hashing for Large-Scale Linear Learning. Linear learning algorithms require the estimators to be inner products. If we look carefully, the estimator of minwise hashing is indeed an inner product of two (extremely sparse) vectors in D × k dimensions (infeasible when D = 2^64):
\hat{R}_M = (1/k) Σ_{j=1}^{k} 1{ min(π_j(S_1)) = min(π_j(S_2)) },
because, with z_1 = min(π(S_1)), z_2 = min(π(S_2)) ∈ Ω = {0, 1, ..., D-1},
1{z_1 = z_2} = Σ_{i=0}^{D-1} 1{z_1 = i} × 1{z_2 = i}.

For example, with D = 5, z_1 = 2, z_2 = 3: 1{z_1 = z_2} = the inner product between {0, 0, 1, 0, 0} and {0, 1, 0, 0, 0}.
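Spelled out in code (the slot chosen for each value follows the placement used in the example above; any fixed one-to-one placement gives the same inner products):

```python
def one_hot(z, D):
    """Indicator vector of length D with a single 1 marking the value z."""
    v = [0] * D
    v[D - 1 - z] = 1          # slot D-1-z, matching the example above
    return v

u, w = one_hot(2, 5), one_hot(3, 5)              # {0,0,1,0,0} and {0,1,0,0,0}
print(sum(x * y for x, y in zip(u, w)))          # inner product = 0 = 1{z1 = z2}
```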

Integrating b-bit Minwise Hashing for (Linear) Learning.
1. We apply k independent random permutations to each (binary) feature vector x_i and store the lowest b bits of each hashed value. This way, we obtain a new dataset which can be stored using merely n × b × k bits.
2. At run-time, we expand each new data point into a 2^b × k-length vector, i.e., we concatenate the k vectors (each of length 2^b). The new feature vector has exactly k 1's.

Consider an example with k = 3 and b = 2. For set (vector) S_1:
Original hashed values (k = 3): 12013, 25964, 20191
Original binary representations: 10111011101101, 110010101101100, 100111011011111
Lowest b = 2 binary digits: 01, 00, 11
Corresponding decimal values: 1, 0, 3
Expanded 2^b = 4 binary digits: 0010, 0001, 1000
New feature vector fed to a solver: {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}
Then we apply the same permutations and procedures to sets S_2, S_3, ...
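A sketch of this expansion, using the numbers in the example (b = 2, k = 3). Placing the 1 at slot 2^b - 1 - value inside each block reproduces the example's output; the particular fixed ordering is immaterial to the learning algorithm.

```python
def expand(hashed_values, b):
    """Keep the lowest b bits of each of the k hashed values and concatenate
    k one-hot blocks of length 2^b; the result has exactly k ones."""
    width = 1 << b
    vec = []
    for z in hashed_values:
        low = z & (width - 1)            # lowest b bits
        block = [0] * width
        block[width - 1 - low] = 1       # 1 -> 0010, 0 -> 0001, 3 -> 1000
        vec.extend(block)
    return vec

print(expand([12013, 25964, 20191], b=2))
# [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]  -- matches the feature vector above
```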

Datasets and Solver for Linear Learning. This talk only presents two text datasets; experiments on image data (1TB or larger) will be presented in the future.
Dataset | # Examples (n) | # Dims (D) | Avg # Nonzeros | Train / Test
Webspam (24 GB) | 350,000 | 16,609,143 | 3728 | 80% / 20%
Rcv1 (200 GB) | 781,265 | 1,010,017,424 | 12062 | 50% / 50%
We chose LIBLINEAR as the basic solver for linear learning. Note that our method is purely statistical/probabilistic and is independent of the underlying optimization procedure. All experiments were conducted on workstations with a Xeon(R) CPU (W55 @ 3.33GHz) and 48GB RAM, under Windows 7.

Experimental Results on Linear SVM and Webspam. We conducted extensive experiments for a wide range of regularization values C (from 10^{-3} up), with fine spacings in [0.1, 10]. We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.

Testing Accuracy. [Figure: test accuracy (%) vs. C for linear SVM (k = 200) on webspam, for various b. Solid: b-bit hashing; dashed (red): the original data.] Using b >= 8 and k >= 200 achieves about the same test accuracies as using the original data. The results are averaged over 50 runs.

[Figure: test accuracy (%) vs. C for linear SVM on webspam, for k = 50, 100, 150, 200, 300, and 500, for various b.]

Stability of Testing Accuracy (Standard Deviation). Our method produces very stable results, especially for b >= 4. [Figure: standard deviation of the test accuracy (%) vs. C for linear SVM on webspam, for k = 50, 100, 150, 200, 300, and 500.]

Training Time. [Figure: training time vs. C for linear SVM on webspam, for k = 50, 100, 150, 200, 300, and 500, for various b.] The reported times do not include data loading time (which is small for b-bit hashing). Training on the original data takes considerably longer, while b-bit minwise hashing needs only about 3-7 seconds.

Testing Time. [Figure: testing time (seconds) vs. C for linear SVM on webspam, for k = 50, 100, 150, 200, 300, and 500.]

Experimental Results on L2-Regularized Logistic Regression. Testing Accuracy. [Figure: test accuracy (%) vs. C for logistic regression on webspam, for k = 50, 100, 150, 200, 300, and 500, for various b.]

Training Time. [Figure: training time vs. C for logistic regression on webspam, for k = 50, 100, 150, 200, 300, and 500.]

Comparisons with Other Algorithms. We conducted extensive experiments with the VW algorithm, which has the same variance as random projections. Vowpal Wabbit (VW) here refers to the hashing algorithm in (Weinberger et al., ICML 2009). Consider two sets S_1, S_2, with f_1 = |S_1|, f_2 = |S_2|, a = |S_1 ∩ S_2|, R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = a / (f_1 + f_2 - a). Then, from their variances
Var(\hat{a}_{VW}) ≈ (1/k)(f_1 f_2 + a^2),
Var(\hat{R}_{minwise}) = (1/k) R(1 - R),
we can immediately see the significant advantages of minwise hashing, especially when a ≈ 0 (which is common in practice).

Comparing b-bit Minwise Hashing with VW. 8-bit minwise hashing (dashed, red) with k >= 200 (and C >= 1) achieves about the same test accuracy as VW with k = 10^4 to 10^6. [Figure: test accuracy (%) vs. k for VW and 8-bit hashing on webspam, linear SVM and logistic regression, for several values of C.]

8-bit hashing is substantially faster than VW (to achieve the same accuracy). [Figure: training time vs. k for VW and 8-bit hashing on webspam, linear SVM and logistic regression, for several values of C.]

Experimental Results on Rcv1 (200GB). Test accuracy using linear SVM (we cannot train on the original data). [Figure: test accuracy (%) vs. C for linear SVM on rcv1, for k = 50, 100, 150, 200, 300, and 500, for various b.]

Training time using linear SVM. [Figure: training time vs. C for linear SVM on rcv1, for k = 50, 100, 150, 200, 300, and 500.]

Test accuracy using logistic regression. [Figure: test accuracy (%) vs. C for logistic regression on rcv1, for k = 50, 100, 150, 200, 300, and 500.]

Training time using logistic regression. [Figure: training time vs. C for logistic regression on rcv1, for k = 50, 100, 150, 200, 300, and 500.]

Comparisons with VW on Rcv1. Test accuracy using linear SVM. [Figure: test accuracy (%) vs. k for VW and b-bit hashing (b = 1, 2, 4, 8, 12, 16) on rcv1, linear SVM, for several values of C.]

Test accuracy using logistic regression. [Figure: test accuracy (%) vs. k for VW and b-bit hashing (b = 1, 2, 4, 8, 12, 16) on rcv1, logistic regression, for several values of C.]

Training time. [Figure: training time vs. k for VW and 8-bit hashing on rcv1, linear SVM and logistic regression, for several values of C.]

Conclusion. Minwise hashing is a standard technique in the context of search. b-bit minwise hashing is a substantial improvement, using only b bits per hashed value instead of 64 bits (e.g., a 20-fold reduction in space). Our work is the first proposal to use minwise hashing in linear learning; however, the resultant data dimension may be too large to be practical. b-bit minwise hashing provides a very simple solution, by dramatically reducing the dimensionality from (e.g.) 2^64 × k to 2^b × k. Thus, b-bit minwise hashing can be easily integrated with linear learning to solve extremely large problems, especially when the data do not fit in memory. b-bit minwise hashing has substantial advantages (in many aspects) over other methods such as random projections and VW.

Questions?