
Metric Embedding of Task-Specific Similarity. Greg Shakhnarovich, Brown University. Joint work with Trevor Darrell (MIT). August 9, 2006.

Task-specific similarity. A toy example: [figure: the same set of points grouped as similar either by norm or by angle].

The problem. Learn similarity from examples of what is similar [or not]. Binary similarity: for x, y ∈ X,

S(x, y) = +1 if x and y are similar, -1 if they are dissimilar.

Two goals in mind: Similarity detection: judge whether two entities are similar. Similarity search: given a query entity and a database, find examples in the database that are similar to the query. Our approach: learn an embedding of the data into a space where similarity corresponds to a simple distance.

Task-specific similarity. Articulated pose: [figure].

Task-specific similarity. Visual similarity of image patches: [figure].

Related prior work. Learning a parametrized distance metric [Xing et al; Roweis], such as Mahalanobis distances. Lots of work on low-dimensional graph embedding (MDS, LLE, ...), but it is unclear how to generalize without relying on a distance in X. Embedding a known distance [Athitsos et al]: assumes a known distance, uses the embedding to approximate/speed it up. DistBoost [Hertz et al]: learning a distance for a classification/clustering setup. Locality sensitive hashing: fast similarity search.

Locality Sensitive Hashing [Indyk et al]. Algorithm for finding a (1+ε, r)-neighbor of x_0 with high probability in sublinear time O(N^(1/(1+ε))). Index the data by l random hash functions, and only search the union of the l buckets where the query falls. [Figure: build-up of the hash tables and of the buckets the query falls into.]


Locality sensitive hashing [Indyk et al]. A family H of functions is locality sensitive if

P_{h ~ U[H]}( h(x_0) = h(x) | ||x_0 - x|| <= r ) >= p_1,
P_{h ~ U[H]}( h(x_0) = h(x) | ||x_0 - x|| >= R ) <= p_2.

Uses the gap between TP and FP rates, amplified by concatenating functions into a hash key. Projections on random lines are locality sensitive w.r.t. L_p norms, p <= 2 [Gionis et al, Datar et al].
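
To make the hashing step concrete, here is a minimal sketch (not from the talk) of a quantized random-projection family in the spirit of [Datar et al]: l hash tables, each key formed by concatenating k quantized projections. The function and variable names, and the toy data, are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k, l, w = 16, 6, 10, 1.0      # k projections per key, l tables, bucket width w

def make_hash(dim, k, w, rng):
    """One key = k quantized random projections: floor((a . x + b) / w)."""
    A = rng.standard_normal((k, dim))
    b = rng.uniform(0.0, w, size=k)
    return lambda x: tuple(np.floor((A @ x + b) / w).astype(int))

hashes = [make_hash(dim, k, w, rng) for _ in range(l)]

# Index the database: l tables, each mapping a key to a bucket of point ids.
data = rng.standard_normal((1000, dim))
tables = [dict() for _ in range(l)]
for i, x in enumerate(data):
    for tab, h in zip(tables, hashes):
        tab.setdefault(h(x), []).append(i)

# Query: only the union of the l buckets that the query falls into is searched.
query = data[3] + 0.05 * rng.standard_normal(dim)
candidates = set()
for tab, h in zip(tables, hashes):
    candidates.update(tab.get(h(query), []))
print(len(candidates), 3 in candidates)
```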

How is this relevant? LSH is excellent if L_p is all we want. But L_p may not be a suitable proxy for S: we may waste lots of bits on irrelevant features, and miss pairs that are similar under S but not close w.r.t. L_p. If we know what S is, we may be able to analytically design an embedding of X into L_1 space [Thaper & Indyk, Grauman & Darrell]. We will instead learn LSH-style binary functions that fit S as conveyed by examples.

Our approach. Given: pairs of similar [and pairs of dissimilar] examples, based on the hidden binary similarity S. Two related tasks: Similarity judgment: S(x, y) = ? Similarity search: given a query x_0, find {x_i : S(x_i, x_0) = +1}. Our solution to both problems: a similarity-sensitive embedding H(x) = [α_1 h_1(x), ..., α_M h_M(x)]; we will learn h_m(x) ∈ {0, 1} and α_m.

Desired embedding properties. Embedding H(x) = [α_1 h_1(x), ..., α_M h_M(x)]:

Rely on the L_1 (= weighted Hamming) distance
||H(x_1) - H(x_2)|| = Σ_{m=1}^{M} α_m |h_m(x_1) - h_m(x_2)|.

H is similarity sensitive: for some R, we want
high P_{x_1, x_2 ~ p(x)}( ||H(x_1) - H(x_2)|| <= R | S(x_1, x_2) = +1 ),
low P_{x_1, x_2 ~ p(x)}( ||H(x_1) - H(x_2)|| <= R | S(x_1, x_2) = -1 ).

||H(x) - H(y)|| is a proxy for S.
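
As a small illustration (not code from the talk), the weighted Hamming distance between embeddings can be computed as below; the helper names and the toy thresholds are made up.

```python
import numpy as np

def embed(x, hs, alphas):
    """H(x) = [alpha_1 h_1(x), ..., alpha_M h_M(x)], with h_m(x) in {0, 1} and alpha_m >= 0."""
    return np.array([a * h(x) for h, a in zip(hs, alphas)])

def embedding_distance(Hx, Hy):
    """Weighted Hamming (L1) distance: sum_m alpha_m |h_m(x) - h_m(y)|."""
    return float(np.abs(Hx - Hy).sum())

# Toy usage with two threshold functions on a 2-D point.
hs = [lambda x: int(x[0] <= 0.5), lambda x: int(x[1] <= -1.0)]
alphas = [0.7, 1.3]
print(embedding_distance(embed(np.array([0.2, 0.0]), hs, alphas),
                         embed(np.array([0.9, 0.0]), hs, alphas)))  # 0.7
```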

Projection-based classifiers. For a projection f : X -> R, consider, for some T ∈ R,

h(x; f, T) = 1 if f(x) <= T, 0 if f(x) > T.

This defines a simple similarity classifier of pairs:

c(x, y; f, T) = +1 if h(x; f, T) = h(y; f, T), -1 otherwise.

[Figure: points q, y, z, a, b on the f(x) axis with threshold T; h = 1 to the left of T, h = 0 to the right; c(q, y; f, T) = c(q, z; f, T) = +1, c(q, a; f, T) = c(q, b; f, T) = -1.]
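
Read literally, the thresholded projection and the induced pair classifier are tiny functions; the sketch below just restates the two formulas above in Python, with an illustrative projection onto the first coordinate.

```python
def h(x, f, T):
    """Thresholded projection: 1 if f(x) <= T, else 0."""
    return 1 if f(x) <= T else 0

def c(x, y, f, T):
    """Pair classifier: +1 if x and y fall on the same side of T, else -1."""
    return +1 if h(x, f, T) == h(y, f, T) else -1

# Toy usage: project 2-D points onto their first coordinate, threshold at 0.
f = lambda p: p[0]
print(c((-0.3, 2.0), (-1.2, 5.0), f, 0.0))  # +1: both project below T
print(c((-0.3, 2.0), ( 0.8, 5.0), f, 0.0))  # -1: the pair is split by T
```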

Selecting the threshold. Algorithm for selecting the threshold based on similar/dissimilar pairs. For a moment, we focus on the N positive pairs only. Sort the 2N values of f(x). We need to check at most 2N + 1 values of T, and for each count the number of pairs that are dissected by it (a bad event). [Figure: projected values f(x) on a line, with the count of dissected positive pairs for each candidate threshold.]

Selecting the threshold. We also need to consider negative examples (dissimilar pairs), and estimate the gap: true positive rate TP minus false positive rate FP. [Figure: projected values of the positive (POS) and negative (NEG) pairs along f(x), with the TP - FP gap per candidate threshold.]
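
A minimal sketch of this threshold search, assuming fully labeled similar/dissimilar pairs (counting pairs rather than weighting them, and scanning midpoints between consecutive sorted projected values); all names and the toy data are illustrative.

```python
import numpy as np

def best_threshold(f, pos_pairs, neg_pairs):
    """Pick T maximizing (TP - FP): fraction of positive pairs kept on one side
    of T minus fraction of negative pairs kept together."""
    vals = np.sort([f(x) for pair in pos_pairs + neg_pairs for x in pair])
    candidates = np.concatenate(([vals[0] - 1.0],
                                 (vals[:-1] + vals[1:]) / 2.0,
                                 [vals[-1] + 1.0]))
    same_side = lambda pair, T: (f(pair[0]) <= T) == (f(pair[1]) <= T)
    best_T, best_gap = None, -np.inf
    for T in candidates:
        tp = np.mean([same_side(p, T) for p in pos_pairs])   # positives judged similar
        fp = np.mean([same_side(p, T) for p in neg_pairs])   # negatives judged similar
        if tp - fp > best_gap:
            best_T, best_gap = T, tp - fp
    return best_T, best_gap

# Toy usage: similar pairs are tiny perturbations, dissimilar pairs are independent draws.
rng = np.random.default_rng(1)
f = lambda x: x[0]
pos, neg = [], []
for _ in range(50):
    z = rng.normal(size=2)
    pos.append((z, z + rng.normal(scale=0.05, size=2)))
    neg.append((rng.normal(size=2), rng.normal(size=2)))
print(best_threshold(f, pos, neg))
```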

Optimization of h(x; f, T). Objective (TP - FP gap):

argmax_{f,T} [ Σ_{i=1}^{N+} c(x_i^+, x_i^+') - Σ_{j=1}^{N-} c(x_j^-, x_j^-') ],

where (x_i^+, x_i^+') are the similar pairs and (x_j^-, x_j^-') the dissimilar ones. Parametric projection families, e.g. f(x) = Σ_j θ_j x^{p_1}(d_{1j}) x^{p_2}(d_{2j}). Soft versions of h and c make the gap differentiable w.r.t. θ, T:

h(x; f, T) = 1 / (1 + exp(f(x) - T)),
c(x, y) = 4 (h(x) - 1/2)(h(y) - 1/2).
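
For concreteness, the soft relaxation can be written out as below. This is only a sketch of the objective value (a gradient w.r.t. θ and T could then be taken by hand or with an autodiff library); the parametric projection f(x; θ) and all names are illustrative.

```python
import numpy as np

def soft_h(fx, T):
    """Soft threshold: 1 / (1 + exp(f(x) - T)), approaching the 0/1 step function."""
    return 1.0 / (1.0 + np.exp(fx - T))

def soft_c(fx, fy, T):
    """Soft pair agreement in [-1, 1]: 4 (h(x) - 1/2)(h(y) - 1/2)."""
    return 4.0 * (soft_h(fx, T) - 0.5) * (soft_h(fy, T) - 0.5)

def gap(theta, T, pos_pairs, neg_pairs, f):
    """TP - FP gap with a parametric projection f(x; theta); differentiable in theta, T."""
    pos = np.mean([soft_c(f(x, theta), f(y, theta), T) for x, y in pos_pairs])
    neg = np.mean([soft_c(f(x, theta), f(y, theta), T) for x, y in neg_pairs])
    return pos - neg

# Toy usage: a linear projection f(x; theta) = theta . x on 2-D points.
f = lambda x, theta: float(np.dot(theta, x))
pos = [(np.array([0.1, 0.0]), np.array([0.15, 0.05]))]
neg = [(np.array([0.1, 0.0]), np.array([2.0, 2.0]))]
print(gap(np.array([1.0, 0.0]), 0.5, pos, neg, f))
```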

Ensemble classifiers. A weighted embedding H(x) = [α_1 h_1(x), ..., α_M h_M(x)]. Each h_m(x) defines a classifier c_m. Together they form an ensemble classifier of similarity:

C(x, y) = sgn( Σ_{m=1}^{M} α_m c_m(x, y) ).

We will construct the ensemble of bit-valued h's by a greedy algorithm based on AdaBoost (operating on the corresponding c's).

Boosting [Schapire et al]. Assumes access to a weak learner that can, at every iteration m, produce a classifier c_m better than chance. Main idea: maintain a distribution W_m(i) on the training examples, and update it according to the prediction of c_m: if c_m(x_i) is correct, then W_{m+1}(i) goes down; otherwise, W_{m+1}(i) increases (steering c_{m+1} towards it). Our examples are pairs, and the weak classifiers are thresholded projections. To evaluate a threshold, we will calculate the weight of the separated pairs rather than count them.

BoostPro. Given pairs (x_i^(1), x_i^(2)) labeled with l_i = S(x_i^(1), x_i^(2)):
1: Initialize weights W_1(i) to uniform.
2: for all m = 1, ..., M do
3:   Find f, T = argmax_{f,T} r_m(f, T) using gradient descent on r_m(f, T) = Σ_{i=1}^{N} W_m(i) l_i c_m(x_i^(1), x_i^(2)).
4:   Set h_m <- h(x; f, T).
5:   Set α_m (see boosting papers).
6:   Update weights: W_{m+1}(i) ∝ W_m(i) exp( -α_m l_i c_m(x_i^(1), x_i^(2)) ).
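
The sketch below mirrors the structure of this loop under simplifying assumptions: the weak-learner search over (f, T) is replaced by picking the best of a few random thresholds for each candidate projection (rather than gradient descent), and α_m is set by the usual AdaBoost formula. It is an illustration, not the authors' implementation, and all names are made up.

```python
import numpy as np

def boostpro_sketch(pairs, labels, candidate_fs, M, rng):
    """pairs[i] = (x1, x2), labels[i] in {+1, -1}.  Returns a list of (alpha, f, T)."""
    labels = np.asarray(labels)
    N = len(pairs)
    W = np.ones(N) / N
    ensemble = []
    for m in range(M):
        best = None
        for f in candidate_fs:
            proj = np.array([[f(x1), f(x2)] for x1, x2 in pairs])
            for T in rng.choice(proj.ravel(), size=min(10, proj.size), replace=False):
                c = np.where((proj[:, 0] <= T) == (proj[:, 1] <= T), 1, -1)
                r = float(np.sum(W * labels * c))       # weighted gap r_m(f, T)
                if best is None or r > best[0]:
                    best = (r, f, T, c)
        r, f, T, c = best
        r = np.clip(r, -0.999, 0.999)                   # guard against a perfect weak classifier
        alpha = 0.5 * np.log((1 + r) / (1 - r))         # standard AdaBoost step size
        ensemble.append((alpha, f, T))
        W = W * np.exp(-alpha * labels * c)             # misclassified pairs gain weight
        W /= W.sum()
    return ensemble
```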

Similarity is a rare event. In many domains, the vast majority of possible pairs are negative: people's poses, image patches, documents, ... A reasonable approximation of the sampling process: independently draw x, y from the data distribution; f(x), f(y) are then drawn from p(f(x)); label (x, y) negative. [Figure: the distribution of f(x) split at the threshold T, with π_T = Pr(f(x) <= T); a random pair is split by T with probability 2 π_T (1 - π_T).]

Semi-supervised setup. Given similar pairs and unlabeled x_1, ..., x_N. Estimate the TP rate as before. FP rate: estimate the CDF of f(x); let π_T = P̂(f(x) <= T). Then FP = π_T^2 + (1 - π_T)^2. Note: this means FP >= 1/2 [Ke et al].
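
A tiny sketch of this estimate (illustrative names; standard normal data stands in for the unlabeled projected values):

```python
import numpy as np

def estimated_fp(f_values, T):
    """FP estimate from unlabeled data: with pi_T = P_hat(f(x) <= T), two
    independent draws land on the same side of T with probability
    pi_T^2 + (1 - pi_T)^2, which is always >= 1/2."""
    pi_T = np.mean(np.asarray(f_values) <= T)   # empirical CDF of f(x) at T
    return pi_T ** 2 + (1.0 - pi_T) ** 2

rng = np.random.default_rng(0)
print(estimated_fp(rng.standard_normal(10000), T=0.0))   # about 0.5 at the median
```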

BoostPro in a semi-supervised setup. Normal boosting assumes positive and negative examples. Intuition: each unlabeled example x_j represents all possible pairs (x_j, y); those are, with high probability, negative under our assumption. Two distributions: W_m(i) on positive pairs, as before; S_m(j) on unlabeled (single) examples. For the unlabeled examples, instead of a pair term c_m(x^(1), x^(2)), we use the expectation E_y[c_m(x_j, y)] to calculate the objective and set the weights.

Semi-supervised boosting: details. Probability of misclassifying a negative pair (x_j, y):

P_j = h(x_j; f, T) π + (1 - h(x_j; f, T)) (1 - π).

E_y[c(x_j, y)] = P_j (+1) + (1 - P_j) (-1) = 2 P_j - 1.

Modified boosting objective:

r = Σ_{i=1}^{N_p} W(i) c(x_i^(1), x_i^(2)) - Σ_{j=1}^{N} S(j) E_y[c(x_j, y)]
  = Σ_{i=1}^{N_p} W(i) c(x_i^(1), x_i^(2)) - Σ_{j=1}^{N} S(j) (2 P_j - 1).
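
Spelled out in code (a sketch only; the argument names are made up, and the weak classifier h is assumed to have already been evaluated on each unlabeled example):

```python
import numpy as np

def P_j(h_xj, pi):
    """Probability of misclassifying the (with high probability negative) pair (x_j, y):
    h(x_j) pi + (1 - h(x_j)) (1 - pi)."""
    return h_xj * pi + (1.0 - h_xj) * (1.0 - pi)

def semi_supervised_gap(c_pos, W, P_neg, S):
    """Modified objective: sum_i W(i) c(x_i^(1), x_i^(2)) - sum_j S(j) (2 P_j - 1)."""
    return float(np.sum(W * c_pos) - np.sum(S * (2.0 * P_neg - 1.0)))

# Toy usage: 3 positive pairs with outcomes +1, +1, -1; 2 unlabeled examples; pi = 0.3.
pi = 0.3
print(semi_supervised_gap(np.array([1, 1, -1]), np.ones(3) / 3,
                          P_j(np.array([1.0, 0.0]), pi), np.ones(2) / 2))
```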

Results: Toy problems. M = 100, trained on 600 unlabeled points + 1,000 similar pairs. [Figure.]

Results: Toy problems (continued). [Figure.]

Results: UCI data sets.

Data Set   L_1                 PSH                 BoostPro            M
MPG        3.9436 ± 5.276      0.768 ± 4.340       7.4905 ± 2.5907     80 ± 20
CPU        37.980 ± 5.2729     59.3767 ± 7.486     9.0846 ± 0.9953     5 ± 48
Housing    26.52 ± 6.8080      3.8464 ± 9.2756     3.8436 ± 8.488      20 ± 28
Abalone    4.786 ± 0.580       5.0842 ± 0.4960     4.7602 ± 0.4384     43 ± 8
Census     2.493e9 ± 3.3e8     2.237e9 ± 3.2e8     1.566e9 ± 2.4e8     49 ± 0

Test error on regression benchmark data from UCI/Delve: mean ± std. deviation of MSE using locally-weighted regression. The last column shows the values of M (the dimension of the embedding H). Similarity is defined in terms of target function values.

Results: pose retrieval. [Figure: input image and its 3 nearest neighbors in H.] H built with semi-supervised BoostPro on 200,000 examples; M = 400.

Results: pose retrieval. [Figure: recall for k-NN retrieval. For each value of k, the fraction of true k-NN w.r.t. pose found within the k-NN w.r.t. an image-based similarity measure is plotted. Black dotted line: L_1 on EDH. Red dash-dot: PSH. Blue solid: BoostPro, M = 1000.]

Visual similarity of image patches. Define two patches to be similar under any rotation and a mild shift (±4 pixels). This should be covered by many reasonable similarity definitions.

Descriptor 1: sparse overcomplete code. Generative model of patches [Olshausen & Field]. Very unstable under transformations, hence L_1 is not a good proxy for similarity. With BoostPro, improvement in area under the ROC curve from 0.56 to 0.68.

Descriptor 2: SIFT. Scale-Invariant Feature Transform [Lowe]. A histogram of gradient magnitude and orientation within the region, normalized by orientation and at an appropriate scale. Discriminative (cannot generate a patch). Designed specifically to make two similar patches match.

Results. Area under the ROC curve: L_1 on SIFT: 0.8794; L_1 on H(SIFT): 0.9633.

Conclusions. It is beneficial to learn similarity directly for the task rather than rely on the default distance. Key property of our approach: similarity search is reduced to L_1 neighbor search (and thus can be done in sublinear time). Most useful for: regression and multi-label classification; when large amounts of labeled data are available; when L_p distance is not a good proxy for similarity.

Open questions. What kinds of similarity concepts can we learn? How do we explore the space of projections more efficiently? Factorization of similarity. Combining unsupervised and supervised similarity learning for image regions.

Questions?