Sample Complexity of Learning Mahalanobis Distance Metrics. Nakul Verma, Janelia, HHMI


Mahalanobis Metric Learning. Comparing observations x1, x2 in feature space: the squared Euclidean distance ‖x1 − x2‖² weights all features equally, while the squared Mahalanobis distance (x1 − x2)ᵀ M (x1 − x2) compares features through a weighting mechanism M. Q: What should be the correct weighting M? A: Make it data-driven. Given data of interest, learn a metric (M) that helps in the prediction task.
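To make the contrast concrete, here is a minimal NumPy sketch (not from the talk) of the two distances; the matrix M below is an arbitrary illustrative weighting, not a learned one.

    import numpy as np

    def sq_euclidean(x1, x2):
        # squared Euclidean distance: all features weighted equally
        d = x1 - x2
        return float(d @ d)

    def sq_mahalanobis(x1, x2, M):
        # squared Mahalanobis distance: features weighted through M
        # (M should be positive semi-definite for a valid squared metric)
        d = x1 - x2
        return float(d @ M @ d)

    x1 = np.array([1.0, 2.0])
    x2 = np.array([2.0, 0.0])
    M = np.array([[2.0, 0.0],    # illustrative weighting: feature 1 counts double
                  [0.0, 1.0]])
    print(sq_euclidean(x1, x2))       # 5.0
    print(sq_mahalanobis(x1, x2, M))  # 6.0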

Learning a Mahalanobis Metric. Suppose we want M such that data from the same class are at distance at most U and data from different classes are at distance at least L (with U < L). Given two labelled samples from a sample S, consider the distance between the pair under M and the label agreement between the pair, and define a pairwise penalty function that charges M whenever a same-class pair is far apart or a different-class pair is close together. The total penalty over all pairs in S is the empirical error of M (the error of M over the sample S); its expectation over pairs from the (unseen) population is the generalization error of M.
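The slide's penalty function is not reproduced in this transcription; the sketch below uses a hypothetical hinge-style pairwise penalty with thresholds U and L to show how the empirical error over a sample S could be computed. Any Lipschitz penalty of this pairwise form fits the setup.

    import numpy as np
    from itertools import combinations

    def pair_penalty(dist, same_class, U=1.0, L=2.0):
        # hypothetical hinge-style penalty; the setup allows any Lipschitz pairwise penalty
        if same_class:
            return max(0.0, dist - U)   # same-class pairs should be within distance U
        return max(0.0, L - dist)       # different-class pairs should be at least L apart

    def empirical_error(X, y, M, U=1.0, L=2.0):
        # average pairwise penalty of weighting M over all pairs in the sample S
        total, count = 0.0, 0
        for i, j in combinations(range(len(X)), 2):
            d = X[i] - X[j]
            dist = float(d @ M @ d)     # squared Mahalanobis distance under M
            total += pair_penalty(dist, y[i] == y[j], U, L)
            count += 1
        return total / count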

Statistical Consistency of Metric Learning. Let M* be the best possible metric on the population, and let M_S be the best possible metric on a sample S of size m drawn independently from the population. Questions we want to answer: (i) Does the performance of M_S converge to that of M* as m grows? (consistency) (ii) At what rate does it converge? (finite-sample rates) (iii) What factors affect the rate? (data dimension, feature information content)

What we show: Theorem 1. Given a D-dimensional feature space, for any λ-Lipschitz penalty function and any sample size m, with probability at least 1 − δ over the draw of the sample, the empirical error of every weighting M is within a term of order √(D/m) of its generalization error (a plausible form of the bound is sketched below). If we want this deviation to be at most ε, then a sample size m of order D/ε² suffices. This gives us consistency as well as a rate! Question: Is the dependence of the convergence rate on the data dimension D tight?
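The exact inequality on the slide was not captured in this transcription; a plausible shape, consistent with the m of order D/ε² sample requirement stated above (constants, the precise norm restriction on M, and logarithmic factors are in the paper), is:

    \[
    \sup_{M}\;\bigl|\operatorname{err}(M) - \operatorname{err}_{S_m}(M)\bigr|
    \;\lesssim\; \lambda\sqrt{\tfrac{D}{m}} \;+\; \sqrt{\tfrac{\ln(1/\delta)}{m}}
    \qquad \text{with probability at least } 1-\delta .
    \]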

What we show: Theorem 2. Given a D-dimensional feature space, for any metric learning algorithm A that (given a sample S_m) returns a weighting metric, there exists a λ-Lipschitz penalty function and a data distribution such that, for all sufficiently small ε and δ, if the sample size m is below order D/ε², then the metric returned by A fails to be ε-accurate with probability at least δ. So the dependence on the representation dimension D is tight! Remark: this is a worst-case analysis in the absence of any other information about the data distribution. Can we refine our results if we know something about the quality of our feature set?

Quantifying Feature-Set Quality. Observation: not all features are created equal; each feature carries a different amount of information for the prediction task. Fix a particular prediction task T, and let M* be the optimal feature weighting for task T. Define the metric learning complexity d* for task T as the (norm-based) complexity of M*; d* is unknown a priori. Question: Can we get a sample complexity rate that depends only on d*?

What we show: Theorem 3. Given a D-dimensional feature space and a prediction task T with (unknown) metric learning complexity d*, for any λ-Lipschitz penalty function and any sample size m, with probability at least 1 − δ over the draw of the sample, a norm-regularized learner has its deviation between empirical and generalization error controlled by d* rather than by the ambient dimension D. Take-home message: regularization can help adapt to the unknown metric learning complexity!

Empirical Evaluation. We want to study: given a dataset with small metric learning complexity but high representation dimension, how do regularized vs. unregularized metric learning algorithms fare? Approach: pick benchmark datasets of low dimensionality (d) and augment each dataset with a large number (D dimensions) of correlated noise features. UCI datasets and their original dimensions (d): Iris (4), Wine (13), Ionosphere (34). For each original sample x_i, the augmented sample is x_i' = [x_i ; x_noise], so we can now control the signal-to-noise ratio (a sketch of the augmentation step follows below). We then study the prediction accuracy of regularized and unregularized metric learning algorithms as a function of the noise dimension.
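A minimal sketch of the augmentation step, as described above; the specific noise model (i.i.d. Gaussian coordinates mixed by a random matrix to make them correlated) is an assumption for illustration, not the paper's exact construction.

    import numpy as np

    def augment_with_noise(X, noise_dim, rng=None):
        # Append noise_dim correlated noise features to each sample x_i,
        # giving x_i' = [x_i ; x_noise]; the signal lives only in the first d dims,
        # while the representation dimension grows to d + noise_dim.
        rng = np.random.default_rng(rng)
        n, d = X.shape
        mixing = rng.standard_normal((noise_dim, noise_dim))
        noise = rng.standard_normal((n, noise_dim)) @ mixing   # correlated noise dims
        return np.hstack([X, noise])

One could then run a regularized and an unregularized metric learner on augment_with_noise(X, k) for increasing k and compare held-out accuracy.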

Empirical Evaluation [results figure: prediction accuracy of regularized vs. unregularized metric learning as a function of noise dimension]

Theorem 1 (restated). Given a D-dimensional feature space, for any λ-Lipschitz penalty function and any sample size m, with probability at least 1 − δ over the draw of the sample, the empirical and generalization errors of every weighting M agree to within a term of order √(D/m); to make this deviation at most ε we need m of order D/ε² samples. This gives us consistency as well as a rate! How can we prove this?

Proof Idea (Theorem 1). We want to find a sample size m such that, for any weighting M, the empirical performance of M is close to the generalization performance of M; then choosing the best M on the sample will have close-to-best generalization performance. Try 1 (covering argument): Fix a weighting metric M and define the random variable Z as the penalty M incurs on a random pair. Since the penalty is bounded per example pair, Z is a bounded random variable, so by Hoeffding's bound, with probability 1 − δ over the draw of S_m, the empirical average for this fixed M is close to its expectation. But we want a similar result to hold simultaneously for all M, which requires covering the (continuous) collection of all weightings.
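A small simulation (synthetic, not from the talk) illustrating the fixed-M step: for one fixed weighting, the bounded pairwise penalties averaged over m random pairs concentrate around their expectation as m grows, which is exactly what Hoeffding's bound guarantees.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 5
    M = np.eye(D)                      # one fixed weighting metric (identity, for illustration)

    def penalty(x1, x2, same):
        # a bounded (clipped to [0, 1]) pairwise penalty, as Hoeffding's bound requires
        dist = float((x1 - x2) @ M @ (x1 - x2))
        raw = max(0.0, dist - 1.0) if same else max(0.0, 2.0 - dist)
        return min(1.0, raw)

    for m in [100, 1000, 10000]:
        vals = [penalty(rng.standard_normal(D), rng.standard_normal(D), rng.random() < 0.5)
                for _ in range(m)]
        print(m, np.mean(vals))        # the empirical mean stabilizes as m grows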

Proof Idea (Theorem 1), continued. Try 2 (VC argument): If we view metric learning as classification, we can apply VC-style results. Recall: given a binary classification class F, with probability 1 − δ over the draw of S_m, the gap between empirical and true error of every f in F is controlled by the VC dimension of F. Recall also that VC theory applies only to binary classification, so take the penalty function to be a binary threshold on distance and let F be the class of threshold classifiers induced by weightings M: a '+' labeling of a pair means its M-distance is small, a '−' labeling means it is large. The relevant quantity is then the maximum number of pairs that can attain all possible labelings from F (the VC complexity of ellipsoids). Drawbacks: (a) this only works for threshold penalties on the distance, and (b) it cannot adapt to the quality of the feature space.

Proof Idea (Theorem 1), continued. Try 3 (Rademacher complexity argument): The Rademacher complexity of a class F measures how well some f in F can correlate with random binary noise in {−1, +1}. With probability 1 − δ over the draw of S_m, for all f in F, the gap between empirical and true error is bounded by (roughly) the Rademacher complexity of F plus a confidence term. For metric learning, take F to be the class of pairwise losses induced by scale-restricted metrics M with ‖MᵀM‖² ≤ D; its Rademacher complexity can be bounded, giving the √(D/m) rate. Advantages: (a) this works for any Lipschitz penalty, and (b) it can adapt to the quality of the feature space.
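As an illustration of "correlating with binary noise", here is a Monte Carlo estimate of the empirical Rademacher complexity for a finite set of candidate loss vectors; restricting to a finite candidate set is a simplification, since the actual class of metrics is continuous and is handled analytically in the paper.

    import numpy as np

    def empirical_rademacher(pair_losses, n_trials=200, rng=None):
        # pair_losses: (n_functions, m) matrix; entry [f, i] = loss of candidate f on pair i.
        # Estimate E_sigma[ max_f (1/m) sum_i sigma_i * loss(f, pair_i) ].
        rng = np.random.default_rng(rng)
        n_f, m = pair_losses.shape
        total = 0.0
        for _ in range(n_trials):
            sigma = rng.choice([-1.0, 1.0], size=m)      # random binary noise
            total += np.max(pair_losses @ sigma) / m     # best correlation over the class
        return total / n_trials

    # e.g. rows = losses of a few candidate metrics M on the same m pairs:
    losses = np.random.default_rng(1).uniform(0, 1, size=(10, 500))
    print(empirical_rademacher(losses))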

Theorem 2 (restated). Given a D-dimensional feature space, for any metric learning algorithm A that (given a sample S_m) returns a weighting metric, there exists a λ-Lipschitz penalty function and a distribution such that, for all sufficiently small ε and δ, a sample size of order D/ε² is necessary for A to return an ε-accurate metric. The dependence on the representation dimension D is tight! How can we prove this?

Proof Idea (Theorem 2). Try 1 (VC argument, treating metric learning as classification): If we can lower bound the VC dimension of the induced classification problem, a standard construction gives a specific distribution on which order VC-dim/ε² samples are needed to get accuracy within ε. But since we work with pairs of points, the specific distribution required by the VC argument never actually occurs (we would need it to be a product distribution over pairs). Try 2 (our approach: deconstruct the VC argument): We use the probabilistic method. Create a collection of distributions such that, if one of them is chosen at random, the expected generalization error of the metric M returned by A is large; hence there is some distribution in the collection on which A has large error. These distributions are constructed so that metric learning acts as classification.

Proof Idea (Theorem 2), continued. Construction: place point masses on the vertices of a regular simplex. Collection of distributions: each vertex is labeled + or − at random with bias ½ + ε, and the loss function uses thresholds U = 0, L = 1. Key insight: for this collection of distributions and this loss function, the problem reduces to binary classification in the product space. So, for m i.i.d. samples from a randomly selected distribution in the collection, any empirical-error-minimizing algorithm requires m of order D/ε². How? Calculate the minimum number of samples required to distinguish the bias of two coins, and repeat the argument for roughly D/2 pairs of vertices. Other possible approaches: use information-theoretic arguments (e.g. Fano's inequality) to establish the minimum number of samples needed to distinguish a good metric from bad ones.
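The coin-bias step can be checked numerically; in this rough sketch (arbitrary constants, not from the talk), the probability of misidentifying a bias of ±ε stays well above zero until the number of flips reaches order 1/ε².

    import numpy as np

    def error_distinguishing_bias(eps, m, trials=2000, rng=None):
        # Flip a coin with bias 1/2 + eps or 1/2 - eps (chosen at random) m times,
        # guess the sign of the bias from the empirical frequency, and report
        # how often the guess is wrong. The error stays large until m ~ 1/eps^2.
        rng = np.random.default_rng(rng)
        sign = rng.choice([-1, 1], size=trials)
        flips = rng.random((trials, m)) < 0.5 + sign[:, None] * eps
        guess = np.where(flips.mean(axis=1) > 0.5, 1, -1)
        return float(np.mean(guess != sign))

    eps = 0.05
    for m in [10, 100, int(1 / eps**2), int(10 / eps**2)]:
        print(m, error_distinguishing_bias(eps, m))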

Theorem 3 (restated). Given a D-dimensional feature space and a prediction task T with (unknown) metric learning complexity d*, for any λ-Lipschitz penalty function and any sample size m, with probability at least 1 − δ over the draw of the sample, a norm-regularized learner has its deviation between empirical and generalization error controlled by d* rather than by D. Take-home message: regularization can help adapt to the unknown metric learning complexity!

Proof Idea (Theorem 3). Using the Rademacher complexity argument, we have already shown a deviation bound of order √(D/m), with probability 1 − δ over the draw of a sample of size m, for metrics with ‖MᵀM‖² ≤ D. If we knew that M* has small norm (say d << D), we would be done, but we don't know the norm of the best metric a priori. We use a refinement trick: consider the nested norm classes ‖MᵀM‖² ≤ 1, ‖MᵀM‖² ≤ 2, ..., up to ‖MᵀM‖² ≤ D. Observation: we are allowed to fail a δ fraction of the time, and we can distribute this failure probability over the classes, allotting δ/D to each. Then, simultaneously for all d ≤ D and all M_d with ‖M_dᵀM_d‖² ≤ d, a deviation bound of order √(d/m) holds.

Proof Idea (Theorem 3), continued. So, if the algorithm picks the metric that minimizes the empirical error plus the deviation penalty of its norm class, then with probability 1 − δ its generalization error is within the Theorem 3 bound of the best achievable (a schematic of this selection rule is sketched below).
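A schematic of this selection rule in structural-risk-minimization style; learn_in_class, empirical_error, and the exact penalty form are placeholders for illustration, not the paper's algorithm.

    import math

    def deviation_penalty(d, m, D, delta, lam=1.0):
        # deviation term allotted to norm class d, with failure probability delta/D per class;
        # the exact form here is a placeholder consistent with a sqrt(d/m)-type rate
        return lam * math.sqrt(d / m) + math.sqrt(math.log(D / delta) / m)

    def regularized_pick(X, y, m, D, delta, learn_in_class, empirical_error):
        # learn_in_class(X, y, d): hypothetical routine returning the best metric in norm class d
        # empirical_error(X, y, M): average pairwise penalty of M on the sample (as sketched earlier)
        best_M, best_score = None, float("inf")
        for d in range(1, D + 1):
            M_d = learn_in_class(X, y, d)
            score = empirical_error(X, y, M_d) + deviation_penalty(d, m, D, delta)
            if score < best_score:
                best_M, best_score = M_d, score
        return best_M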

Comparison with previous results.
Convergence rate (upper bound): previous results hold for thresholds on convex losses and for stable, regularized algorithms; our results hold for general Lipschitz losses with ERM (Theorem 1).
Convergence rate (lower bound): no previously known results; we show that, in the absence of any other information, there exists a Lipschitz loss for which ERM needs a sample size of order D/ε² (Theorem 2).
Data complexity d*: no previously known results; we give rates for general Lipschitz losses with regularized ERM that adapt to d* (Theorem 3).

Open problems: Analysis of metric learning in the online and active learning frameworks? Non-linear metric learning? Structured metric learning (ranking problems, clustering problems, etc.)?

Questions / Discussion

Thank You!