Nearest neighbor classification in metric spaces: universal consistency and rates of convergence


Nearest neighbor classification in metric spaces: universal consistency and rates of convergence. Sanjoy Dasgupta, University of California, San Diego.

Nearest neighbor. The primeval approach to classification. Given: a training set {(x_1, y_1), ..., (x_n, y_n)} consisting of data points x_i ∈ X and their labels y_i ∈ {0, 1}, and a query point x, predict the label of x by looking at its nearest neighbor among the x_i. How accurate is this method? What kinds of data is it well suited to?
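
As a concrete sketch (not from the slides), here is a minimal 1-NN rule in Python; the function name, the Euclidean default metric, and the toy data are illustrative assumptions.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query, metric=None):
    """Return the label of the training point nearest to x_query.

    metric: callable d(a, b) -> float; defaults to Euclidean distance.
    """
    if metric is None:
        metric = lambda a, b: np.linalg.norm(a - b)
    distances = [metric(x_i, x_query) for x_i in X_train]
    return y_train[int(np.argmin(distances))]

# Toy usage with made-up data.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 1.1])))   # -> 1
```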

A nonparametric estimator. Contrast with linear classifiers, which are also simple and general-purpose. Expressivity: what kinds of decision boundary can it produce? Consistency: as the number of points n increases, does the decision boundary converge? Rates of convergence: how fast does this convergence occur, as a function of n? Style of analysis.

The data space. Data points lie in a space X with distance function d : X × X → R. Most common scenario: X = R^p and d is Euclidean distance. Our setting: (X, d) is a metric space, e.g. ℓ_p distances, or metrics obtained from user preferences/feedback. Also of interest: more general distances, e.g. KL divergence and domain-specific dissimilarity measures.

Statistical learning theory setup. Training points come from the same source as future query points: there is an underlying measure µ on X from which all points are generated. We call (X, d, µ) a metric measure space. The label of x is a coin flip with bias η(x) = Pr(Y = 1 | X = x). A classifier is a rule h : X → {0, 1}. Misclassification rate, or risk: R(h) = Pr(h(X) ≠ Y). The Bayes-optimal classifier h*(x) = 1 if η(x) > 1/2, and 0 otherwise, has minimum risk R* = R(h*) = E_X[min(η(X), 1 − η(X))].
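
A quick numerical illustration of this setup (my own toy example, not from the talk): with an assumed η(x) = 0.2 + 0.6x and µ uniform on [0, 1], the Bayes risk E_X[min(η(X), 1 − η(X))] can be estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Assumed conditional probability Pr(Y = 1 | X = x) on X = [0, 1] (illustrative)."""
    return 0.2 + 0.6 * x

# Monte Carlo estimate of the Bayes risk R* = E_X[min(eta(X), 1 - eta(X))],
# with mu taken to be uniform on [0, 1].
X = rng.uniform(0, 1, size=200_000)
bayes_risk = np.mean(np.minimum(eta(X), 1 - eta(X)))
print(f"Estimated Bayes risk: {bayes_risk:.3f}")   # the exact value for this eta and mu is 0.35
```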

Questions of interest. Let h_n be a classifier based on n labeled data points from the underlying distribution; R(h_n) is a random variable. Consistency: does R(h_n) converge to R*? Rates of convergence: how fast does convergence occur? The smoothness of η(x) = Pr(Y = 1 | X = x) matters. [Figure: two plots of η(x) against x, illustrating different degrees of smoothness.] Questions of interest: Consistency without assumptions? A suitable smoothness assumption, and rates? Rates without assumptions, using distribution-specific quantities?

Talk outline 1. Consistency without assumptions 2. Rates of convergence under smoothness 3. General rates of convergence 4. Open problems

Consistency. Given n data points (x_1, y_1), ..., (x_n, y_n), how to answer a query x? 1-NN returns the label of the nearest neighbor of x amongst the x_i. k-NN returns the majority vote of the k nearest neighbors. k_n-NN lets k grow with n. 1-NN and k-NN (with fixed k) are not, in general, consistent. E.g. take X = R and η(x) ≡ η_o < 1/2. Every label is a coin flip with bias η_o. The Bayes risk is R* = η_o (always predict 0). 1-NN risk: what is the probability that two coins of bias η_o disagree? ER(h_n) = 2η_o(1 − η_o) > η_o. And k-NN has risk ER(h_n) = η_o + f(k) for some f(k) > 0. Henceforth h_n denotes the k_n-NN classifier, where k_n → ∞.
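
The constant-bias example is easy to check by simulation; the sketch below (mine, with illustrative parameters η_o = 0.3 and µ uniform on [0, 1]) compares the empirical 1-NN risk with 2η_o(1 − η_o) and the Bayes risk η_o.

```python
import numpy as np

rng = np.random.default_rng(1)
eta_o, n, n_test = 0.3, 5000, 20_000

# Points on X = R (uniform, though the positions are irrelevant here);
# every label is an independent coin flip with bias eta_o.
X_train = np.sort(rng.uniform(0, 1, n))
y_train = (rng.uniform(size=n) < eta_o).astype(int)
X_test = rng.uniform(0, 1, n_test)
y_test = (rng.uniform(size=n_test) < eta_o).astype(int)

# 1-NN prediction: locate the nearest training point in the sorted array.
idx = np.clip(np.searchsorted(X_train, X_test), 1, n - 1)
left_closer = np.abs(X_test - X_train[idx - 1]) < np.abs(X_test - X_train[idx])
pred = y_train[np.where(left_closer, idx - 1, idx)]

print("1-NN risk :", np.mean(pred != y_test))   # close to 2 * 0.3 * 0.7 = 0.42
print("Bayes risk:", eta_o)                     # always predicting 0 attains eta_o
```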

Consistency results under continuity. Assume η(x) = P(Y = 1 | X = x) is continuous. Let h_n be the k_n-NN classifier, with k_n → ∞ and k_n/n → 0. Fix and Hodges (1951): consistent in R^p. Cover and Hart (1965, 1967, 1968): consistent in any metric space. Proof outline: Let x be a query point and let x_(1), ..., x_(n) denote the training points ordered by increasing distance from x. Training points are drawn from µ, so the number of them in any ball B is roughly nµ(B). Therefore x_(1), ..., x_(k_n) lie in a ball centered at x of probability mass roughly k_n/n. Since k_n/n → 0, we have x_(1), ..., x_(k_n) → x. By continuity, η(x_(1)), ..., η(x_(k_n)) → η(x). By the law of large numbers, when tossing many coins of bias roughly η(x), the fraction of 1s will be approximately η(x). Thus the majority vote of their labels will approach h*(x).
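
For concreteness, a sketch (not from the slides) of a k_n-NN rule using the schedule k_n = ⌈√n⌉, one common choice satisfying k_n → ∞ and k_n/n → 0; the function name and toy data are illustrative assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query):
    """k_n-NN majority vote with k_n = ceil(sqrt(n)); the sqrt schedule is one
    common choice satisfying k_n -> infinity and k_n / n -> 0."""
    n = len(X_train)
    k = int(np.ceil(np.sqrt(n)))
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(y_train[nearest].mean() > 0.5)

# Toy usage with made-up 2-d data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
print(knn_predict(X_train, y_train, np.array([1.0, 1.0])))   # -> 1
```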

Universal consistency in R^p. Stone (1977): consistency in R^p assuming only measurability. Lusin's theorem: for any measurable η and any ε > 0, there is a continuous function that differs from it on at most an ε fraction of points (in µ-measure). Training points in the region where η and its continuous approximation differ (the red region in the figure) can cause trouble: what fraction of query points have one of these as their nearest neighbor? Geometric result: pick any set of points in R^p; then any one point is the nearest neighbor of at most 5^p other points. An alternative sufficient condition for arbitrary metric measure spaces (X, d, µ): that the fundamental theorem of calculus holds (in the form of the Lebesgue differentiation property; see below).

Universal consistency in metric spaces. Query x; training points by increasing distance from x are x_(1), ..., x_(n). 1. Earlier argument: under continuity, η(x_(1)), ..., η(x_(k_n)) → η(x). In this case, the k_n nearest neighbors are coins of roughly the same bias as x. 2. It suffices that average(η(x_(1)), ..., η(x_(k_n))) → η(x). 3. x_(1), ..., x_(k_n) lie in some ball B(x, r). For suitable r, they are random draws from µ restricted to B(x, r). 4. average(η(x_(1)), ..., η(x_(k_n))) is close to the average η in this ball: (1/µ(B(x, r))) ∫_{B(x,r)} η dµ. 5. As n grows, this ball B(x, r) shrinks. Thus it is enough that lim_{r→0} (1/µ(B(x, r))) ∫_{B(x,r)} η dµ = η(x). In R^p, this is Lebesgue's differentiation theorem.
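
Step 5 can be illustrated numerically; in the sketch below (my own, assuming µ uniform on [0,1]² and one particular smooth η) a Monte Carlo estimate of the ball average (1/µ(B(x,r))) ∫_{B(x,r)} η dµ approaches η(x) as r shrinks.

```python
import numpy as np

rng = np.random.default_rng(2)

def eta(z):
    """An assumed smooth conditional probability on X = [0,1]^2 (illustrative)."""
    return 1.0 / (1.0 + np.exp(-4.0 * (z[:, 0] + z[:, 1] - 1.0)))

x = np.array([0.3, 0.6])
samples = rng.uniform(0, 1, size=(2_000_000, 2))   # draws from mu = uniform on [0,1]^2

eta_x = eta(x[None, :])[0]
for r in [0.5, 0.2, 0.05, 0.01]:
    in_ball = np.linalg.norm(samples - x, axis=1) <= r
    ball_avg = eta(samples[in_ball]).mean()         # estimate of (1/mu(B)) * integral of eta over B
    print(f"r={r:5.2f}  ball average={ball_avg:.4f}  eta(x)={eta_x:.4f}")
```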

Universal consistency in metric spaces. Let (X, d, µ) be a metric measure space in which the Lebesgue differentiation property holds: for any bounded measurable f, lim_{r→0} (1/µ(B(x, r))) ∫_{B(x,r)} f dµ = f(x) for almost all (µ-a.e.) x ∈ X. If k_n → ∞ and k_n/n → 0, then R_n → R* in probability. If in addition k_n/log n → ∞, then R_n → R* almost surely. Examples of such spaces: finite-dimensional normed spaces; doubling metric measure spaces.

Talk outline 1. Consistency without assumptions 2. Rates of convergence under smoothness 3. General rates of convergence 4. Open problems

Smoothness and margin conditions. The usual smoothness condition in R^p: η is α-Hölder continuous if for some constant L, for all x, x′, |η(x) − η(x′)| ≤ L ‖x − x′‖^α. Mammen-Tsybakov β-margin condition: for some constant C, for any t, we have µ({x : |η(x) − 1/2| ≤ t}) ≤ C t^β. [Figure: η(x) with a width-t margin around the decision boundary η = 1/2.] Audibert-Tsybakov: suppose these two conditions hold, and that µ is supported on a regular set with 0 < µ_min < µ < µ_max. Then E R_n − R* is Ω(n^{−α(β+1)/(2α+p)}). Under these conditions, for suitable (k_n), this rate is achieved by k_n-NN.
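
For a feel of this rate (my own example parameters, not from the talk), the exponent α(β+1)/(2α+p) can be tabulated directly; note how it degrades as the dimension p grows.

```python
def minimax_exponent(alpha, beta, p):
    """Exponent e such that the excess risk scales like n**(-e) under alpha-Holder
    smoothness and the beta-margin condition in R^p."""
    return alpha * (beta + 1) / (2 * alpha + p)

for alpha, beta, p in [(1.0, 0.0, 1), (1.0, 1.0, 2), (0.5, 1.0, 10)]:
    print(f"alpha={alpha}, beta={beta}, p={p}  ->  n^(-{minimax_exponent(alpha, beta, p):.3f})")
```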

A better smoothness condition for NN. How much does η change over an interval? The usual notions relate this to the distance |x − x′|. For NN it is more sensible to relate it to the mass µ([x, x′]). We will say η is α-smooth in a metric measure space (X, d, µ) if for some constant L, for all x ∈ X and r > 0, |η(x) − η(B(x, r))| ≤ L µ(B(x, r))^α, where η(B) = average η in ball B = (1/µ(B)) ∫_B η dµ. If η is α-Hölder continuous in R^p and µ is bounded below, then η is (α/p)-smooth.
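
The quantities in this definition are easy to estimate by Monte Carlo; the sketch below (mine, with an assumed η(x) = x², µ uniform on [0, 1], and a guessed α) compares |η(x) − η(B(x, r))| with µ(B(x, r))^α.

```python
import numpy as np

rng = np.random.default_rng(3)

def eta(z):
    """Assumed conditional probability on X = [0, 1] (illustrative)."""
    return z ** 2

samples = rng.uniform(0, 1, size=1_000_000)   # draws from mu = uniform on [0, 1]
x, alpha = 0.4, 1.0                           # alpha is a guessed smoothness parameter

for r in [0.2, 0.1, 0.05]:
    in_ball = np.abs(samples - x) <= r
    mass = in_ball.mean()                     # estimate of mu(B(x, r))
    ball_avg = eta(samples[in_ball]).mean()   # estimate of eta(B(x, r))
    gap = abs(eta(x) - ball_avg)
    print(f"r={r:4.2f}  mu(B)={mass:.3f}  |eta(x)-eta(B)|={gap:.4f}  mu(B)^alpha={mass ** alpha:.3f}")
```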

Rates of convergence under smoothness. Let h_{n,k} denote the k-NN classifier based on n training points, and let h* be the Bayes-optimal classifier. Suppose η is α-smooth in (X, d, µ). Then for any n, k: 1. For any δ > 0, with probability at least 1 − δ over the training set, Pr_X(h_{n,k}(X) ≠ h*(X)) ≤ δ + µ({x : |η(x) − 1/2| ≤ C_1 √((1/k) ln(1/δ))}), under the choice k ∝ n^{2α/(2α+1)}. 2. E_n Pr_X(h_{n,k}(X) ≠ h*(X)) ≥ C_2 µ({x : |η(x) − 1/2| ≤ C_3 √(1/k)}). These upper and lower bounds are qualitatively similar for all smooth conditional probability functions: the probability mass of the width-(1/√k) margin around the decision boundary.
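
The common quantity in both bounds is the mass of the margin of width roughly 1/√k; a small sketch (my own, with an assumed η(x) = x, µ uniform on [0, 1], and constants omitted) estimates it for a few values of k.

```python
import numpy as np

rng = np.random.default_rng(4)
samples = rng.uniform(0, 1, size=1_000_000)   # draws from mu = uniform on [0, 1]
eta = samples                                  # assumed eta(x) = x, so the boundary is at x = 1/2

for k in [10, 100, 1000]:
    width = 1.0 / np.sqrt(k)                   # margin width ~ 1/sqrt(k); constants omitted
    margin_mass = np.mean(np.abs(eta - 0.5) <= width)
    print(f"k={k:5d}  width={width:.3f}  mu(margin)={margin_mass:.3f}")
```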

Talk outline 1. Consistency without assumptions 2. Rates of convergence under smoothness 3. General rates of convergence 4. Open problems

General rates of convergence. For sample size n, we can identify positive and negative regions that will reliably be classified. [Figure: decision boundary.] Probability radius: grow a ball around x until it has probability mass p: r_p(x) = inf{r : µ(B(x, r)) ≥ p}. Probability radius of interest: p = k/n. Reliable positive region: X⁺_{p,∆} = {x : η(B(x, r)) ≥ 1/2 + ∆ for all r ≤ r_p(x)}, where ∆ ≈ 1/√k. Likewise the negative region, X⁻_{p,∆}. Effective boundary: ∂_{p,∆} = X \ (X⁺_{p,∆} ∪ X⁻_{p,∆}). Roughly, Pr_X(h_{n,k}(X) ≠ h*(X)) ≲ µ(∂_{p,∆}).
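
The probability radius has a natural plug-in estimate, the distance from x to its ⌈pn⌉-th nearest training point; the sketch below (my own, assuming µ uniform on [0,1]²) illustrates it.

```python
import numpy as np

def empirical_probability_radius(X_train, x, p):
    """Plug-in estimate of r_p(x) = inf{r : mu(B(x, r)) >= p}: the distance from x
    to its ceil(p * n)-th nearest training point."""
    n = len(X_train)
    k = max(1, int(np.ceil(p * n)))
    dists = np.sort(np.linalg.norm(X_train - x, axis=1))
    return dists[k - 1]

rng = np.random.default_rng(5)
X_train = rng.uniform(0, 1, size=(10_000, 2))   # draws from mu = uniform on [0,1]^2
x = np.array([0.5, 0.5])
for p in [0.001, 0.01, 0.1]:
    # For this mu, the true r_p(x) solves pi * r^2 = p (away from the boundary of the square).
    print(f"p={p:6.3f}  estimated r_p(x) = {empirical_probability_radius(X_train, x, p):.3f}")
```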

Open problems 1. Necessary and sufficient conditions for universal consistency in metric measure spaces. 2. Consistency in more general topological spaces. 3. Extension to countably infinite label spaces. 4. Applications of convergence rates: active learning, domain adaptation,...

Thanks. To my coauthor Kamalika Chaudhuri, and to the National Science Foundation for support under grant IIS-1162581.