E0 370 Statistical Learning Theory                                Lecture 16 (Oct 25, 2011)

Consistency of Nearest Neighbor Methods

Lecturer: Shivani Agarwal                                         Scribe: Arun Rajkumar

1 Introduction

In this lecture we return to the study of consistency properties of learning algorithms, where we will be interested in the question of whether the generalization error of the function learned by an algorithm approaches the Bayes error in the limit of infinite data. In particular, we will consider consistency properties of the simple k-nearest neighbor (k-NN) classification algorithm; in the next lecture, we will investigate consistency properties of algorithms such as SVMs and AdaBoost that effectively minimize a convex upper bound on the zero-one loss.

We start with a brief review of statistical consistency and related results. Recall from Lecture 10 that for any $h : X \to Y$, $\mathcal{H} \subseteq Y^X$, loss function $\ell : Y \times Y \to [0, \infty)$, and probability distribution $D$ on $X \times Y$, the excess error of $h$ w.r.t. $D$ and $\ell$ can be decomposed as
\[
\mathrm{er}^{\ell}_D[h] - \mathrm{er}^{\ell,*}_D
  = \underbrace{\big(\mathrm{er}^{\ell}_D[h] - \mathrm{er}^{\ell}_D[\mathcal{H}]\big)}_{\text{estimation error}}
  + \underbrace{\big(\mathrm{er}^{\ell}_D[\mathcal{H}] - \mathrm{er}^{\ell,*}_D\big)}_{\text{approximation error}};
\tag{1}
\]
here $\mathrm{er}^{\ell,*}_D = \inf_{h : X \to Y} \mathrm{er}^{\ell}_D[h]$ and $\mathrm{er}^{\ell}_D[\mathcal{H}] = \inf_{h \in \mathcal{H}} \mathrm{er}^{\ell}_D[h]$. An algorithm $\mathcal{A} : \cup_{m=1}^{\infty} (X \times Y)^m \to \mathcal{H}$ which, given a training sample $S \in \cup_{m=1}^{\infty} (X \times Y)^m$, returns a function $h_S = \mathcal{A}(S) \in \mathcal{H}$ is statistically consistent in $\mathcal{H}$ w.r.t. $D$ and $\ell$ if for all $\epsilon > 0$,
\[
\mathbf{P}_{S \sim D^m}\big(\mathrm{er}^{\ell}_D[h_S] - \mathrm{er}^{\ell}_D[\mathcal{H}] \ge \epsilon\big) \to 0 \quad \text{as } m \to \infty.
\]
$\mathcal{A}$ is universally consistent in $\mathcal{H}$ w.r.t. $\ell$ if it is consistent in $\mathcal{H}$ w.r.t. $D, \ell$ for all distributions $D$. Similarly, an algorithm $\mathcal{A} : \cup_{m=1}^{\infty} (X \times Y)^m \to Y^X$ is Bayes consistent w.r.t. $D$ and $\ell$ if for all $\epsilon > 0$,
\[
\mathbf{P}_{S \sim D^m}\big(\mathrm{er}^{\ell}_D[h_S] - \mathrm{er}^{\ell,*}_D \ge \epsilon\big) \to 0 \quad \text{as } m \to \infty.
\]
$\mathcal{A}$ is universally Bayes consistent w.r.t. $\ell$ if it is Bayes consistent w.r.t. $D, \ell$ for all distributions $D$.

We showed in Lecture 10 that:

1. If $Y = \{\pm 1\}$, $\ell = \ell_{0\text{-}1}$, and $\mathrm{VCdim}(\mathcal{H}) < \infty$, then empirical risk minimization (ERM) in $\mathcal{H}$ is universally consistent in $\mathcal{H}$ (in fact, one gets a fixed sample size/rate of convergence for all distributions $D$); this is what allowed us to establish that $\mathrm{VCdim}(\mathcal{H})$ finite $\implies$ $\mathcal{H}$ learnable (by ERM in $\mathcal{H}$).

2. If $Y = \{\pm 1\}$, $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \ldots$, and $\mathrm{VCdim}(\mathcal{H}_i) \le \mathrm{VCdim}(\mathcal{H}_{i+1}) < \infty$ $\forall i$, then structural risk minimization (SRM) over $\cup_i \mathcal{H}_i$ is universally consistent in $\mathcal{H} = \cup_i \mathcal{H}_i$. This can yield universal consistency even for function classes $\mathcal{H}$ that do not have finite VC-dimension but that can be decomposed as a hierarchy of finite VC-dimension classes as above.^1 In particular, if $\cup_i \mathcal{H}_i$ has zero approximation error, then SRM over $\cup_i \mathcal{H}_i$ is universally Bayes consistent.

While SRM can achieve Bayes consistency, it is typically not computationally feasible. Therefore, it is of interest to ask whether one can achieve Bayes consistency using other algorithms.

^1 Note however that here we do not get a fixed sample size/convergence rate that works for all $D$; in fact, a fixed sample size in this case would contradict the result that $\mathcal{H}$ is learnable only if $\mathrm{VCdim}(\mathcal{H})$ is finite.
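To make point 1 of the recap concrete, here is a minimal simulation sketch (not part of the original notes). The distribution, the class of one-dimensional threshold classifiers, and all names below are illustrative assumptions; the point is only that ERM over a class of finite VC-dimension has estimation error that shrinks as the sample size $m$ grows.

```python
# A minimal simulation sketch (not from the notes); the distribution eta and the
# threshold class H below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def eta(x):                       # assumed conditional probability P(y = +1 | x)
    return 0.25 + 0.5 * (x > 0.4)

def sample(m):                    # draw (x, y) pairs with x ~ Uniform(0, 1)
    x = rng.uniform(0.0, 1.0, m)
    y = np.where(rng.uniform(size=m) < eta(x), 1, -1)
    return x, y

def h_t(t, x):                    # H = { x -> sign(x - t) : t in [0, 1] }, VCdim(H) = 1
    return np.where(x > t, 1, -1)

def true_error(t, grid):          # er_D[h_t], computed on a fine grid approximating E_x
    pred = h_t(t, grid)
    return np.mean(np.where(pred == 1, 1.0 - eta(grid), eta(grid)))

grid = np.linspace(0.0, 1.0, 20001)
thresholds = np.linspace(0.0, 1.0, 201)
er_H = min(true_error(t, grid) for t in thresholds)     # er_D[H]; here equal to er*_D

for m in [10, 100, 1000, 10000]:
    x, y = sample(m)
    # ERM: pick the threshold with the smallest training error
    t_hat = min(thresholds, key=lambda t: np.mean(h_t(t, x) != y))
    print(f"m = {m:6d}   estimation error ~ {true_error(t_hat, grid) - er_H:.4f}")
```

Because the Bayes classifier happens to lie in the chosen class here, the approximation error term of Eq. (1) is zero and the printed quantity is the entire excess error.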

We will be interested in studying Bayes consistency primarily for binary classification algorithms. Therefore, for the rest of the lecture, we will fix $Y = \{\pm 1\}$ and $\ell = \ell_{0\text{-}1}$, and will let $\mathrm{er}_D$ abbreviate $\mathrm{er}^{0\text{-}1}_D$. In this setting, define the conditional label probability $\eta : X \to [0, 1]$ as
\[
\eta(x) = \mathbf{P}(y = +1 \mid x). \tag{2}
\]
Note that $\eta$ depends on the distribution $D$. Now define $h^* : X \to \{\pm 1\}$ as
\[
h^*(x) =
\begin{cases}
+1 & \text{if } \eta(x) \ge \tfrac{1}{2} \\
-1 & \text{otherwise.}
\end{cases}
\tag{3}
\]
Then clearly^2
\[
\mathrm{er}_D[h^*] = \mathrm{er}^*_D. \tag{4}
\]
Any such classifier that achieves the Bayes error is called a Bayes classifier. In the following, we will consider the k-nearest neighbor algorithm and ask whether the error of the classifier returned by k-NN approaches the Bayes error $\mathrm{er}^*_D$.

2 k-Nearest Neighbor Algorithm

The k-nearest neighbor algorithm is one of the simplest machine learning algorithms for classification. The basic idea of the algorithm is the following: to predict the class label of a new instance, take a majority vote among the classes of the k points in the training sample that are closest to this instance. Of course, "closest" requires a notion of distance among instances. The most common setting is one where the instances are vectors in $\mathbb{R}^n$, and the distance used is the Euclidean distance.

Formally, let $X \subseteq \mathbb{R}^n$. The k-NN algorithm receives as input $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{\pm 1\})^m$ and produces a classification rule $h_S : X \to \{\pm 1\}$ given by the following (see also the code sketch after Theorem 2.1 below):
\[
h_S(x) =
\begin{cases}
+1 & \text{if } \sum_{i=1}^m y_i\, w_i(x) \ge 0 \\
-1 & \text{otherwise}
\end{cases}
\;=\;
\begin{cases}
+1 & \text{if } \sum_{i : y_i = +1} w_i(x) \ge \sum_{i : y_i = -1} w_i(x) \\
-1 & \text{otherwise,}
\end{cases}
\tag{5}
\]
where
\[
w_i(x) =
\begin{cases}
\tfrac{1}{k} & \text{if } x_i \text{ is one of the } k \text{ nearest neighbors of } x \\
0 & \text{otherwise.}
\end{cases}
\tag{6}
\]
In determining nearest neighbors, any fixed method can be used to break ties; e.g. a common method is to take the point with smaller index as the closer neighbor: if two training points $x_i, x_j$ are equidistant from the point $x$ to be classified, $\|x_i - x\| = \|x_j - x\|$, then $x_i$ is treated as being closer to $x$ than $x_j$ if $i < j$, and vice-versa. Note that the weights $w_i(x)$ defined above depend on all the training points $x_1, \ldots, x_m$.

The question of interest to us is: how does the k-NN algorithm behave in the limit of infinite data? We start with some classical results on convergence properties of the k-NN algorithm when k is fixed. Consider first the special case $k = 1$. Let
\[
\mathrm{er}^{\mathrm{NN}}_D = 2\, \mathbf{E}_x\big[\eta(x)\,(1 - \eta(x))\big]. \tag{7}
\]
Then we have the following classical result of Cover and Hart [1]:

Theorem 2.1 (Cover and Hart, 1967). Let $D$ be any probability distribution on $X \times \{\pm 1\}$. Then the 1-NN algorithm satisfies
\[
\mathbf{E}_{S \sim D^m}\big[\mathrm{er}_D[h_S]\big] \to \mathrm{er}^{\mathrm{NN}}_D \quad \text{as } m \to \infty.
\]
Moreover, the following relations hold:
\[
\mathrm{er}^*_D \;\le\; \mathrm{er}^{\mathrm{NN}}_D \;\le\; 2\, \mathrm{er}^*_D \big(1 - \mathrm{er}^*_D\big) \;\le\; 2\, \mathrm{er}^*_D.
\]

^2 Exercise: verify that $\mathrm{er}_D[h^*] \le \mathrm{er}_D[h]$ $\forall h : X \to Y$.
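The following is the code sketch referred to above: a direct, brute-force rendering of the k-NN rule in Eqs. (5)-(6) with the index-based tie-breaking just described. It is not from the original notes; the function name and the tiny example data are illustrative.

```python
# A brute-force sketch (not from the notes) of the k-NN rule in Eqs. (5)-(6);
# the function name and example data are illustrative.
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the {-1, +1} label of a single instance x by a k-NN majority vote.

    X_train: (m, n) array of training instances; y_train: (m,) array of +/-1 labels.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    # Sort by distance, breaking exact ties by smaller index
    # (np.lexsort uses the last key as the primary sort key).
    order = np.lexsort((np.arange(len(dists)), dists))
    neighbors = order[:k]                  # indices i with w_i(x) = 1/k; all others get 0
    vote = y_train[neighbors].sum()        # proportional to sum_i y_i w_i(x)
    return 1 if vote >= 0 else -1

# Tiny usage example
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [2.0, 2.0]])
y = np.array([-1, 1, 1, -1])
print(knn_predict(X, y, np.array([1.0, 0.9]), k=3))    # -> 1 (two of the 3 nearest are +1)
```

With `vote >= 0`, ties in the vote itself are resolved in favor of the label $+1$, matching the convention in Eq. (5).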

It follows from Theorem 2.1 that if $\mathrm{er}^*_D = 0$ (i.e. the labels $y$ are a deterministic function of the instances $x$), then the 1-NN algorithm is Bayes consistent w.r.t. $D$ and $\ell_{0\text{-}1}$. In general, however, $\mathrm{er}^{\mathrm{NN}}_D$ may be different from $\mathrm{er}^*_D$, and 1-NN need not be Bayes consistent.

A similar result holds for any fixed integer $k > 1$. In this case, let
\[
\mathrm{er}^{k\mathrm{NN}}_D =
\begin{cases}
\mathbf{E}_x\Big[ \sum_{j=0}^{k} \binom{k}{j}\, \eta(x)^j \big(1 - \eta(x)\big)^{k-j} \Big( \eta(x)\, \mathbf{1}\big(j < \tfrac{k}{2}\big) + \big(1 - \eta(x)\big)\, \mathbf{1}\big(j > \tfrac{k}{2}\big) \Big) \Big] & \text{if } k \text{ is odd} \\[4pt]
\mathrm{er}^{(k-1)\mathrm{NN}}_D & \text{if } k \text{ is even.}
\end{cases}
\]
Then we have:

Theorem 2.2. Let $D$ be any probability distribution on $X \times \{\pm 1\}$. Let $k > 1$. Then the k-NN algorithm satisfies
\[
\mathbf{E}_{S \sim D^m}\big[\mathrm{er}_D[h_S]\big] \to \mathrm{er}^{k\mathrm{NN}}_D \quad \text{as } m \to \infty.
\]
Moreover, the following relations hold:
\[
\mathrm{er}^*_D \;\le\; \ldots \;\le\; \mathrm{er}^{5\mathrm{NN}}_D \;\le\; \mathrm{er}^{3\mathrm{NN}}_D \;\le\; \mathrm{er}^{1\mathrm{NN}}_D \;\le\; 2\, \mathrm{er}^*_D. \tag{8}
\]

Again, if $\mathrm{er}^*_D = 0$, then the k-NN algorithm with fixed $k > 1$ is Bayes consistent w.r.t. $D$, but in general, even with a large (but fixed) value of $k$, the algorithm need not be Bayes consistent. Below we will see that by allowing the choice of $k$ to vary with the number of training examples $m$ (in particular, by allowing $k$ to increase slowly with $m$), the nearest neighbor algorithm can actually be made a universally Bayes consistent algorithm. This follows from the celebrated theorem of Stone [3], which was the first result to establish universal Bayes consistency for any learning algorithm for binary classification.

3 Stone's Theorem

We will describe Stone's theorem for a more general class of weighted-average plug-in classification methods, of which the nearest neighbor algorithm can be viewed as a special case.

3.1 Plug-in Classifiers

Recall the form of the Bayes classifier $h^*$ in Eq. (3). A natural approach to learning a classifier from a sample $S$ is then to approximate the conditional probabilities $\eta(x)$ via an estimate $\eta_S : X \to [0, 1]$, and to plug in this estimate into Eq. (3) to get a classifier of the form
\[
h_S(x) =
\begin{cases}
+1 & \text{if } \eta_S(x) \ge \tfrac{1}{2} \\
-1 & \text{otherwise.}
\end{cases}
\tag{9}
\]
Intuitively, one would expect that if $\eta_S$ is a good approximation to $\eta$, then the plug-in classifier $h_S$ above should also be a good approximation to $h^*$, i.e. should have generalization error close to the Bayes error. This is captured in the following result:

Theorem 3.1. Let $D$ be any probability distribution on $X \times \{\pm 1\}$, $\eta$ be the corresponding conditional probability function, $\eta_S : X \to [0, 1]$ be any estimate of $\eta$, and $h_S$ be defined as in Eq. (9). Then
\[
\mathrm{er}_D[h_S] - \mathrm{er}^*_D \;\le\; 2\, \mathbf{E}_x\big[\, |\eta_S(x) - \eta(x)| \,\big]. \tag{10}
\]

Proof. We have,
\begin{align}
\mathrm{er}_D[h_S] - \mathrm{er}^*_D
  &= \mathbf{E}_{(x,y) \sim D}\big[ \mathbf{1}(h_S(x) \ne y) - \mathbf{1}(h^*(x) \ne y) \big] \nonumber \\
  &= \mathbf{E}_x\, \mathbf{E}_{y \mid x}\big[ \mathbf{1}(y = +1)\big(\mathbf{1}(h_S(x) = -1) - \mathbf{1}(h^*(x) = -1)\big) + \mathbf{1}(y = -1)\big(\mathbf{1}(h_S(x) = +1) - \mathbf{1}(h^*(x) = +1)\big) \big] \tag{11} \\
  &= \mathbf{E}_x\big[ \eta(x)\big(\mathbf{1}(h_S(x) = -1) - \mathbf{1}(h^*(x) = -1)\big) + \big(1 - \eta(x)\big)\big(\mathbf{1}(h_S(x) = +1) - \mathbf{1}(h^*(x) = +1)\big) \big] \tag{12} \\
  &= \mathbf{E}_x\big[ \big(2\eta(x) - 1\big)\big(\mathbf{1}(h_S(x) = -1) - \mathbf{1}(h^*(x) = -1)\big) \big] \tag{13} \\
  &= \mathbf{E}_x\big[ \big|2\eta(x) - 1\big| \big( \mathbf{1}(h_S(x) = -1,\, h^*(x) = +1) + \mathbf{1}(h_S(x) = +1,\, h^*(x) = -1) \big) \big] \tag{14} \\
  &= \mathbf{E}_x\big[ \big|2\eta(x) - 1\big|\, \mathbf{1}(h_S(x) \ne h^*(x)) \big] \tag{15} \\
  &\le 2\, \mathbf{E}_x\big[ |\eta_S(x) - \eta(x)| \big], \tag{16}
\end{align}
since $h_S(x) \ne h^*(x) \implies |\eta_S(x) - \eta(x)| \ge |\eta(x) - \tfrac{1}{2}|$.

Corollary 3.2. Under the conditions of Theorem 3.1,
\[
\mathrm{er}_D[h_S] - \mathrm{er}^*_D \;\le\; 2\, \sqrt{ \mathbf{E}_x\big[ (\eta_S(x) - \eta(x))^2 \big] }.
\]

Proof. This follows directly from Theorem 3.1 by Jensen's inequality: $2\, \mathbf{E}_x |\eta_S(x) - \eta(x)| \le 2 \sqrt{ \mathbf{E}_x\big[ (\eta_S(x) - \eta(x))^2 \big] }$.

The above results imply that in order to prove Bayes consistency of a plug-in classifier $h_S$, it suffices to prove convergence of the corresponding conditional probability estimate $\eta_S$; in particular, it suffices to show either $\mathbf{E}_{S \sim D^m} \mathbf{E}_x |\eta_S(x) - \eta(x)| \to 0$ or $\mathbf{E}_{S \sim D^m} \mathbf{E}_x\big[ (\eta_S(x) - \eta(x))^2 \big] \to 0$ as $m \to \infty$.

3.2 Weighted-Average Plug-in Classifiers

Consider now a specific family of plug-in classifiers termed weighted-average plug-in classifiers, which estimate the conditional probability $\eta(x)$ via a weighted average of the training labels. Specifically, let $X \subseteq \mathbb{R}^n$. Given a sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{\pm 1\})^m$ and an instance $x \in X$, let
\[
\eta_S(x) = \sum_{i=1}^m \tilde{y}_i\, w_i(x), \tag{17}
\]
where $\tilde{y}_i = (y_i + 1)/2 \in \{0, 1\}$, and the weights $w_i(x)$, which can depend on $x_1, \ldots, x_m$, must be non-negative and sum to one: $w_i(x) \ge 0$ $\forall i, x$ and $\sum_{i=1}^m w_i(x) = 1$ $\forall x$. Then the weighted-average plug-in classifier $h_S$ with weights $w_i(x)$ is obtained by using this estimate $\eta_S$ in Eq. (9).

Clearly, the k-NN algorithm can be viewed as a special case of the weighted-average plug-in classifier, with weights given by $w_i(x) = 1/k$ if $x_i$ is one of the k nearest neighbors of $x$ in $S$, and $w_i(x) = 0$ otherwise. Stone's theorem establishes universal Bayes consistency of such classifiers provided the weights satisfy certain conditions; we will see that if $k$ is allowed to vary with $m$ such that $k_m \to \infty$ and $\frac{k_m}{m} \to 0$ as $m \to \infty$, then the weights associated with the resulting $k_m$-NN classifier satisfy the required conditions.^3

^3 The $k_m$-NN algorithm under these conditions on $k_m$ had previously been known to be Bayes consistent under certain regularity conditions on the distribution $D$, but Stone's theorem was the first to establish universal Bayes consistency.
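As a small aside (not part of the notes), the sketch below builds the weighted-average estimate $\eta_S$ of Eq. (17) with k-NN weights on a synthetic one-dimensional sample and numerically checks the bounds of Theorem 3.1 and Corollary 3.2; the distribution, sample size, value of $k$, and all names are illustrative assumptions.

```python
# A numerical sanity check (illustrative assumptions throughout) of Theorem 3.1 and
# Corollary 3.2 for the weighted-average plug-in estimate of Eq. (17) with k-NN weights.
import numpy as np

rng = np.random.default_rng(0)
eta = lambda x: 0.5 + 0.4 * np.sin(6 * x)                      # assumed P(y = +1 | x)

m, k = 2000, 25
x_train = rng.uniform(0.0, 1.0, m)
y_train = np.where(rng.uniform(size=m) < eta(x_train), 1, -1)
y_tilde = (y_train + 1) / 2                                    # map {-1, +1} -> {0, 1}

x = np.linspace(0.0, 1.0, 2001)                                # grid approximating E_x
# k-NN weights: w_i(x) = 1/k on the k nearest training points, 0 elsewhere (brute force)
nearest = np.argsort(np.abs(x[:, None] - x_train[None, :]), axis=1)[:, :k]
eta_S = y_tilde[nearest].mean(axis=1)                          # Eq. (17) with k-NN weights

h_S = np.where(eta_S >= 0.5, 1, -1)                            # plug-in classifier, Eq. (9)
h_star = np.where(eta(x) >= 0.5, 1, -1)                        # Bayes classifier, Eq. (3)
err = lambda h: np.mean(np.where(h == 1, 1.0 - eta(x), eta(x)))

excess = err(h_S) - err(h_star)
bound_l1 = 2 * np.mean(np.abs(eta_S - eta(x)))                 # Theorem 3.1
bound_l2 = 2 * np.sqrt(np.mean((eta_S - eta(x)) ** 2))         # Corollary 3.2
print(f"excess error {excess:.4f} <= {bound_l1:.4f} (Thm 3.1) <= {bound_l2:.4f} (Cor 3.2)")
```

By Jensen's inequality the second printed bound always dominates the first, and both dominate the excess error, up to the grid approximation of $\mathbf{E}_x$.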

3.3 Consistency of Weighted-Average Plug-in Classifiers: Stone's Theorem

Theorem 3.3 (Stone, 1977). Let $X \subseteq \mathbb{R}^n$, and let $h_S$ be a weighted-average plug-in classifier as above with weights $w_i$ such that for all distributions $\mu$ on $X$, the following conditions hold:

(i) There is a constant $c > 0$ such that for every non-negative measurable function $f : X \to \mathbb{R}$ with $\mathbf{E}_x[f(x)] < \infty$,
\[
\mathbf{E}_{x_1, \ldots, x_m, x}\Big[ \sum_{i=1}^m w_i(x)\, f(x_i) \Big] \;\le\; c\, \mathbf{E}_x\big[ f(x) \big]; \tag{18}
\]

(ii) For all $a > 0$,
\[
\mathbf{E}_{x_1, \ldots, x_m, x}\Big[ \sum_{i=1}^m w_i(x)\, \mathbf{1}\big(\|x_i - x\| > a\big) \Big] \to 0 \quad \text{as } m \to \infty; \tag{19}
\]
and

(iii)
\[
\mathbf{E}_{x_1, \ldots, x_m, x}\Big[ \max_{1 \le i \le m} w_i(x) \Big] \to 0 \quad \text{as } m \to \infty. \tag{20}
\]

Then an algorithm which, given $S$, outputs $h_S$ is universally Bayes consistent.

Proof. The proof proceeds by showing that
\[
\mathbf{E}_{S \sim D^m}\Big[ \mathbf{E}_x\big[ (\eta_S(x) - \eta(x))^2 \big] \Big] \to 0 \quad \text{as } m \to \infty. \tag{21}
\]
Towards this, let
\[
\hat{\eta}_S(x) = \sum_{i=1}^m w_i(x)\, \eta(x_i). \tag{22}
\]
Then
\begin{align}
\big(\eta_S(x) - \eta(x)\big)^2
  &= \big( \eta_S(x) - \hat{\eta}_S(x) + \hat{\eta}_S(x) - \eta(x) \big)^2 \tag{23} \\
  &\le 2 \Big( \big(\eta_S(x) - \hat{\eta}_S(x)\big)^2 + \big(\hat{\eta}_S(x) - \eta(x)\big)^2 \Big), \tag{24}
\end{align}
since $(a + b)^2 \le 2(a^2 + b^2)$. Consider first the first term in the RHS of Eq. (24); we will show this converges to 0 as $m \to \infty$. Writing $\mathbf{E}_{S,x}$ for the expectation over the sample $S$ and the test point $x$, we have
\begin{align}
\mathbf{E}_{S,x}\Big[ \big(\eta_S(x) - \hat{\eta}_S(x)\big)^2 \Big]
  &= \mathbf{E}_{S,x}\bigg[ \Big( \sum_{i=1}^m w_i(x)\big(\tilde{y}_i - \eta(x_i)\big) \Big)^2 \bigg] \tag{25} \\
  &= \mathbf{E}_{S,x}\bigg[ \sum_{i=1}^m \sum_{j=1}^m w_i(x)\, w_j(x) \big(\tilde{y}_i - \eta(x_i)\big)\big(\tilde{y}_j - \eta(x_j)\big) \bigg]. \tag{26}
\end{align}
Now for $i \ne j$,
\[
\mathbf{E}_{S,x}\Big[ w_i(x)\, w_j(x) \big(\tilde{y}_i - \eta(x_i)\big)\big(\tilde{y}_j - \eta(x_j)\big) \Big]
  = \mathbf{E}_{x_1, \ldots, x_m, x}\Big[ w_i(x)\, w_j(x)\, \mathbf{E}_{y_i, y_j \mid x_1, \ldots, x_m, x}\big[ \big(\tilde{y}_i - \eta(x_i)\big)\big(\tilde{y}_j - \eta(x_j)\big) \big] \Big] = 0, \tag{27}
\]
since $\mathbf{E}[\tilde{y}_i \mid x_i] = \eta(x_i)$ and the labels are conditionally independent given the instances.

Therefore we have
\begin{align}
\mathbf{E}_{S,x}\Big[ \big(\eta_S(x) - \hat{\eta}_S(x)\big)^2 \Big]
  &= \mathbf{E}_{S,x}\bigg[ \sum_{i=1}^m w_i^2(x) \big(\tilde{y}_i - \eta(x_i)\big)^2 \bigg] \tag{28} \\
  &\le \mathbf{E}_{S,x}\bigg[ \sum_{i=1}^m w_i^2(x) \bigg] \qquad \text{since } \big(\tilde{y}_i - \eta(x_i)\big)^2 \le 1 \tag{29} \\
  &\le \mathbf{E}_{S,x}\bigg[ \Big( \max_{1 \le i \le m} w_i(x) \Big) \sum_{j=1}^m w_j(x) \bigg] \tag{30} \\
  &= \mathbf{E}_{S,x}\Big[ \max_{1 \le i \le m} w_i(x) \Big] \qquad \text{since } \textstyle\sum_{j=1}^m w_j(x) = 1 \tag{31} \\
  &\to 0 \quad \text{as } m \to \infty, \text{ by condition (iii).} \tag{32}
\end{align}
The second term in the RHS of Eq. (24) can also be shown to converge to zero; this makes use of conditions (i) and (ii) (see [2] for details). Together with the above, this yields the desired result.

As noted above, it can be shown that when $k_m$ is chosen so that $k_m \to \infty$ and $\frac{k_m}{m} \to 0$ as $m \to \infty$, the $k_m$-NN classifier satisfies the conditions of Stone's theorem, thus establishing universal Bayes consistency for such classifiers. Details can be found in [2].

4 Additional Pointers

4.1 Classification is Easier than Regression

The following result holds for plug-in classifiers using a conditional probability estimate $\eta_S$ satisfying $\mathbf{E}_x\big[(\eta_S(x) - \eta(x))^2\big] \to 0$ as $m \to \infty$ (a small simulation sketch illustrating this phenomenon is given at the end of this section):

Theorem 4.1. Let $\eta_S$ be such that $\mathbf{E}_x\big[(\eta_S(x) - \eta(x))^2\big] \to 0$ as $m \to \infty$, and let
\[
h_S(x) =
\begin{cases}
+1 & \text{if } \eta_S(x) \ge \tfrac{1}{2} \\
-1 & \text{otherwise.}
\end{cases}
\]
Then
\[
\frac{\mathrm{er}_D[h_S] - \mathrm{er}^*_D}{\sqrt{\mathbf{E}_x\big[(\eta_S(x) - \eta(x))^2\big]}} \;\to\; 0 \quad \text{as } m \to \infty.
\]
In other words, the excess classification error of the plug-in classifier converges to zero faster than the $L_2$ estimation error of $\eta$. Note, however, that no bound on the rate at which the above ratio converges to zero can be given in general. Further details can be found in [2].

4.2 Slow Rates of Convergence

Theorem 4.2. Let $\{a_m\}$ be a sequence of positive numbers converging to 0, with $\tfrac{1}{16} \ge a_1 \ge a_2 \ge \ldots$. Then for any classification method, there exists a distribution $D$ on $X \times \{\pm 1\}$ with $\mathrm{er}^*_D = 0$ such that
\[
\mathbf{E}_{S \sim D^m}\big[ \mathrm{er}_D[h_S] \big] \;\ge\; a_m \quad \forall m.
\]
This implies that there are no universal rates of convergence w.r.t. $\mathrm{er}^*_D$ for any method; any such rates must come with some restrictions on $D$. See [2] for further details.
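Here is the simulation sketch referred to in Section 4.1 (not from the original notes). The one-dimensional distribution, the choice $k_m = \lceil\sqrt{m}\rceil$, the grid approximation of $\mathbf{E}_x$, and all names are illustrative assumptions; it shows the excess classification error of the $k_m$-NN plug-in classifier shrinking faster than the $L_2$ error of its conditional probability estimate, so their ratio tends to decrease with $m$.

```python
# An illustrative simulation (all choices are assumptions, not from the notes) of the
# phenomenon in Theorem 4.1: excess error of the k_m-NN plug-in classifier vs. the
# L2 error of its estimate eta_S of eta.
import numpy as np

rng = np.random.default_rng(1)
eta = lambda x: 0.5 + 0.4 * np.sin(6 * x)                     # assumed P(y = +1 | x)

x_test = np.linspace(0.0, 1.0, 501)                           # grid approximating E_x
bayes = np.mean(np.minimum(eta(x_test), 1.0 - eta(x_test)))   # er*_D

for m in [100, 1000, 10000]:
    k = int(np.ceil(np.sqrt(m)))                              # k_m -> inf and k_m / m -> 0
    x = rng.uniform(0.0, 1.0, m)
    y_tilde = (rng.uniform(size=m) < eta(x)).astype(float)    # labels mapped to {0, 1}
    # k_m-NN estimate eta_S at every test point (brute-force distance matrix)
    nearest = np.argsort(np.abs(x_test[:, None] - x[None, :]), axis=1)[:, :k]
    eta_hat = y_tilde[nearest].mean(axis=1)
    h = np.where(eta_hat >= 0.5, 1, -1)                       # plug-in classifier
    excess = np.mean(np.where(h == 1, 1.0 - eta(x_test), eta(x_test))) - bayes
    l2 = np.sqrt(np.mean((eta_hat - eta(x_test)) ** 2))
    print(f"m = {m:6d}   excess = {excess:.4f}   L2 error = {l2:.4f}   ratio = {excess / l2:.3f}")
```

The same loop also gives a crude empirical view of the $k_m$-NN consistency discussed in Section 3: the excess error itself goes to zero as $m$ grows.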

5 Next Lecture

In the next lecture, we will look at some recent results on consistency of learning algorithms that minimize a convex surrogate of the zero-one loss, such as support vector machines and AdaBoost.

References

[1] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[2] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[3] Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.