Nearest Neighbor Pattern Classification
T. M. Cover and P. E. Hart
May 15, 2018

1 Introduction

The nearest neighbor rule (NN) is the simplest nonparametric decision procedure: it assigns to an unclassified observation (the incoming test sample) the class/category/label of the nearest sample in the training set, nearness being measured by a metric. In this paper we show that in the large-sample case, i.e. as the size of the training set grows to infinity, the probability of error R of the NN rule is bounded below by the Bayes probability of error R* and above by 2R*(1 - R*).

2 The Nearest Neighbor Rule

We are given a set of n pairs (x_1, θ_1), ..., (x_n, θ_n), where the x_i take values in a metric space X equipped with a metric d. The category θ_i of the i-th sample (or individual) is drawn from the finite set {1, 2, ..., M}, and x_i is the measurement made on that sample. We say "x_i belongs to θ_i" as shorthand for "the i-th sample has measurement x_i and category θ_i". For a newly arriving pair (x, θ), our goal is to estimate θ, since only x is observable; the estimate uses the fact that we already have a set of n correctly classified points. We call x'_n ∈ {x_1, x_2, ..., x_n} a nearest neighbor of x if

    d(x'_n, x) = min_{i=1,...,n} d(x_i, x).
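To make the search concrete, here is a minimal sketch of the nearest-neighbor lookup just defined (my own illustration, not code from the paper), assuming X = R^2 with Euclidean distance as the metric d; the toy points and labels are made up:

    # Minimal 1-NN search: return the index i minimizing d(x_i, x).
    import math

    def euclidean(a, b):
        # the metric d, here assumed to be Euclidean distance on R^2
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def nearest_neighbor(x, samples, d=euclidean):
        # x'_n is the training point closest to x under the metric d
        return min(range(len(samples)), key=lambda i: d(samples[i], x))

    xs = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.5)]   # measurements x_i (hypothetical)
    thetas = [1, 1, 2]                          # categories theta_i in {1, ..., M}
    x = (2.5, 0.0)                              # new, unclassified observation
    i = nearest_neighbor(x, xs)
    print("nearest index:", i, "-> assigned category:", thetas[i])

The decision step of the NN rule, described next, simply returns thetas[i].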

The nearest neighbor rule then decides that x belongs to the category θ'_n of its nearest neighbor. The 1-NN rule uses only the single nearest neighbor and ignores all the others; more generally, the k-NN rule assigns x to the category holding the majority vote among its k nearest neighbors.

Example: let M = 2 (two categories), say green squares and blue hexagons, and suppose a new measurement (the red star in the figure) is to be classified.
1-NN rule: it is classified as a blue hexagon.
3-NN rule: it is classified as a blue hexagon.
5-NN rule: it is classified as a green square.

3 The Admissibility of the Nearest Neighbor Rule

When a large number of samples is available, it seems sensible not to rely on the single nearest neighbor (1-NN) but to classify according to the majority vote of the nearest k neighbors (k-NN).
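A matching sketch of the k-NN majority-vote rule (again an illustration under the same Euclidean-metric assumption; the data below are hypothetical, not the points in the figure):

    # k-NN: classify x by the majority category among its k nearest training points.
    import math
    from collections import Counter

    def knn_classify(x, samples, labels, k):
        dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        order = sorted(range(len(samples)), key=lambda i: dist(samples[i], x))
        votes = Counter(labels[i] for i in order[:k])   # majority vote over the k nearest
        return votes.most_common(1)[0][0]

    xs = [(0.0, 0.0), (0.2, 0.1), (1.0, 0.0), (1.1, 0.2), (1.2, -0.1)]
    thetas = ["hexagon", "hexagon", "square", "square", "square"]
    x = (0.4, 0.0)
    print(knn_classify(x, xs, thetas, k=1), knn_classify(x, xs, thetas, k=5))

As in the example above, the decision can flip as k grows, which is exactly the tension discussed in the next section.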

(What is the problem with a very large k? What is the problem with a very small k?)
(*) Picking a very large k weakens the influence of nearness: the label assigned to a new sample depends on the whole training set. In the extreme, if k equals the number of samples, every new sample is labeled by the overall majority of the existing labels. (Weighted k-NN can mitigate this.)
(*) Picking a very small k lets outliers dominate the labeling. (And what is the problem with an even k? Ties in the vote.)

The purpose of this section is to show that, within the class of k-NN rules, the single nearest neighbor rule (1-NN) is admissible: there are cases in which 1-NN has strictly lower probability of error than any other k-NN rule.

Example:
(*) The prior probabilities are η_1 = η_2 = 1/2 (the probabilities of drawing from each of the two densities).
(*) The conditional densities f_1, f_2 are uniform over unit disks D_1, D_2 centered at (-3, 0) and (3, 0).

Figure 1: Admissibility of the nearest neighbor rule.

The probability that exactly j of the n samples come from category 1, and hence have measurements lying in D_1, is the binomial probability (1/2)^n C(n, j). Without loss of generality, assume that the unclassified x lies in category 1 (and hence in D_1). Then 1-NN makes a classification error only if the nearest neighbor x'_n belongs to category 2 and thus lies in D_2. But from inspection of the distance relationships, if the nearest neighbor to x lies in D_2, then every x_i must lie in D_2. Why?
1 - If at least one of the samples x_i lies in D_1, then 1-NN makes no classification error: since x itself lies in D_1, every point of D_1 is closer to x than any point of D_2, so the nearest neighbor lies in D_1 and x is classified as category 1 (as shown in Fig. 2).
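A quick numerical check of the binomial probability just stated (my own sketch; the disk geometry itself is irrelevant here because only the category counts matter):

    # P(exactly j of the n samples fall in D_1) = C(n, j) * (1/2)^n, checked by simulation.
    import math, random

    def p_exact(n, j):
        return math.comb(n, j) * 0.5 ** n

    def p_exact_mc(n, j, trials=200_000):
        hits = sum(sum(random.random() < 0.5 for _ in range(n)) == j for _ in range(trials))
        return hits / trials

    n, j = 10, 3
    print(round(p_exact(n, j), 4), round(p_exact_mc(n, j), 4))   # analytic vs Monte Carlo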

Figure 2: x in D_1 together with another sample in D_1; the 1-NN rule classifies x as category 1.

2 - To make a classification error we need the complementary event: none of the samples x_i lies in D_1, i.e. all n individuals come from category 2. Then 1-NN classifies x according to its nearest neighbor, which lies in D_2 (as shown in Fig. 3). Thus the probability of error P_e(1; n) of the 1-NN rule in this example is exactly (1/2)^n, the probability that x_1, x_2, ..., x_n all lie in D_2.

Figure 3: x in D_1 with all samples in D_2; the 1-NN rule classifies x as category 2.

In general, let k = 2k_0 + 1 be odd. The k-NN rule makes an error whenever k_0 or fewer points lie in D_1, which occurs with probability

    P_e(k; n) = (1/2)^n Σ_{j=0}^{k_0} C(n, j).

(Why? Because k is odd and k_0 is less than half of k; the k-NN rule follows the majority vote, so if k_0 or fewer points lie in D_1 the majority of the k nearest neighbors lies in D_2 and the classification is wrong.) We see that the 1-NN rule has strictly lower P_e than any other k-NN rule: the probability of error increases as k increases, so in this example we prefer 1-NN (as shown in Fig. 4).

Finally, we note some properties of P_e(k; n):
1 - P_e(k; n) increases toward 1/2 as k grows, for any fixed n (why? see Fig. 4).
2 - P_e(k; n) → 0 as n → ∞, for any fixed k > 0 (as shown in Fig. 5).
3 - P_e(k_n; n) → 0 whenever 0 < k_n/n ≤ α < 1 for all n: the error still vanishes as n → ∞ even if k_n grows with n, provided it grows no faster than a fixed fraction of n (why? this follows from the definition of P_e(k_n; n)).
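A small sketch (mine, not the paper's) that evaluates P_e(k; n) for this two-disk example and illustrates properties 1-3 numerically:

    # P_e(k; n) = (1/2)^n * sum_{j=0}^{k0} C(n, j), where k = 2*k0 + 1 (two-disk example).
    from math import comb

    def p_error(k, n):
        assert k % 2 == 1 and 1 <= k <= n, "k must be odd and at most n"
        k0 = (k - 1) // 2
        return sum(comb(n, j) for j in range(k0 + 1)) / 2 ** n

    n = 15
    print([round(p_error(k, n), 4) for k in (1, 3, 5, 7, 15)])      # property 1: grows toward 1/2 in k
    print([round(p_error(3, m), 6) for m in (5, 10, 20, 40)])       # property 2: -> 0 in n for fixed k
    print([round(p_error(2 * (m // 4) + 1, m), 6) for m in (20, 40, 80)])  # property 3: k_n ~ n/2

The third line keeps k_n at roughly n/2, so k_n/n ≤ 1/2 < 1, and the error still vanishes.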

Figure 4: P_e(k; n) as a function of k. For fixed n, P_e increases monotonically in k and is bounded by 1/2, the worst case, which amounts to pure guessing.

Figure 5: P_e(k; n) as a function of k for several values of n. For a fixed k, the error P_e decreases as n increases.

In general, then, the 1-NN rule is strictly better than the other k-NN rules in those cases where the supports of the densities f_1, f_2, ..., f_M are separated so that every between-class distance is greater than every in-class distance (as in the two-disk example, where the disks lie farther apart than their diameters).

4 Bayes Procedure

In Bayesian decision theory the basic idea for minimizing errors is to choose the least risky class, i.e. the class for which the expected loss is smallest; equivalently, to minimize the probability of error when classifying a given observation x into one of M categories/classes. All the underlying statistics are assumed known. Let x denote the measurements on an individual and X the sample space of possible values of x; we refer to x as the observation. We aim to build a decision rule that maps x to one of the M categories with the least probability of error. For the purpose of defining the Bayes risk, we set up a two-class example.

4.1 2-class Bayes decision: sea bass and salmon

1 - Let η_1, η_2, ..., η_M with η_i ≥ 0 and Σ_{i=1}^M η_i = 1 be the prior probabilities of the M categories. For our problem: let ω_1 = sea bass and ω_2 = salmon. P(ω_1) is the probability of getting a sea bass, P(ω_2) the probability of getting a salmon, and P(ω_1) + P(ω_2) = 1.
2 - We assume the class-conditional probability density functions (p.d.f.s) are known: P(x | θ_i) = f_i(x) is the likelihood of the observation x under class i (as shown in Fig. 6). For our problem we can take any measurable property, for example the weight, as the measurement x; the likelihood is then the probability of the observed weight x given the fish type.
3 - The posterior probability of a class is the probability of that class given the observation. For our problem: how strongly does the observed weight indicate the type of fish, i.e. what is the probability of each class given the weight x? Let η_i(x) = P(θ_i | x) denote the posterior probability; by Bayes' theorem,

    posterior = (likelihood × prior) / evidence,

that is,

    η_i(x) = η_i f_i(x) / Σ_{j=1}^M η_j f_j(x),

where the denominator (the evidence) follows from the law of total probability.
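A hedged sketch of this posterior computation for the two-class fish example; the priors follow the later example (2/3 and 1/3), while the Gaussian weight models are invented purely for illustration:

    # eta_i(x) = eta_i * f_i(x) / sum_j eta_j * f_j(x)   (illustrative class models, not from the text)
    import math

    def gauss_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    priors = {"sea bass": 2.0 / 3.0, "salmon": 1.0 / 3.0}        # eta_i
    likelihood = {                                               # f_i(x): assumed weight models
        "sea bass": lambda x: gauss_pdf(x, 10.0, 2.0),
        "salmon":   lambda x: gauss_pdf(x, 6.0, 1.5),
    }

    def posteriors(x):
        joint = {c: priors[c] * likelihood[c](x) for c in priors}
        evidence = sum(joint.values())                           # law of total probability
        return {c: joint[c] / evidence for c in joint}

    print(posteriors(9.0))   # the class with the larger eta_i(x) wins under the minimum-error rule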

Figure 6: Class-conditional p.d.f.s f_i(x) over the observation x.

Example (see Figure 6): a value such as x = 9 gives a much stronger indication for ω_2 than for ω_1, while near the intersection of the two curves the observation gives essentially no indication of the class.

Example: ω_1, ω_2 are two classes with prior probabilities 2/3 and 1/3, respectively.

Figure 7: Posterior probabilities, computed from the likelihood curves together with the priors and the evidence.

Let L(i, j) be the loss incurred by assigning an individual from category i to category j, i.e. the cost of taking decision j when i is the correct category. What is the loss if we decide the observed x is a salmon when it is really a sea bass (or vice versa)? If the statistician decides to place an individual with measurement x into category j, the conditional loss (expected loss, or conditional risk) is

    r_j(x) = Σ_{i=1}^M η_i(x) L(i, j).

We thus have one risk per possible decision, r_1(x), r_2(x), ..., r_M(x); in our case, r_salmon(x) and r_sea-bass(x). For a given x, the conditional loss is minimized by assigning the observation to the category j whose r_j(x) is lowest among all decisions. The Bayes decision rule is exactly this: decide the category j for which r_j(x) is lowest. The conditional Bayes risk (for a specific observation x) is

    r*(x) = min_j Σ_{i=1}^M η_i(x) L(i, j).

Overall risk: suppose a function α(x) assigns to each individual x a decision (a category in 1, ..., M), and let r_{α(x)}(x) denote the conditional risk of deciding α(x) at x. The overall risk of α is its expected loss,

    R_α = E[r_{α(x)}(x)] = ∫ r_{α(x)}(x) p(x) dx.

Overall Bayes risk: if, instead of an arbitrary decision function, we use the rule that returns the lowest conditional risk at every x (the Bayes rule), we obtain the Bayes risk, the expectation of the conditional Bayes risk r*(x), and this is the optimal (lowest) achievable risk:

    R* = min_α R_α = ∫ r*(x) p(x) dx = E[r*(x)].
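A minimal sketch of the minimum-conditional-risk decision described in this section (hypothetical posterior values and loss entries):

    # Bayes rule: decide the j minimizing r_j(x) = sum_i eta_i(x) * L(i, j).
    def bayes_decide(post, loss):
        # post: dict class -> eta_i(x);  loss: dict (true i, decided j) -> L(i, j)
        classes = list(post)
        risks = {j: sum(post[i] * loss[(i, j)] for i in classes) for j in classes}
        return min(risks, key=risks.get), risks

    classes = ["sea bass", "salmon"]
    loss01 = {(i, j): 0.0 if i == j else 1.0 for i in classes for j in classes}   # 0-1 loss
    post = {"sea bass": 0.7, "salmon": 0.3}    # eta_i(x) at some observation x (assumed numbers)
    print(bayes_decide(post, loss01))          # decides "sea bass" (risks 0.3 vs 0.7)

With the 0-1 loss, r_j(x) = 1 - η_j(x), so the minimum-risk decision is the maximum-posterior decision, i.e. the minimum-error rule used in the rest of these notes.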

5 Convergence of Nearest Neighbors

We first show that the nearest neighbor of x converges to x with probability one as the training set size grows to infinity.

Theorem (convergence of the nearest neighbor). If x, x_1, x_2, ..., x_n are i.i.d. random variables taking values in a separable metric space X, and x'_n is the nearest neighbor to x among x_1, ..., x_n, then x'_n → x with probability one as n → ∞.

Proof: Let S_x(r) = {x' ∈ X : d(x', x) ≤ r} be the sphere of radius r centered at x, where d is the metric on X. The nearest neighbor of x fails to fall in S_x(δ), for δ > 0, exactly when no point of the training set falls within this sphere. The probability that a single training point falls inside the sphere is Pr{S_x(δ)} = ∫_{x' ∈ S_x(δ)} p(x') dx', the total probability mass of the sphere. Hence

    Pr{d(x'_n, x) > δ} = Pr{x'_n ∉ S_x(δ)} = (1 − Pr{S_x(δ)})^n → 0

whenever Pr{S_x(δ)} > 0; for x in the support of the distribution every such sphere has positive probability, and the complement of the support has probability zero. In other words, as the training set size goes to infinity it becomes ever less likely that the nearest neighbor stays outside any fixed sphere around x, which gives the stated convergence.
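A simulation sketch of the convergence statement (my own, under the assumption that x and the x_i are uniform on the unit square with Euclidean distance): the distance to the nearest neighbor shrinks as n grows, in line with the argument above.

    # Empirical check that d(x'_n, x) -> 0 as n grows (uniform points on [0,1]^2, Euclidean d).
    import math, random

    def nn_distance(n):
        x = (random.random(), random.random())
        return min(math.dist(x, (random.random(), random.random())) for _ in range(n))

    for n in (10, 100, 1000, 10000):
        avg = sum(nn_distance(n) for _ in range(200)) / 200
        print(n, round(avg, 4))    # average nearest-neighbor distance shrinks with n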

6 Nearest Neighbors and Bayes Risk

Let x'_n ∈ {x_1, x_2, ..., x_n} be the nearest neighbor of x and let θ'_n be the category of the individual x'_n. The loss of deciding θ'_n when the true category is θ is L(θ, θ'_n). We define the n-sample NN risk R(n) as the expected loss,

    R(n) = E[L(θ, θ'_n)],

and the large-sample NN risk as R = lim_{n→∞} R(n).

6.1 M = 2

Consider the two-category (binary) classification problem under the probability-of-error criterion, given by the 0-1 loss matrix

    L = [[0, 1],
         [1, 0]],

which counts an error whenever a mistake in classification is made.

Theorem (two categories). Let X be a separable metric space and let f_1, f_2 be the class-conditional densities, so that the posterior probabilities satisfy η_1(x) + η_2(x) = 1. Then the large-sample NN risk R (probability of error) has the bounds

    R* ≤ R ≤ 2R*(1 − R*).

Proof: Consider the random variables x and x'_n in the n-sample NN problem. The conditional NN risk r(x, x'_n), i.e. the conditional expectation of the loss given x and x'_n, is obtained by averaging over θ and θ'_n:

    r(x, x'_n) = E[L(θ, θ'_n) | x, x'_n]
               = Pr{θ ≠ θ'_n | x, x'_n}
               = Pr{θ = 1 | x} Pr{θ'_n = 2 | x'_n} + Pr{θ = 2 | x} Pr{θ'_n = 1 | x'_n}.

Since Pr{θ = 1 | x} is the posterior probability η_1(x), we can rewrite this as

    r(x, x'_n) = η_1(x) η_2(x'_n) + η_2(x) η_1(x'_n).

By the lemma, x'_n converges to the random variable x with probability one; hence, with probability one, η_i(x'_n) → η_i(x), and the conditional risk converges with probability one:

    r(x, x'_n) → r(x) = 2 η_1(x) η_2(x),

where r(x) is the limit of the n-sample conditional NN risk. The conditional Bayes risk is

    r*(x) = min{η_1(x), η_2(x)} = min{η_1(x), 1 − η_1(x)},

using η_1(x) + η_2(x) = 1. We can therefore express the limiting conditional NN risk through r*(x):

    r(x) = 2 η_1(x) η_2(x) = 2 η_1(x)(1 − η_1(x)) = 2 r*(x)(1 − r*(x)).

Up to this point we have shown that, in the large-sample case, a randomly chosen observation x is misclassified by the NN rule with conditional probability 2 r*(x)(1 − r*(x)), with probability one. For the overall NN risk R we have, by definition, R = lim_{n→∞} E[r(x, x'_n)]; applying the dominated convergence theorem, we may interchange the limit and the expectation, which gives

    R = E[lim_{n→∞} r(x, x'_n)] = E[r(x)] = E[2 η_1(x) η_2(x)] = E[2 r*(x)(1 − r*(x))].
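A tiny numerical check (mine) of the identity used above, r(x) = 2 η_1(x) η_2(x) = 2 r*(x)(1 − r*(x)), over a grid of posterior values:

    # Check 2*eta1*(1 - eta1) == 2*r_star*(1 - r_star) with r_star = min(eta1, 1 - eta1).
    for eta1 in [i / 10 for i in range(11)]:
        r_star = min(eta1, 1.0 - eta1)
        lhs = 2.0 * eta1 * (1.0 - eta1)
        rhs = 2.0 * r_star * (1.0 - r_star)
        assert abs(lhs - rhs) < 1e-12, (eta1, lhs, rhs)
    print("identity holds on the grid")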

Proof of the upper bound:

    R = E[2 r*(x)(1 − r*(x))] = 2 E[r*(x)] − 2 E[r*(x)^2].

Using the identity Var(X) = E[X^2] − E[X]^2, we can rewrite the second term as

    2 E[r*(x)^2] = 2 (Var(r*(x)) + E[r*(x)]^2),

so that

    R = 2 E[r*(x)] − 2 (Var(r*(x)) + E[r*(x)]^2).

By the definition of the Bayes risk, E[r*(x)] = R*, hence

    R = 2R* − 2R*^2 − 2 Var(r*(x)).

Since the variance satisfies Var(r*(x)) ≥ 0, we obtain

    R ≤ 2R*(1 − R*).

Proof of the lower bound:

    R = E[2 r*(x)(1 − r*(x))] = E[r*(x) + r*(x)(1 − 2 r*(x))] = R* + E[r*(x)(1 − 2 r*(x))] ≥ R*,

because r*(x) = min{η_1(x), 1 − η_1(x)} ≤ 1/2, so the term r*(x)(1 − 2 r*(x)) is nonnegative.
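To illustrate the theorem, here is a Monte Carlo sketch (an assumption-laden illustration, not part of the proof) for a hypothetical two-class problem with equal priors and unit-variance Gaussian class densities at -1 and +1: it estimates R* = E[r*(x)] and the asymptotic NN risk R = E[2 r*(x)(1 − r*(x))] and checks that R* ≤ R ≤ 2R*(1 − R*).

    # Monte Carlo check of R* <= R <= 2 R* (1 - R*) for two Gaussians N(-1,1), N(+1,1), priors 1/2.
    import math, random

    def eta1(x):
        # posterior of class 1 at x (equal priors, unit-variance Gaussians at -1 and +1)
        a = math.exp(-0.5 * (x + 1.0) ** 2)
        b = math.exp(-0.5 * (x - 1.0) ** 2)
        return a / (a + b)

    trials = 200_000
    sum_rstar = sum_rnn = 0.0
    for _ in range(trials):
        cls = 1 if random.random() < 0.5 else 2
        x = random.gauss(-1.0 if cls == 1 else 1.0, 1.0)    # draw x from the chosen class
        e1 = eta1(x)
        r_star = min(e1, 1.0 - e1)                          # conditional Bayes risk r*(x)
        sum_rstar += r_star
        sum_rnn += 2.0 * r_star * (1.0 - r_star)            # asymptotic conditional NN risk r(x)
    R_star, R_nn = sum_rstar / trials, sum_rnn / trials
    print(round(R_star, 4), round(R_nn, 4), round(2 * R_star * (1 - R_star), 4))
    # expected ordering: R_star <= R_nn <= 2 * R_star * (1 - R_star)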