Exploiting k-nearest Neighbor Information with Many Data

1 Exploiting k-nearest Neighbor Information with Many Data. 2017 TEST TECHNOLOGY WORKSHOP (Tue.). Yung-Kyun Noh, Robotics Lab.,

2 Contents: nonparametric methods for estimating density functions; nearest neighbor methods; kernel density estimation methods; metric learning for nonparametric methods; generative approach for metric learning; theoretical properties and applications

3 Representation of Data. Data space: each datum is one point in a data space, e.g. $x = [1, 2, 5, 10, \ldots]^\top$.

4 Classification

5 Classification with Nearest Neighbors. The data space contains points from class 1 and class 2. Use majority voting (k-nearest neighbor classification): with k = 9, five neighbors belong to class 1 and four to class 2, so the test point is classified as class 1.
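The majority-vote rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function name `knn_classify`, the Euclidean metric, and the toy points are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=9):
    """Return the majority label among the k training points nearest to `query`."""
    # Rank training points by Euclidean distance to the query (assumed metric).
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Count the labels of the k closest points and return the most frequent one.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): five class-1 points close to the origin,
# four class-2 points slightly farther away, and two distant class-2 points.
points = [(1, 0), (0, 1), (-1, 0), (0, -1), (1, 1),
          (2, 0), (0, 2), (-2, 0), (0, -2), (5, 5), (6, 6)]
labels = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]

# With k = 9 the vote is five (class 1) against four (class 2).
print(knn_classify(points, labels, (0, 0), k=9))
```

With an odd k and two classes the vote cannot tie, which is one reason an odd value such as k = 9 is convenient.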

6 Nearest Points. Labels shown: ship, ship, ship, airplane, ship, ship, deer, ship, ship, ship.

7 Nearest Points. Labels shown: automobile, truck, cat, ship, ship, ship, ship, ship, automobile, automobile.

8 Bayes Classification. Bayes classification uses the underlying density functions. Its error, the Bayes risk, is the optimal error. In general, we do not know the underlying densities.

9 Nearest Neighbors and Bayes Classification. A surrogate for using the underlying density functions: count nearest neighbors!

10 Thomas M. Cover (8/7/1938 ~ 3/26/2012): B.S. in Physics from MIT; Ph.D. in EE from Stanford; Professor in EE and Statistics, Stanford. Peter E. Hart (born c. 1940s): M.S. and Ph.D. from Stanford; a strong advocate of artificial intelligence in industry; currently Group Senior Vice President at Ricoh Company, Ltd.

11 "Early in 1966 when I first began teaching at Stanford, a student, Peter Hart, walked into my office with an interesting problem. Charles Cole and he were using a pattern classification scheme which, for lack of a better word, they described as the nearest neighbor procedure. The proper goal would be to relate the probability of error of this procedure to the minimal probability of error, namely, the Bayes risk."

12 Nearest Neighbors and Bayes Risk, uniformly! Shown: 1-NN error and k-NN error. [T. Cover and P. Hart, 1967]
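For reference, the standard two-class form of the Cover and Hart (1967) result can be written as follows. The notation is an assumption made here: $R^{*}$ denotes the Bayes risk and $R_{1\text{-NN}}$ the error of the 1-NN rule in the limit of infinitely many samples.

```latex
R^{*} \;\le\; R_{1\text{-NN}} \;\le\; 2R^{*}\,(1 - R^{*}) \;\le\; 2R^{*}
```

In words, the asymptotic 1-NN error is never worse than twice the Bayes risk. A related standard consistency result is that the k-NN error approaches $R^{*}$ when $k \to \infty$ and $k/N \to 0$ as the number of samples $N$ grows.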

13 Nearest Neighbor Classification and Accuracy. Nearest neighbor classification with two-class data drawn from two different random Gaussians.

14 Metric Dependency of Nearest Neighbors. A different metric changes class membership: the same point is classified as red under one metric and as blue under another. Mahalanobis-type distance: $d_A(x, x') = \sqrt{(x - x')^\top A\,(x - x')}$.
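A small sketch can make the metric dependency concrete. It assumes the common Mahalanobis-type form, in which the squared distance is $(x - x')^\top A (x - x')$ for a symmetric positive definite matrix $A$. The helper names, the two training points, and the matrix values are hypothetical.

```python
def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis-type distance (x - y)^T A (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))


def nearest_label(points, labels, query, A):
    """Label of the training point closest to `query` under the metric A."""
    best = min(range(len(points)),
               key=lambda i: mahalanobis_sq(points[i], query, A))
    return labels[best]


# Hypothetical toy data: one red point and one blue point around the query.
points = [(1.0, 0.0), (0.0, 0.8)]
labels = ["red", "blue"]
query = (0.0, 0.0)

identity = [[1.0, 0.0], [0.0, 1.0]]    # Euclidean metric
stretched = [[0.25, 0.0], [0.0, 4.0]]  # shrinks axis 1, stretches axis 2

print(nearest_label(points, labels, query, identity))   # blue is closer
print(nearest_label(points, labels, query, stretched))  # red is closer
```

The data never change between the two calls; only the matrix A does, and that alone flips the predicted class.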

15 Conventional Idea of Metric Learning. [Figure: two panels, each showing class 1 and class 2.]

16 Many Data Situation with Overlap

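17 Consider Finite Sampling Situation. Let $x \in \mathbb{R}^D$, and let $x_{NN}$ be the nearest neighbor of $x_i$, found at distance $d_N$. Find the expectation of $p(x_{NN})$ on the surface of the hypersphere of radius $d_N$ around $x_i$ (expectation over the NN distribution):
$$E_{x_{NN}}\left[\, p(x_{NN}) \mid d_N, x_i \,\right] \approx p(x_i) + \frac{d_N^2}{2D}\, \nabla^2 p \,\Big|_{x = x_i}$$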
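The factor $d_N^2 / (2D)$ follows from a second-order Taylor expansion averaged over the sphere. The derivation below is a sketch under the assumption that $x_{NN} = x_i + d_N u$, with $u$ uniformly distributed on the unit sphere in $\mathbb{R}^D$; the symbol $u$ is introduced here for illustration.

```latex
\begin{aligned}
p(x_i + d_N u) &= p(x_i) + d_N\, u^\top \nabla p(x_i)
  + \frac{d_N^2}{2}\, u^\top \nabla\nabla p(x_i)\, u + O(d_N^3), \\
E[u] &= 0, \qquad E[u u^\top] = \frac{1}{D} I, \\
E_u\!\left[ p(x_i + d_N u) \right] &= p(x_i)
  + \frac{d_N^2}{2D}\, \operatorname{tr}\!\left( \nabla\nabla p(x_i) \right) + O(d_N^4)
  = p(x_i) + \frac{d_N^2}{2D}\, \nabla^2 p(x_i) + O(d_N^4).
\end{aligned}
```

The first-order and third-order terms vanish because odd moments of $u$ are zero by symmetry, so the leading correction is the Laplacian term.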

18 Bias of Nearest Neighbor Classification. Setting: the nearest neighbor appears at a distance d. [Figure: two panels, each showing a test point x and its NN point.]

19 Bias in the Expected Error. Assumption: a nearest neighbor appears at a nonzero distance. The expected error then consists of the asymptotic NN error plus metric-variant terms, a residual due to finite sampling. References: R. R. Snapp et al. (1998), "Asymptotic expansions of the k nearest neighbor risk," The Annals of Statistics. Y.-K. Noh et al. (2010), "Generative local metric learning for nearest neighbor classification," NIPS.

20 Conventional Metric Learning

21 Generative Local Metric Learning (GLML): 20% increase

22 Synthetic Data [Non-Gaussian]. Two of the dimensions are shown below; the remaining dimensions contain isotropic Gaussian noise.

23 Various Datasets: comparison with discriminative metric learning. [Figure: performance versus number of training data (# tr. data) for the German, Waveform, Ionosphere, Twonorm, USPS 8x8, and TI46 datasets.] [Y.-K. Noh et al., 2010]

24 Image Data Classification with Convolutional Neural Networks (AlexNet): Caltech101, Caltech256

25 Manifold Embedding (Isomap). Use Dijkstra's algorithm to calculate the manifold distance from nearest neighbor distances, then apply MDS using the manifold distance.
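The first Isomap step, turning nearest neighbor distances into manifold distances with Dijkstra's algorithm, can be sketched as below. The MDS step is omitted. The function names, the symmetric k-NN graph construction, and the semicircle data are assumptions made for this illustration.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph with Euclidean edge weights."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nbrs = sorted((math.dist(points[i], points[j]), j)
                      for j in range(n) if j != i)[:k]
        for d, j in nbrs:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph undirected
    return graph


def dijkstra(graph, source):
    """Shortest-path (manifold) distance from `source` to every node."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical data: 21 points on a unit semicircle (a 1-D manifold in 2-D).
points = [(math.cos(math.pi * t / 20), math.sin(math.pi * t / 20))
          for t in range(21)]
geodesic = dijkstra(knn_graph(points, k=2), source=0)[20]
straight = math.dist(points[0], points[20])
print(geodesic, straight)  # the path along the arc is longer than the chord
```

The two endpoints are 2 apart in a straight line, while the graph distance follows the arc and comes out close to its length, which is the quantity MDS would then try to preserve.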

26 Manifold Embedding (Isomap)

27 Isomap with LMNN Metric

28 Isomap with GLML Metric

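29 Nadaraya-Watson Estimator. Data: $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^D$; $y_i \in \{0, 1\}$ for classification and $y_i \in \mathbb{R}$ for regression.
$$\hat{y}_N(x) = \frac{\sum_{i=1}^{N} K(x_i, x)\, y_i}{\sum_{j=1}^{N} K(x_j, x)}$$
$$K(x_i, x) = K\!\left(\frac{\|x_i - x\|}{h}\right) = \frac{1}{\sqrt{2\pi}^{\,D} h^{D}} \exp\!\left(-\frac{1}{2h^2}\, \|x_i - x\|^2\right)$$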
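A direct transcription of the estimator into Python might look like the following sketch. The function names and the three-point data set are assumptions for illustration; the Gaussian kernel with bandwidth `h` follows the definition on the slide.

```python
import math


def gaussian_kernel(xi, x, h):
    """Gaussian kernel K(||xi - x|| / h) with its normalizing constant."""
    dim = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    norm = (math.sqrt(2.0 * math.pi) * h) ** dim
    return math.exp(-sq_dist / (2.0 * h * h)) / norm


def nadaraya_watson(X, y, x, h):
    """Kernel-weighted average of the targets y at the query point x."""
    weights = [gaussian_kernel(xi, x, h) for xi in X]
    total = sum(weights)
    # Note: total can underflow to 0.0 if x is far from all data relative to h.
    return sum(w * yi for w, yi in zip(weights, y)) / total


# Hypothetical 1-D regression data lying on the line y = x.
X = [(0.0,), (1.0,), (2.0,)]
y = [0.0, 1.0, 2.0]
estimate = nadaraya_watson(X, y, (1.0,), h=0.5)
print(estimate)
```

With labels in {0, 1} the same function returns a value between 0 and 1, which can be thresholded at 0.5 for classification.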

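30 Kernel regression (Nadaraya-Watson regression) with metric learning. Data: $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$ with $x_i, x \in \mathbb{R}^D$.
$$K(x_i, x) = K\!\left(\frac{\|x_i - x\|}{h}\right) = \frac{1}{\sqrt{2\pi}^{\,D} h^{D}} \exp\!\left(-\frac{1}{2h^2}\, \|x_i - x\|^2\right)$$
$$\hat{y}_N(x) = \frac{\sum_{i=1}^{N} K(x_i, x)\, y_i}{\sum_{j=1}^{N} K(x_j, x)}$$
[Figure: five training points $x_1, \ldots, x_5$ with targets $y_1, \ldots, y_5$ around a query point $x$; what is $y(x)$?] For these five points:
$$\hat{y}_5(x) = \frac{K(x_1, x)}{\sum_{j=1}^{5} K(x_j, x)}\, y_1 + \frac{K(x_2, x)}{\sum_{j=1}^{5} K(x_j, x)}\, y_2 + \cdots + \frac{K(x_5, x)}{\sum_{j=1}^{5} K(x_j, x)}\, y_5$$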

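31 Kernel regression (Nadaraya-Watson regression) with metric learning. Data: $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$. The kernel now depends on a metric matrix $A$:
$$K(x_i, x; A) = K\!\left(\frac{\|x_i - x\|_A}{h}\right) = \frac{1}{\sqrt{2\pi}^{\,D} h^{D}} \exp\!\left(-\frac{1}{2h^2}\, (x_i - x)^\top A\, (x_i - x)\right)$$
$$\hat{y}_N(x) = \frac{\sum_{i=1}^{N} K(x_i, x; A)\, y_i}{\sum_{j=1}^{N} K(x_j, x; A)}$$
[Figure: the same five training points and query point $x$, with the neighborhood shaped by the metric $A$; what is $y(x)$?]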
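The effect of the metric matrix A on the estimate can be illustrated with a toy case in which the target depends only on the first input coordinate. This is a sketch under stated assumptions: the matrix `A_metric` is hand-picked to down-weight the irrelevant coordinate and is not the output of the metric learning method in the talk, and all names and data are hypothetical. The kernel's normalizing constant is dropped because it cancels in the ratio.

```python
import math


def metric_kernel(xi, x, A, h):
    """Unnormalized Gaussian kernel exp(-(xi - x)^T A (xi - x) / (2 h^2))."""
    diff = [a - b for a, b in zip(xi, x)]
    n = len(diff)
    quad = sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.exp(-quad / (2.0 * h * h))


def nw_with_metric(X, y, x, A, h):
    """Nadaraya-Watson estimate using the metric-dependent kernel."""
    weights = [metric_kernel(xi, x, A, h) for xi in X]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)


# Hypothetical data: y equals the first coordinate; the second is irrelevant.
X = [(0.5, 3.0), (2.0, 0.1)]
y = [0.5, 2.0]
query, truth = (0.5, 0.0), 0.5

A_identity = [[1.0, 0.0], [0.0, 1.0]]
A_metric = [[1.0, 0.0], [0.0, 0.01]]  # hand-picked, nearly ignores axis 2

pred_identity = nw_with_metric(X, y, query, A_identity, h=1.0)
pred_metric = nw_with_metric(X, y, query, A_metric, h=1.0)
print(abs(pred_identity - truth), abs(pred_metric - truth))
```

Under the identity metric the point that matches the query in the relevant coordinate is treated as far away, so the estimate is pulled toward the wrong target; shrinking the irrelevant direction reduces that error.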

32 Nadaraya-Watson Regression is Asymptotically Optimal.
$$\hat{y}_N(x) = \frac{\sum_{i=1}^{N} K(x_i, x)\, y_i}{\sum_{j=1}^{N} K(x_j, x)}, \qquad \lim_{N \to \infty,\; h \to 0} \hat{y}_N(x) = E_{p(y|x)}[y]$$
The limit minimizes the mean square error (MSE). This asymptotic property is metric independent.

33 Mean Square Error with Finite Samples.
$$\hat{y}_N(x) = \frac{\sum_{i=1}^{N} K_h(x_i, x)\, y_i}{\sum_{j=1}^{N} K_h(x_j, x)}, \qquad \frac{1}{N_{tst}} \sum_{j=1}^{N_{tst}} \left(\hat{y}_N(x_j) - y_j\right)^2$$

34 Bandwidth selection.
Bias:
$$E\left[\hat{y}(x) - y(x)\right] = h^2 \left( \frac{\nabla^\top p(x)\, \nabla y(x)}{p(x)} + \frac{\nabla^2 y(x)}{2} \right) + o(h^4)$$
Variance:
$$E\left[(\hat{y}(x) - E[\hat{y}(x)])^2\right] = \frac{1}{N h^D (2\sqrt{\pi})^D} \left[ \frac{\sigma_y^2(x)}{p(x)} + h^2 \left( \frac{\sigma_y^2(x)}{4\, p(x)^2}\, \nabla^2 p + \frac{(\nabla y(x))^2}{p(x)} \right) + o(h^4) \right]$$
Conventional methods focus on finding an optimal bandwidth h that minimizes the MSE. Typically, the optimal bandwidth is chosen to balance the tradeoff between the squared bias and the variance.

35 Bandwidth selection (continued). The same bias and variance expressions, annotated with the decomposition of the MSE into squared bias plus variance.

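36 For Gaussian:
Bias, where $\nabla^2 y(x) = 0$ because $y(x)$ is linear:
$$E\left[\hat{y}(x) - y(x)\right] = h^2 \left( \frac{\nabla^\top p(x)\, \nabla y(x)}{p(x)} + \frac{\nabla^2 y(x)}{2} \right) + o(h^4)$$
Variance (simply ignore all variance):
$$E\left[(\hat{y}(x) - E[\hat{y}(x)])^2\right] = \frac{1}{N h^D (2\sqrt{\pi})^D} \left[ \frac{\sigma_y^2(x)}{p(x)} + h^2 \left( \frac{\sigma_y^2(x)}{4\, p(x)^2}\, \nabla^2 p + \frac{(\nabla y(x))^2}{p(x)} \right) + o(h^4) \right]$$
1) With Gaussian data, we only consider the first bias term because y(x) is linear. 2) We simply ignore variance minimization. We will explain the reason later.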

37 For x & y Jointly Gaussian: the learned metric is not sensitive to the bandwidth.

38 Benchmark Data

39 Two Theoretical Properties for Gaussians. (1) There exists a symmetric positive definite matrix A that eliminates the first term of the bias, $\frac{\nabla^\top p(x)\, \nabla y(x)}{p(x)}$. (2) With the optimal bandwidth h minimizing the leading-order terms, the minimum mean square error is the square of the bias in infinitely high-dimensional space.

40 Diffusion Decision Model. Choosing between two alternatives under time pressure with uncertain information. [Figure: evidence accumulating over time between an upper threshold z (class 1) and a lower threshold -z (class 2), with a time T marked. Speed-accuracy tradeoff: arrows on the thresholds and on T are labeled "Increase speed" and "Increase accuracy".]

41 Sequential Sampling Methods. Sequential sampling methods for optimal decision making: response after some time (~1 s). LIP: signals from two neurons with different receptive fields. [J. M. Beck et al., 2008]

42 Equivalence Principle. The DDM (Diffusion Decision Model) for comparing two Poisson processes is equivalent to k-nearest neighbor classification in the asymptotic situation for 2 classes.
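A diffusion-style decision between two Poisson processes can be simulated as a random walk on the difference of event counts. The sketch below is an illustration under stated assumptions, not the model from the talk: the rates 0.8 and 0.4, the threshold values, the tie-breaking rule at the time limit, and the function names are all hypothetical. It relies on the fact that in the superposition of two independent Poisson processes, each event comes from process 1 with probability rate1 / (rate1 + rate2).

```python
import random


def ddm_trial(rate1, rate2, z, t_max=float("inf"), rng=random):
    """Accumulate evidence (+1 per event of process 1, -1 per event of
    process 2) until it reaches +z or -z, or until time t_max passes."""
    evidence, t = 0, 0.0
    total = rate1 + rate2
    while abs(evidence) < z:
        t += rng.expovariate(total)  # waiting time to the next merged event
        if t > t_max:
            t = t_max
            break
        evidence += 1 if rng.random() < rate1 / total else -1
    if evidence == 0:
        choice = rng.choice((1, 2))  # assumed tie-break at the time limit
    else:
        choice = 1 if evidence > 0 else 2
    return choice, t


def simulate(n_trials, rate1, rate2, z, t_max=float("inf"), seed=0):
    """Accuracy (choosing the higher-rate process) and mean decision time."""
    rng = random.Random(seed)
    results = [ddm_trial(rate1, rate2, z, t_max, rng) for _ in range(n_trials)]
    correct = 1 if rate1 > rate2 else 2
    accuracy = sum(choice == correct for choice, _ in results) / n_trials
    mean_time = sum(t for _, t in results) / n_trials
    return accuracy, mean_time


for z in (1, 5):
    print(z, simulate(2000, 0.8, 0.4, z))
```

Raising the threshold z makes decisions slower but more accurate, and lowering it does the opposite, which is the speed-accuracy tradeoff of the diffusion decision model.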

43 Diffusion Decision Processes Implementation of diffusion decision model with NN classification (PV criterion) Confidence level 0.8 Confidence level

44 Experimental Results. Uniform probability densities, $\lambda_1 = 0.8$, $\lambda_2 =$ [Noh et al. 2012]

45 Race Model

46 Effect of Number of Data. Two Gaussian Data. Number of data: Number of data:

47 CIFAR-10 Images: 32x32 color images. Number of data: 1000/class

48 Summary: nearest neighbor methods and asymptotic property; Nadaraya-Watson regression with metric learning; diffusion decision making and nearest neighbor methods

49 THANK YOU. Yung-Kyun Noh
