Nearest Neighbor. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 26, 2017.
1 Nearest Neighbor. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 26, 2017.
2 Some data, Bayes Classifier. Training data: true label +1, true label -1. The optimal Bayes classifier's decision boundary is the set where $P(Y = 1 \mid X = x) = \frac{1}{2}$. Predicted label: +1 on one side, -1 on the other. Figures stolen from Hastie et al.
3 Linear Decision Boundary. Training data: true label +1, true label -1. Learned: linear decision boundary $x^T w + b = 0$. Predicted label: +1 on one side, -1 on the other. Figures stolen from Hastie et al.
4 15-Nearest Neighbor Boundary. Training data: true label +1, true label -1. Learned: 15-nearest-neighbor decision boundary (majority vote). Predicted label: +1 on one side, -1 on the other.
5 1-Nearest Neighbor Boundary. Training data: true label +1, true label -1. Learned: 1-nearest-neighbor decision boundary (majority vote). Predicted label: +1 on one side, -1 on the other.
6 k-Nearest Neighbor Error: the bias-variance tradeoff. As $k \to \infty$ (with $n$ growing so that $k/n \to 0$): averaging over many neighbors drives the variance down, and the error can approach the best possible (Bayes) error. As $k \to 1$: low bias, but high variance, since the prediction tracks single noisy neighbors.
7 Notable distance metrics (and their level sets): the $\ell_2$ norm; the $\ell_1$ norm (taxi-cab); the Mahalanobis distance $\sqrt{(u - v)^T \Sigma^{-1} (u - v)}$ (here, $\Sigma$ is not necessarily diagonal, but is symmetric); and the $\ell_\infty$ (max) norm.
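To make the metrics concrete, here is a minimal NumPy sketch computing each of them for one pair of points (the vectors and the $\Sigma$ below are made-up illustrations):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # symmetric positive definite, not diagonal

l2 = np.linalg.norm(u - v, ord=2)          # Euclidean
l1 = np.linalg.norm(u - v, ord=1)          # taxi-cab
linf = np.linalg.norm(u - v, ord=np.inf)   # max norm
d = u - v
mahalanobis = np.sqrt(d @ np.linalg.inv(Sigma) @ d)

print(l2, l1, linf, mahalanobis)
```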
8 1 nearest neighbor. One can draw the nearest-neighbor regions in input space. Compare $\mathrm{Dist}(x_i, x_j) = (x_i^1 - x_j^1)^2 + (x_i^2 - x_j^2)^2$ with $\mathrm{Dist}(x_i, x_j) = (x_i^1 - x_j^1)^2 + (3x_i^2 - 3x_j^2)^2$. The relative scalings in the distance metric affect the region shapes.
9 1 nearest neighbor guarantee. Data $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{1, \ldots, k\}$. As $n \to \infty$, assume the $x_i$'s become dense in $\mathbb{R}^d$. Note: any $x_a \in \mathbb{R}^d$ then has the same label distribution as $x_b$ with $b = \mathrm{1NN}(a)$. [Cover, Hart, 1967]
10-12 1 nearest neighbor guarantee (continued). If $p_\ell = P(Y_a = \ell) = P(Y_b = \ell)$ and $\ell^* = \arg\max_{\ell = 1, \ldots, k} p_\ell$, then the Bayes error is $1 - p_{\ell^*}$, and the 1-nearest-neighbor error is
$$P(Y_a \neq Y_b) = \sum_{\ell=1}^{k} P(Y_a = \ell, Y_b \neq \ell) = \sum_{\ell=1}^{k} p_\ell (1 - p_\ell) \leq 2(1 - p_{\ell^*}) - \frac{k}{k-1}(1 - p_{\ell^*})^2.$$
So as $n \to \infty$, the 1-NN rule's error is at most twice the Bayes error! [Cover, Hart, 1967]
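The asymptotic argument can be sanity-checked numerically: in the dense-data limit, the slide's model says $Y_a$ and $Y_b$ are independent draws from the same label distribution $p$, so a quick Monte Carlo sketch (with a made-up $p$) can compare the Bayes error, the 1-NN error, and the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # made-up label distribution, k = 3
k = len(p)

Ya = rng.choice(k, size=1_000_000, p=p)   # true label at x_a
Yb = rng.choice(k, size=1_000_000, p=p)   # label of its nearest neighbor

bayes = 1 - p.max()                       # Bayes error: 1 - p_{l*}
nn_exact = np.sum(p * (1 - p))            # sum_l p_l (1 - p_l)
nn_mc = np.mean(Ya != Yb)                 # Monte Carlo estimate of P(Ya != Yb)
bound = 2 * bayes - (k / (k - 1)) * bayes**2

print(bayes, nn_exact, nn_mc, bound)      # bayes <= nn <= bound <= 2 * bayes
```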
13 Curse of dimensionality, Ex. 1. $X$ is uniformly distributed over $[0, 1]^p$. For side length $r$, what is $P(X \in [0, r]^p)$? Answer: $r^p$, which shrinks to zero exponentially fast in $p$ for any $r < 1$.
14 Curse of dimensionality, Ex. 2. $\{X_i\}_{i=1}^n$ are uniformly distributed over $[-\frac{1}{2}, \frac{1}{2}]^p$. What is the median distance from a point at the origin to its 1NN?
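A Monte Carlo sketch of both examples (the choices of $n$, $p$, and the trial count are illustrative): Ex. 1 is computed exactly, and the median 1-NN distance in Ex. 2 is estimated by simulation. Notice how quickly the nearest neighbor stops being "near" as $p$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ex. 1: P(X in [0, r]^p) = r^p vanishes quickly as p grows.
r = 0.9
for p in [1, 10, 100]:
    print(p, r**p)

# Ex. 2: median distance from the origin to its nearest neighbor,
# with n points uniform on [-1/2, 1/2]^p.
n, trials = 500, 200
for p in [1, 2, 10, 100]:
    dists = []
    for _ in range(trials):
        X = rng.uniform(-0.5, 0.5, size=(n, p))
        dists.append(np.min(np.linalg.norm(X, axis=1)))
    print(p, np.median(dists))
```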
15 Nearest neighbor regression. Data $\{(x_i, y_i)\}_{i=1}^n$. Let $N_k(x_0)$ be the $k$ nearest neighbors of $x_0$ and predict the average of their responses: $\hat f(x_0) = \frac{1}{k} \sum_{x_i \in N_k(x_0)} y_i$.
16-18 Nearest neighbor regression (continued). Why are far-away neighbors weighted the same as close neighbors? Kernel smoothing with a kernel $K(x, y)$ weights each point by its proximity: $\hat f(x_0) = \frac{\sum_{i=1}^n K(x_0, x_i) y_i}{\sum_{i=1}^n K(x_0, x_i)}$. But why just average them at all? A sketch of both averaging estimators follows, and the next slide fits a local model instead.
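A minimal sketch of both estimators on synthetic 1-D data (the Gaussian smoothing kernel and its bandwidth are illustrative choices, not mandated by the slides):

```python
import numpy as np

def knn_regress(x0, X, y, k):
    """Average of the k nearest neighbors' responses."""
    idx = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
    return y[idx].mean()

def kernel_smooth(x0, X, y, bandwidth=0.5):
    """Nadaraya-Watson: kernel-weighted average, so nearby points count more."""
    w = np.exp(-np.linalg.norm(X - x0, axis=1)**2 / (2 * bandwidth**2))
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
x0 = np.array([1.0])
print(knn_regress(x0, X, y, k=15), kernel_smooth(x0, X, y), np.sin(1.0))
```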
19 Nearest neighbor regression: Local Linear Regression. Instead of averaging, fit a weighted least squares line at each query point: $(w(x_0), b(x_0)) = \arg\min_{w, b} \sum_{i=1}^n K(x_0, x_i) (y_i - (b + w^T x_i))^2$, and predict $\hat f(x_0) = b(x_0) + w(x_0)^T x_0$.
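A sketch of the local linear fit, reusing the synthetic-data setup from the previous snippet; the kernel and bandwidth are again illustrative:

```python
import numpy as np

def local_linear(x0, X, y, bandwidth=0.5):
    """Solve the weighted least squares problem at x0, predict b + w^T x0."""
    k = np.exp(-np.linalg.norm(X - x0, axis=1)**2 / (2 * bandwidth**2))
    A = np.hstack([np.ones((len(X), 1)), X])   # columns: intercept, features
    W = np.diag(k)
    # Weighted normal equations: (A^T W A) theta = A^T W y
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    b, w = theta[0], theta[1:]
    return b + w @ x0

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(local_linear(np.array([1.0]), X, y), np.sin(1.0))
```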
20-22 Nearest Neighbor Overview.
- Very simple to explain and implement.
- No training! But finding nearest neighbors in a large dataset at test time can be computationally demanding (k-d trees help).
- You can use other forms of distance (not just Euclidean).
- Smoothing with kernels and local linear regression can improve performance (at the cost of higher variance).
- With a lot of data, local methods have strong, simple theoretical guarantees. Without a lot of data, neighborhoods aren't local and the methods suffer.
23 Kernels. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 26, 2017.
24 Machine Learning Problems. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learn a model's parameters by minimizing $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. Hinge loss: $\ell_i(w) = \max\{0, 1 - y_i x_i^T w\}$. Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$. Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$. All in terms of inner products! Even nearest neighbor can use inner products!
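A small sketch making the point explicit: each loss touches the features only through the inner product $x_i^T w$ (the vectors below are made up):

```python
import numpy as np

def hinge(w, x, y):    return max(0.0, 1.0 - y * (x @ w))
def logistic(w, x, y): return np.log1p(np.exp(-y * (x @ w)))
def squared(w, x, y):  return (y - x @ w)**2

w = np.array([1.0, -0.5])
x, y = np.array([0.3, 2.0]), -1.0
print(hinge(w, x, y), logistic(w, x, y), squared(w, x, y))
```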
25 What if the data is not linearly separable? Use features of features of features of features: $\phi(x) : \mathbb{R}^d \to \mathbb{R}^p$. The feature space can get really large really quickly!
26-28 Dot-product of polynomials of degree exactly $d$.
$d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, so $\langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2$.
$d = 2$: $\phi(u) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_1 u_2 \\ u_2 u_1 \end{bmatrix}$, so $\langle \phi(u), \phi(v) \rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u^T v)^2$.
General $d$: the dimension of $\phi(u)$ is roughly $p^d$ if $u \in \mathbb{R}^p$, yet the inner product $(u^T v)^d$ can be computed in time linear in $p$.
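A quick numerical check of the $d = 2$ identity (made-up vectors): the explicit feature map and the squared inner product agree, but the kernel side never builds the $p^d$-dimensional vector.

```python
import numpy as np

def phi(u):
    # Explicit degree-2 feature map from the slide
    return np.array([u[0]**2, u[1]**2, u[0]*u[1], u[1]*u[0]])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(u) @ phi(v))   # explicit feature map, then inner product
print((u @ v)**2)        # kernel trick: one inner product, then square
```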
29 Observation. $\hat w = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$. There exists an $\alpha \in \mathbb{R}^n$ such that $\hat w = \sum_{i=1}^n \alpha_i x_i$. Why? Setting the gradient to zero gives $\hat w = \frac{1}{\lambda} \sum_{i=1}^n (y_i - x_i^T \hat w)\, x_i$, so $\hat w$ lies in the span of the data.
30 Observation (continued). Substituting $w = \sum_i \alpha_i x_i$ and writing $K_{ij} = \langle x_i, x_j \rangle$, the objective becomes $\hat\alpha = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$, with closed-form solution $\hat\alpha = (K + \lambda I)^{-1} y$.
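Putting the two observations together gives kernel ridge regression. A minimal sketch, assuming the closed form above, with an illustrative degree-2 polynomial kernel and made-up synthetic data:

```python
import numpy as np

def kernel(u, v):
    return (1.0 + u @ v)**2   # polynomial kernel of degree up to 2

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = X[:, 0]**2 - X[:, 1] + 0.05 * rng.standard_normal(100)

lam = 0.1
K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # (K + lam I)^{-1} y

def predict(x0):
    # f(x0) = sum_i alpha_i K(x0, x_i); no explicit feature map needed
    return sum(a * kernel(x0, xi) for a, xi in zip(alpha, X))

x0 = np.array([0.5, 0.5])
print(predict(x0), 0.5**2 - 0.5)   # prediction vs noiseless target
```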
31 Common kernels. Polynomials of degree exactly $d$: $K(u, v) = (u^T v)^d$. Polynomials of degree up to $d$: $K(u, v) = (1 + u^T v)^d$. Gaussian (squared exponential) kernel: $K(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$. Sigmoid: $K(u, v) = \tanh(\gamma\, u^T v + r)$.
32 Mercer's Theorem. When do we have a valid kernel $K(x, x')$? Definition 1: when it is an inner product, $K(x, x') = \langle \phi(x), \phi(x') \rangle$ for some feature map $\phi$. Mercer's Theorem: $K(x, x')$ is a valid kernel if and only if $K$ is positive semi-definite, in the following sense: for any finite set $\{x_1, \ldots, x_n\}$ and any $a \in \mathbb{R}^n$, $\sum_{i,j} a_i a_j K(x_i, x_j) \geq 0$.
33 RBF Kernel. $K(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$. Note that this is like placing weighted bumps on each point, as in kernel smoothing, but now we learn the weights. Is there an inner product representation of $K(x, y)$?
34 Classification. $\hat w = \arg\min_{w, b} \sum_{i=1}^n \max\{0, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2$. Substituting $w = \sum_j \alpha_j x_j$ kernelizes it: $\min_{\alpha, b} \sum_{i=1}^n \max\{0, 1 - y_i(b + \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle)\} + \lambda \sum_{i,j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle$.
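A rough subgradient-descent sketch of this kernelized objective on toy data (the RBF kernel, step size, and iteration count are all made-up choices; production solvers typically work with the dual QP instead):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1.0, -1.0)  # nonlinear labels

# Gram matrix for an RBF kernel with sigma = 1
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=2) / 2.0)
lam, step = 0.01, 1e-3
alpha, b = np.zeros(len(X)), 0.0

for _ in range(2000):
    f = K @ alpha + b
    viol = (1 - y * f) > 0                 # points inside the margin
    # Subgradient of hinge term plus gradient of the quadratic regularizer
    g_alpha = -(K[viol] * y[viol, None]).sum(axis=0) + 2 * lam * K @ alpha
    g_b = -y[viol].sum()
    alpha -= step * g_alpha
    b -= step * g_b

print(np.mean(np.sign(K @ alpha + b) == y))   # training accuracy
```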
35-36 RBF kernel: secretly random features. Use the identity $2\cos(\alpha)\cos(\beta) = \cos(\alpha + \beta) + \cos(\alpha - \beta)$. Draw $b \sim \mathrm{Uniform}(0, 2\pi)$ and $w \sim \mathcal{N}(0, \sigma^2 I)$, and set $\phi(x) = \sqrt{2}\, \cos(w^T x + b)$. Then $\mathbb{E}_{w,b}[\phi(x)^T \phi(y)] = e^{-\sigma^2 \|x - y\|^2 / 2}$. [Rahimi, Recht 2007] Hint: use Euler's formula $e^{jz} = \cos(z) + j \sin(z)$.
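A sketch of the claim for $\sigma = 1$: averaging $\phi(x)\phi(y)$ over many independent draws of $(w, b)$ should approach the RBF kernel value (the points and the number of random features are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, 1.0])
y = np.array([-0.2, 0.7])

m = 100_000                                # number of random features
W = rng.standard_normal((m, len(x)))       # w ~ N(0, I), i.e. sigma = 1
b = rng.uniform(0, 2 * np.pi, size=m)      # b ~ Uniform(0, 2*pi)

phi_x = np.sqrt(2) * np.cos(W @ x + b)
phi_y = np.sqrt(2) * np.cos(W @ y + b)

print(np.mean(phi_x * phi_y))              # Monte Carlo estimate of the kernel
print(np.exp(-np.sum((x - y)**2) / 2))     # exact RBF kernel value
```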
37 Wait, infinite dimensions? Isn't everything separable there? How are we not overfitting? Regularization! The fat-shattering dimension scales like $(R / \mathrm{margin})^2$. What about sparsity?
38 String Kernels. Example from Efron and Hastie, 2016. Amino acid sequences of different lengths: x1, x2. Feature map: counts of all subsequences of length 3 (out of the $20^3$ possibilities over the 20 amino acids).
39 Least squares, tradeoffs.