Introduction to Machine Learning

1 Introduction to Machine Learning
3. Instance Based Learning
Alex Smola, Carnegie Mellon University

2 Outline
- Parzen Windows: kernels, algorithm
- Model selection: cross-validation, leave-one-out, bias-variance
- Watson-Nadaraya estimator: classification, regression, novelty detection
- Nearest Neighbor estimator: limit case of Parzen Windows

3 Parzen Windows Parzen

4 Density Estimation
- Observe some data $x_i$
- Want to estimate $p(x)$
  - Find unusual observations (e.g. security)
  - Find typical observations (e.g. prototypes)
  - Classifier via Bayes rule
    $$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}$$
- Need a tool for computing $p(x)$ easily

5-8 Bin Counting
- Discrete random variables, e.g.
  - English, Chinese, German, French, ...
  - Male, Female
- Bin counting (record # of occurrences)
- [Table: occurrence counts by language (English, Chinese, German, French, Spanish) and gender (male, female)]
- Annotation on the table: not enough data
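
Bin counting for discrete variables can be sketched in a few lines of Python. The records below are invented for illustration and are not the counts from the slide; the function name is likewise hypothetical.

```python
from collections import Counter

# Hypothetical observations: (language, gender) pairs, invented for illustration.
records = [
    ("English", "male"), ("English", "female"), ("English", "male"),
    ("Chinese", "female"), ("German", "male"), ("French", "female"),
]

def bin_count_estimate(samples):
    """Estimate p(a) for every observed bin a by its relative frequency."""
    counts = Counter(samples)
    total = len(samples)
    return {bin_: c / total for bin_, c in counts.items()}

p_hat = bin_count_estimate(records)

languages = ["English", "Chinese", "German", "French", "Spanish"]
genders = ["male", "female"]
# Bins that were never observed get probability 0: the "not enough data" problem.
empty_bins = [(l, g) for l in languages for g in genders if (l, g) not in p_hat]
```

With only six records spread over ten bins, several bins stay empty, which is the situation the annotation on the table points at.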

9-10 Curse of dimensionality (lite)
- Discrete random variables, e.g.
  - English, Chinese, German, French, ...
  - Male, Female
  - ZIP code
  - Day of the week
  - Operating system
  - ...
- Continuous random variables
  - Income
  - Bandwidth
  - Time
- #bins grows exponentially
- Need many bins per dimension

11 Density Estimation
- [Figure: sample and underlying density]
- Continuous domain = infinite number of bins
- Curse of dimensionality
  - 10 bins on $[0, 1]$ is probably good
  - $10^{10}$ bins on $[0, 1]^{10}$ requires high accuracy in the estimate: probability mass per cell also decreases by a factor of $10^{10}$
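
The exponential growth in the number of bins is easy to check numerically. The sketch below assumes 10 bins per dimension, as on the slide, and a hypothetical sample size of one million points.

```python
def num_bins(bins_per_dim, dims):
    """Total number of cells when every dimension is cut into bins_per_dim bins."""
    return bins_per_dim ** dims

def avg_samples_per_bin(m, bins_per_dim, dims):
    """Average number of the m samples that land in each cell."""
    return m / num_bins(bins_per_dim, dims)

# Hypothetical sample size of one million observations.
for d in (1, 2, 5, 10):
    print(d, num_bins(10, d), avg_samples_per_bin(1_000_000, 10, d))
```

Even a million samples leave far less than one observation per cell in ten dimensions, so per-cell frequency estimates become meaningless.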

12 Bin Counting

13 Bin Counting

14 Bin Counting

15 Bin Counting
- Can't we just go and smooth this out?

16 What is happening?
- Hoeffding's theorem
  $$\Pr\left\{ \left| \mathbf{E}[x] - \frac{1}{m} \sum_{i=1}^{m} x_i \right| > \epsilon \right\} \le 2 e^{-2 m \epsilon^2}$$
  for any average of iid random variables with values in $[0, 1]$.
- Bin counting
  - Random variables $x_i$ are events in bins
  - Apply Hoeffding's theorem to each bin
  - Take the union bound over all bins to guarantee that all estimates converge

17-19 Density Estimation
- Hoeffding's theorem
  $$\Pr\left\{ \left| \mathbf{E}[x] - \frac{1}{m} \sum_{i=1}^{m} x_i \right| > \epsilon \right\} \le 2 e^{-2 m \epsilon^2}$$
- Applying the union bound and Hoeffding
  $$\Pr\left\{ \sup_{a \in A} |\hat{p}(a) - p(a)| > \epsilon \right\} \le \sum_{a \in A} \Pr\left\{ |\hat{p}(a) - p(a)| > \epsilon \right\} \le 2 |A| \exp\left(-2 m \epsilon^2\right)$$
- Solving for error probability $\delta$
  $$\frac{\delta}{2 |A|} \le \exp\left(-2 m \epsilon^2\right) \implies \epsilon \le \sqrt{\frac{\log 2|A| - \log \delta}{2m}}$$
- Annotations: good news, but not good enough; bins are not independent
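
To get a feel for the bound, the sketch below evaluates the deviation $\epsilon$ and the sample size implied by $2|A| \exp(-2m\epsilon^2) = \delta$. The function names and the example numbers are illustrative assumptions, not part of the slides.

```python
import math

def union_bound_epsilon(num_bins, m, delta):
    """Deviation eps for which 2 * |A| * exp(-2 * m * eps^2) equals delta."""
    return math.sqrt((math.log(2 * num_bins) - math.log(delta)) / (2 * m))

def samples_needed(num_bins, eps, delta):
    """Smallest m with 2 * |A| * exp(-2 * m * eps^2) <= delta."""
    return math.ceil((math.log(2 * num_bins) - math.log(delta)) / (2 * eps ** 2))

# Example: 10 bins, 1000 samples, 5% failure probability (illustrative values).
eps = union_bound_epsilon(10, 1000, 0.05)
m_min = samples_needed(10, 0.1, 0.05)
```

The number of bins enters only through $\log 2|A|$, which is the encouraging part; the catch is that $|A|$ itself grows exponentially with the dimension.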

20 Bin Counting

21 Bin Counting
- Can't we just go and smooth this out?

22 Parzen Windows
- Naive approach: use the empirical density (delta distributions)
  $$p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x)$$
  This breaks if we see slightly different instances.
- Kernel density estimate: smear out the empirical density with a nonnegative smoothing kernel $k_x(x')$ satisfying
  $$\int_{\mathcal{X}} k_x(x')\, dx' = 1 \quad \text{for all } x$$

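23 Parzen Windows
- Density estimate: replace the empirical density
  $$p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x)$$
  by the smoothed estimate
  $$\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} k_{x_i}(x)$$
- Smoothing kernels
  - Gauss: $(2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2} x^2}$
  - Laplace: $\frac{1}{2} e^{-|x|}$
  - Epanechnikov: $\frac{3}{4} \max(0, 1 - x^2)$
  - Uniform: $\frac{1}{2} \chi_{[-1, 1]}(x)$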
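
The four kernels and the estimate $\hat{p}(x)$ can be written down directly. This is a minimal one-dimensional Python sketch with unit kernel width; the dictionary and function names are illustrative choices.

```python
import math

# One-dimensional smoothing kernels, each integrating to 1.
KERNELS = {
    "gauss": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),
    "laplace": lambda u: 0.5 * math.exp(-abs(u)),
    "epanechnikov": lambda u: 0.75 * max(0.0, 1.0 - u * u),
    "uniform": lambda u: 0.5 if abs(u) <= 1.0 else 0.0,
}

def parzen_density(x, data, kernel="gauss"):
    """p_hat(x) = (1/m) * sum_i k(x - x_i) for one-dimensional data."""
    k = KERNELS[kernel]
    return sum(k(x - xi) for xi in data) / len(data)
```

For example, `parzen_density(0.5, [0.0, 1.0], "epanechnikov")` averages the kernel value at distance 0.5 from each of the two data points.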

24 Smoothing

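25 Smoothing
    % Assumed layout: X is a d x m matrix of data points (one per column), x is a d x 1 query point
    dist = norm(X - x * ones(1, m), 'columns');   % distance from x to every data point
    p = (1/m) * (2*pi)^(-d/2) * sum(exp(-0.5 * dist.^2))   % Gaussian Parzen window estimate at x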

26 Smoothing

27 Smoothing

28 Size matters

29 Size matters
- Shape matters mostly in theory
- Kernel width
  $$k_{x_i}(x) = r^{-d}\, h\left(\frac{x - x_i}{r}\right)$$
  - Too narrow overfits
  - Too wide smooths toward a constant distribution
- How to choose?
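
The effect of the kernel width $r$ can be seen on a toy example. The sketch below assumes one-dimensional data ($d = 1$) and a Gaussian $h$; the two data points and the widths are arbitrary illustrative values.

```python
import math

def scaled_parzen(x, data, r):
    """Parzen estimate with k_{x_i}(x) = r^{-1} h((x - x_i) / r), Gaussian h, d = 1."""
    h = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(h((x - xi) / r) for xi in data) / (len(data) * r)

data = [0.0, 1.0]  # toy sample

# Too narrow: sharp spikes at the data, almost no mass in between.
narrow_at_data = scaled_parzen(0.0, data, r=0.01)
narrow_between = scaled_parzen(0.5, data, r=0.01)

# Too wide: the estimate is nearly constant over the whole data range.
wide_at_data = scaled_parzen(0.0, data, r=100.0)
wide_between = scaled_parzen(0.5, data, r=100.0)
```

The narrow kernel reproduces the sample almost exactly, while the wide one ignores where the data actually lie.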

30 Model Selection

31 Maximum Likelihood
- Need to measure how well we do
- For density estimation we care about
  $$\Pr\{X\} = \prod_{i=1}^{m} p(x_i)$$
- Maximizing $\Pr\{X\}$ on the training data makes the estimate peak at all data points, since $x_i$ explains $x_i$ best ...
- Maxima are delta functions on data. Overfitting!
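
A short numerical check of this effect, assuming a one-dimensional Gaussian Parzen estimate with width `r` and a toy sample of three points (both are illustrative assumptions): the training log-likelihood keeps growing as the kernel gets narrower.

```python
import math

def kde(x, data, r):
    """Gaussian Parzen window estimate at x with kernel width r (1-D)."""
    norm = len(data) * r * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / r) ** 2) for xi in data) / norm

def train_log_likelihood(data, r):
    """log Pr{X} = sum_i log p_hat(x_i), evaluated on the data used to build p_hat."""
    return sum(math.log(kde(x, data, r)) for x in data)

data = [0.0, 1.0, 2.5]  # toy sample
for r in (1.0, 0.1, 0.01, 0.001):
    print(r, train_log_likelihood(data, r))
```

Each point always contributes its own kernel peak of height proportional to $1/r$, so the training likelihood is unbounded as $r \to 0$.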

32-33 Overfitting
- Likelihood on training set is much higher than typical.
- [Figure: density]

34 Underfitting
- Likelihood on training set is very similar to the typical one.
- Too simple

35-38 Model Selection
- Validation (easy, but wasteful)
  - Use some of the data to estimate density.
  - Use other part to evaluate how well it works
    $$L(X' \mid X) := \frac{1}{n'} \sum_{i=1}^{n'} \log \hat{p}(x'_i)$$
  - Pick the parameter that works best
- Learning Theory (difficult)
  - Use data to build model
  - Measure complexity and use this to bound
    $$\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(x_i) - \mathbf{E}_x\left[\log \hat{p}(x)\right]$$
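
The validation recipe can be sketched as follows, assuming a Gaussian Parzen estimate in one dimension. The synthetic data, the 50/50 split and the candidate widths are all illustrative assumptions.

```python
import math
import random

def kde(x, data, r):
    """Gaussian Parzen window estimate at x with kernel width r (1-D)."""
    norm = len(data) * r * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / r) ** 2) for xi in data) / norm

def validation_score(train, val, r):
    """Average held-out log-likelihood L(X'|X); densities are floored to avoid log(0)."""
    return sum(math.log(max(kde(x, train, r), 1e-300)) for x in val) / len(val)

rng = random.Random(0)                            # synthetic data, for illustration only
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
train, val = data[:100], data[100:]               # half to estimate, half to evaluate
widths = [0.01, 0.05, 0.1, 0.3, 0.5, 1.0, 5.0]    # hypothetical candidate widths
best_r = max(widths, key=lambda r: validation_score(train, val, r))
```

Unlike the training likelihood, the held-out score penalizes both very narrow and very wide kernels, at the price of using only half the data for the estimate.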

39 Model Selection
- Leave-one-out Crossvalidation
  - Use almost all data to estimate density.
  - Use single instance to estimate how well it works
    $$\log p(x_i \mid X \setminus x_i) = \log \frac{1}{n - 1} \sum_{j \neq i} k(x_i, x_j)$$
  - This has huge variance
  - Average over estimates for all training data
  - Pick the parameter that works best
- Simple implementation
  $$\frac{1}{n} \sum_{i=1}^{n} \log \left[ \frac{n}{n - 1}\, p(x_i) - \frac{1}{n - 1}\, k(x_i, x_i) \right] \quad \text{where} \quad p(x) = \frac{1}{n} \sum_{i=1}^{n} k(x_i, x)$$
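
The "simple implementation" formula translates almost literally into code. This sketch assumes a one-dimensional Gaussian kernel of width `r` (a choice made here for illustration); the function names are hypothetical.

```python
import math

def gauss_kernel(x, xi, r):
    """k(x_i, x): Gaussian kernel of width r centered at x_i (1-D)."""
    u = (x - xi) / r
    return math.exp(-0.5 * u * u) / (r * math.sqrt(2 * math.pi))

def loo_log_likelihood(data, r):
    """Leave-one-out score: (1/n) sum_i log[ n/(n-1) p(x_i) - k(x_i, x_i)/(n-1) ]."""
    n = len(data)
    total = 0.0
    for xi in data:
        p_full = sum(gauss_kernel(xi, xj, r) for xj in data) / n
        # Remove the point's own contribution instead of refitting without it.
        p_loo = n / (n - 1) * p_full - gauss_kernel(xi, xi, r) / (n - 1)
        total += math.log(max(p_loo, 1e-300))  # floor guards against log(0)
    return total / n
```

The subtraction removes exactly the term $k(x_i, x_i)$ that would otherwise let each point explain itself, so no density has to be re-estimated $n$ times.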

40 Leave-one-out estimate

41 Optimal estimate

42 Model Selection
- k-fold Crossvalidation
  - Partition data into k blocks (typically 10)
  - Use all but one block to compute estimate
  - Use remaining block as validation set
  - Average over all validation estimates
    $$\frac{1}{k} \sum_{i=1}^{k} l\left(p(X_i \mid X \setminus X_i)\right)$$
- Almost unbiased (e.g. via Luntz and Brailovski, 1969): the error is that of a training set of size $(k-1)/k$ of the data
- Pick best parameter (why must we not check too many?)
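
The k-fold procedure can be sketched as below, again assuming a one-dimensional Gaussian Parzen estimate and taking the average log-likelihood as the score $l$. The interleaved fold assignment assumes the data are already in random order; names are illustrative.

```python
import math

def kde(x, data, r):
    """Gaussian Parzen window estimate at x with kernel width r (1-D)."""
    norm = len(data) * r * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / r) ** 2) for xi in data) / norm

def kfold_log_likelihood(data, r, k=10):
    """Average validation log-likelihood over k interleaved folds.

    Assumes the data are already shuffled and that 2 <= k <= len(data).
    """
    folds = [data[i::k] for i in range(k)]
    fold_scores = []
    for i, val in enumerate(folds):
        # Train on every block except block i, validate on block i.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        score = sum(math.log(max(kde(x, train, r), 1e-300)) for x in val) / len(val)
        fold_scores.append(score)
    return sum(fold_scores) / k
```

Setting `k = len(data)` recovers leave-one-out crossvalidation as a special case.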

43 Watson Nadaraya Estimator Geoff Watson

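44 From density estimation to classification
- Binary classification
  - Estimate $p(x \mid y = 1)$ and $p(x \mid y = -1)$
  - Use Bayes rule
    $$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{\frac{1}{m_y} \sum_{y_i = y} k(x_i, x) \cdot \frac{m_y}{m}}{\frac{1}{m} \sum_i k(x_i, x)}$$
- Decision boundary
  $$p(y = 1 \mid x) - p(y = -1 \mid x) = \frac{\sum_j y_j k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j \frac{k(x_j, x)}{\sum_i k(x_i, x)}$$
  where the fractions $k(x_j, x) / \sum_i k(x_i, x)$ are local weights.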

46 Watson-Nadaraya Classifier

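47 Watson-Nadaraya Classifier
    % Assumed layout: X is d x m (one point per column), y is 1 x m with labels in {-1, +1}, x is a d x 1 query
    dist = norm(X - x * ones(1, m), 'columns');   % distance from x to every data point
    f = sum(y .* exp(-0.5 * dist.^2));            % sign(f) is the predicted label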
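
For readers without Octave, here is a hedged Python sketch of the same classifier. It additionally normalizes by the sum of the kernel weights, so the score estimates $p(y = 1 \mid x) - p(y = -1 \mid x)$. The width parameter `r` and the return convention are assumptions made here, and points are stored as rows rather than columns.

```python
import math

def wn_classify(x, X, y, r=1.0):
    """Return (label, score) with score = sum_j y_j w_j(x) in [-1, 1] and label = its sign."""
    weights = [
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, xi)) / r ** 2)
        for xi in X
    ]
    total = sum(weights)
    if total == 0.0:
        return 0, 0.0  # every kernel weight underflowed: no evidence either way
    score = sum(w * yi for w, yi in zip(weights, y)) / total
    label = 1 if score > 0 else -1 if score < 0 else 0
    return label, score
```

Dropping the normalization, as the Octave line does, leaves the sign and therefore the predicted label unchanged.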

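48 Watson-Nadaraya Regression
- Binary classification
  $$p(y = 1 \mid x) - p(y = -1 \mid x) = \frac{\sum_j y_j k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j \frac{k(x_j, x)}{\sum_i k(x_i, x)}$$
- Regression: use the same weighted expansion
  $$\hat{y}(x) = \sum_j y_j \frac{k(x_j, x)}{\sum_i k(x_i, x)}$$
  with labels $y_j$ and local weights $k(x_j, x) / \sum_i k(x_i, x)$.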
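
The regression estimate is a kernel-weighted average of the observed labels. A minimal one-dimensional Python sketch, assuming a Gaussian kernel of width `r` (the width parameter and the error handling are choices made here):

```python
import math

def wn_regress(x, xs, ys, r=1.0):
    """y_hat(x) = sum_j y_j k(x_j, x) / sum_i k(x_i, x) with a Gaussian kernel (1-D)."""
    weights = [math.exp(-0.5 * ((x - xi) / r) ** 2) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; use a larger width r")
    return sum(w * yi for w, yi in zip(weights, ys)) / total
```

Because the weights are nonnegative and sum to one, the prediction always lies between the smallest and largest observed label.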

50 Watson-Nadaraya regression estimate

51 Silverman's Rule
Bernard Silverman

52 Silverman's rule
- Chicken and egg problem
  - Want wide kernel for low density region
  - Want narrow kernel where we have much data
  - Need density estimate to estimate density
- Simple hack: use average distance from k nearest neighbors
  $$r_i = \frac{r}{k} \sum_{x \in \mathrm{NN}(x_i, k)} \|x_i - x\|$$
- Nonuniform bandwidth for smoother.
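
The nearest-neighbor bandwidth hack can be sketched as follows for one-dimensional data with a Gaussian kernel. The function names are illustrative, and the sketch assumes the data contain no exact duplicates (which would give a width of zero).

```python
import math

def adaptive_widths(data, k, r=1.0):
    """r_i = (r / k) * sum of distances from x_i to its k nearest neighbors (1-D)."""
    widths = []
    for i, xi in enumerate(data):
        dists = sorted(abs(xi - xj) for j, xj in enumerate(data) if j != i)
        widths.append(r * sum(dists[:k]) / k)
    return widths

def adaptive_density(x, data, widths):
    """Gaussian Parzen estimate in which data point x_i gets its own width r_i."""
    total = 0.0
    for xi, ri in zip(data, widths):
        u = (x - xi) / ri
        total += math.exp(-0.5 * u * u) / (ri * math.sqrt(2 * math.pi))
    return total / len(data)
```

Isolated points receive wide kernels and points in dense regions receive narrow ones, while the estimate as a whole still integrates to one.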

53 Density true density

54 non adaptive estimate

55 adaptive estimate

56 distance distribution

57 Nearest Neighbor

58 Nearest Neighbors
- Table lookup: for a previously seen instance, remember the label
- Nearest neighbor: pick the label of the most similar neighbor
- Slight improvement: use k-nearest neighbors
- For regression, average
- Really useful baseline! Easy to implement for small amounts of data.
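
A brute-force version of this baseline fits in a few lines of Python. The sketch assumes Euclidean distance as the similarity measure and a majority vote for classification; both are choices made here for illustration.

```python
import math
from collections import Counter

def knn_predict(x, X, y, k=1, regression=False):
    """Predict at x from the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))
    nearest = [y[i] for i in order[:k]]
    if regression:
        return sum(nearest) / len(nearest)          # average of neighbor targets
    return Counter(nearest).most_common(1)[0][0]    # majority label among neighbors
```

Sorting all training points costs O(m log m) per query, which is fine for small data sets and is what the faster lookup structures later in the deck aim to avoid.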

59 Relation to Watson-Nadaraya
- Watson-Nadaraya estimator
  $$\hat{y}(x) = \sum_j y_j \frac{k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j w_j(x)$$
- Nearest neighbor estimator
  $$\hat{y}(x) = \sum_j y_j \frac{k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j w_j(x)$$
- Neighborhood function is a hard threshold.

60 1-Nearest Neighbor

61 4-Nearest Neighbors

62 4-Nearest Neighbors Sign

63 If we get more data
- 1-Nearest Neighbor
  - Converges to perfect solution if the classes are separated
  - Error rate $2p(1-p)$ for noisy problems, which is at most twice the minimal error rate
- k-Nearest Neighbor
  - Converges to perfect solution if the classes are separated (but needs more data)
  - Converges to minimal error $\min(p, 1-p)$ for noisy problems (use increasing k)

64 1-Nearest Neighbor
- For a given point $x$ take an $\epsilon$-neighborhood $N$ with probability mass $> d/n$
- The probability that at least one of the $n$ points falls in this neighborhood is at least $1 - e^{-d}$, so the failure probability $e^{-d}$ can be made small
- Assume that the probability mass doesn't change much in the neighborhood
- The probability that the labels of query and point do not match is $2p(1-p)$ (up to some approximation error in the neighborhood)
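
The two quantities in this argument follow from short calculations, sketched here under the assumptions that the $n$ points are drawn independently and that the query label $y$ and the neighbor label $y'$ are drawn independently with the same $p = p(y = 1 \mid x)$ (the symbols $y$, $y'$ are introduced here for illustration):

```latex
\begin{align*}
\Pr\{\text{no point in } N\} &\le \left(1 - \frac{d}{n}\right)^{n} \le e^{-d}
  && \text{since } 1 - t \le e^{-t}, \\
\Pr\{\text{at least one point in } N\} &\ge 1 - e^{-d}, \\
\Pr\{y \neq y'\} &= p(1 - p) + (1 - p)\,p = 2p(1 - p) \le 2 \min(p, 1 - p).
\end{align*}
```

The last inequality is what makes the 1-nearest-neighbor error at most twice the minimal error $\min(p, 1 - p)$.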

65 k-Nearest Neighbor
- For a given point $x$ take an $\epsilon$-neighborhood $N$ with probability mass $> dk/n$
- Small probability that we don't have at least k points in the neighborhood.
- Assume that the probability mass doesn't change much in the neighborhood
- Bound the probability that the majority of points doesn't match the majority for $p$ (e.g. via Hoeffding's theorem for the tail). Show that it vanishes
- Error is therefore $\min(p, 1-p)$, i.e. Bayes optimal error.

66 Fast lookup
- KD trees (Moore et al.)
  - Partition space (one dimension at a time)
  - Only search for subset that contains point
- Cover trees (Beygelzimer et al.)
  - Hierarchically partition space with distance guarantees
  - No need for nonoverlapping sets
  - Bounded number of paths to follow (logarithmic time lookup)
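
The kd-tree idea can be illustrated with a small Python sketch. This is a simplified median-split variant written for this transcript, not the construction from the cited papers, and cover trees are not covered; the dictionary-based node layout is an assumption.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points on one coordinate at a time (median split)."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = math.dist(query, node["point"])
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best
```

The pruning test is what lets the search skip most of the tree: a whole subtree is ignored whenever the splitting plane is already farther away than the best candidate found.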

67 Summary
- Parzen Windows: kernels, algorithm
- Model selection: cross-validation, leave-one-out, bias-variance
- Watson-Nadaraya estimator: classification, regression, novelty detection
- Nearest Neighbor estimator: limit case of Parzen Windows

68 Further Reading
- Cover tree homepage (paper & code)
- (kd trees, original paper)
- (Andrew Moore's tutorial from his PhD thesis)
- Nadaraya's regression estimator (1964)
- Watson's regression estimator (1964)
- Watson-Nadaraya regression package in R
- Stone's k-NN regression consistency proof
- Cover and Hart's k-NN classification consistency proof
- Tom Cover's rate analysis for k-NN: Rates of Convergence for Nearest Neighbor Procedures
- Sanjoy Dasgupta's analysis for k-NN estimation with selective sampling
- Multiedit & Condense (Dasarathy, Sanchez, Townsend)
- Geometric approximation via core sets
