Machine Learning: Instance-Based Learning (aka Nonparametric Methods)

Machine Learning: Instance-Based Learning (aka nonparametric methods)
CSE 446 Machine Learning, Daniel Weld, March 7, 2012, with slides from Carlos Guestrin & Bishop.
[Title-slide diagram: supervised learning, unsupervised learning, reinforcement learning; parametric vs. non-parametric.]

What is Being Learned? Space of ML Problems
The space of ML problems is organized by what is being learned and by the type of supervision (e.g., experience, feedback):
- Discrete function, labeled examples: classification
- Continuous function, labeled examples: regression
- Policy, labeled examples: apprenticeship learning
- Policy, reward: reinforcement learning
- No supervision ("nothing"): clustering

Nonparametric Methods
Parametric models are restricted to specific forms and may not be a good fit. Nonparametric methods make few assumptions and often work extremely well in practice.

Todo
Histograms: the Bishop approach is too theoretical; don't do it!

Histograms
Histogram methods partition the data space into distinct bins with widths $\Delta_i$ and count the number of observations, $n_i$, in each bin. Often the same width is used for all bins, $\Delta_i = \Delta$; $\Delta$ acts as a smoothing parameter. In a $D$-dimensional space, using $M$ bins in each dimension will require $M^D$ bins! Why is this bad?
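To make the histogram estimate and the $M^D$ blow-up concrete, here is a minimal sketch, assuming only NumPy; the toy data, the bin count, and the normalization $p_i = n_i / (N\Delta)$ are illustrative choices rather than anything specified on the slides.

```python
# Minimal sketch of histogram density estimation (NumPy only; data and
# bin count are illustrative, not taken from the lecture).
import numpy as np

def histogram_density(data, num_bins=10):
    """Estimate p(x) by counting observations n_i in equal-width bins."""
    lo, hi = data.min(), data.max()
    delta = (hi - lo) / num_bins                      # common bin width Delta (the smoothing parameter)
    counts, edges = np.histogram(data, bins=num_bins, range=(lo, hi))
    densities = counts / (len(data) * delta)          # p_i = n_i / (N * Delta), integrates to 1
    return densities, edges

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
p_hat, edges = histogram_density(data, num_bins=20)

# In D dimensions with M bins per dimension the table has M**D cells,
# which is why histograms break down quickly as D grows:
M, D = 20, 10
print("bins needed in", D, "dimensions:", M ** D)
```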

Nonparametric Methods (3.5)
Assume observations are drawn from a density $p(\mathbf{x})$ and consider a small region $R$ containing $\mathbf{x}$ such that
$$P = \int_R p(\mathbf{x})\,d\mathbf{x}.$$
The probability that $K$ out of $N$ observations lie inside $R$ is $\mathrm{Bin}(K \mid N, P)$, and if $N$ is large,
$$K \simeq N P.$$
If $V$ (the volume of $R$) is sufficiently small, $p(\mathbf{x})$ is approximately constant over $R$ and
$$P \simeq p(\mathbf{x})\,V.$$
Thus
$$p(\mathbf{x}) \simeq \frac{K}{N V}.$$
Hold $K$ fixed and determine $V$ from the data: the K-nearest-neighbour approach. Hold $V$ fixed and determine $K$ from the data: kernel density estimation.

Kernel Density Estimation
Fix $V$, estimate $K$ from the data. Let $R$ be a hypercube centred on $\mathbf{x}$ and define the kernel function (Parzen window)
$$k(\mathbf{u}) = \begin{cases} 1 & \lvert u_i \rvert \le 1/2, \; i = 1, \dots, D \\ 0 & \text{otherwise.} \end{cases}$$
It follows that
$$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$$
and hence
$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right).$$
Any kernel such that $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\,d\mathbf{u} = 1$ will work. $h$ acts as a smoother.

Kernel Density Estimation
To avoid discontinuities in $p(\mathbf{x})$, use a smooth kernel, e.g. a Gaussian:
$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^2}{2h^2}\right).$$

Nearest Neighbour Density Estimation
Fix $K$, estimate $V$ from the data. Consider a hypersphere centred on $\mathbf{x}$ and let it grow to a volume $V^*$ that includes $K$ of the $N$ data points. Then
$$p(\mathbf{x}) \simeq \frac{K}{N V^*}.$$
$K$ acts as a smoother. (A numeric sketch of both estimators follows at the end of this section.)

Pros / Cons
Nonparametric models (besides histograms) require storing and computing with the entire data set. Parametric models, once fitted, are much more efficient in terms of storage and computation.
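Here is a minimal one-dimensional sketch of the two estimators above (a Gaussian kernel density estimate and the K-NN density estimate), assuming NumPy; the toy data, the bandwidth $h$, and the value of $K$ are illustrative choices.

```python
# Minimal 1-D sketches of kernel density estimation and K-NN density estimation.
import numpy as np

def gaussian_kde(x, data, h):
    """p(x) = (1/N) sum_n N(x | x_n, h^2) for 1-D data."""
    z = (x - data) / h
    return np.mean(np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi)))

def knn_density(x, data, K):
    """p(x) ~= K / (N * V), where V is the smallest interval centred on x
    that contains K of the N points (the 1-D 'hypersphere')."""
    dists = np.sort(np.abs(data - x))
    V = 2 * dists[K - 1]                 # length of the interval of radius r_K
    return K / (len(data) * V)

rng = np.random.default_rng(0)
data = rng.normal(size=500)
print(gaussian_kde(0.0, data, h=0.3))    # should be near N(0 | 0, 1) = 0.399
print(knn_density(0.0, data, K=25))
```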

K-Nearest-Neighbours for Classification
Given a data set with $N_k$ data points from class $C_k$, so that $\sum_k N_k = N$, consider a sphere of volume $V$ centred on $\mathbf{x}$ that contains $K$ points, $K_k$ of them from class $C_k$. We have
$$p(\mathbf{x} \mid C_k) = \frac{K_k}{N_k V}$$
and correspondingly
$$p(\mathbf{x}) = \frac{K}{N V}.$$
Since $p(C_k) = N_k / N$, Bayes' theorem gives
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,p(C_k)}{p(\mathbf{x})} = \frac{K_k}{K}.$$
[Figure: K-NN decision boundaries for K = 3 and K = 1.]
$K$ acts as a smoother. As $N \to \infty$, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions). (A short classification sketch follows at the end of this section.)

Linear Regression: What can go wrong?
What do we do if the bias is too strong? We might want the data to drive the complexity of the model! Try instance-based learning (a.k.a. nonparametric methods): using data to predict new data.

Nearest neighbor with lots of data
[Figure: a scatter plot of Y against X with many data points.]
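A minimal k-NN classification sketch matching the derivation above: the predicted class is the one with the largest $K_k$ among the $K$ nearest points, i.e. the majority vote. NumPy is assumed, and the tiny two-class dataset is purely illustrative.

```python
# Minimal k-NN classifier: majority class among the K nearest training points.
import numpy as np

def knn_classify(query, X, y, K=3):
    dists = np.linalg.norm(X - query, axis=1)           # Euclidean distances to all N points
    nearest = np.argsort(dists)[:K]                     # indices of the K nearest neighbours
    classes, counts = np.unique(y[nearest], return_counts=True)
    return classes[np.argmax(counts)]                   # argmax_k K_k, i.e. argmax_k p(C_k | x)

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 1, 1, 1])
print(knn_classify(np.array([0.1, 0.2]), X, y, K=3))    # -> 0
```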

Univariate 1-Nearest Neighbor
Given data points $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$, where we assume $y_i = f(x_i)$ for some unknown function $f$, and a query point $x_q$, your job is to predict $\hat{y} \simeq f(x_q)$.
Nearest Neighbor:
1. Find the closest $x_i$ in our set of data points: $nn = \arg\min_i \lvert x_i - x_q \rvert$.
2. Predict $\hat{y} = y_{nn}$.
[Figure: a dataset with one input, one output, and four data points; the closest data point to the query is highlighted.]
(A one-function predictor sketch follows at the end of this section.)

1-Nearest Neighbor is an example of instance-based learning, a function approximator that has been around since about 1910. The stored database is simply the pairs $(x_1, y_1), \dots, (x_N, y_N)$; to make a prediction, search the database for similar data points and fit with the local points. To define an instance-based learner, specify four things:
1. A distance metric
2. How many nearby neighbors to look at?
3. A weighting function (optional)
4. How to fit with the local points?

1-Nearest Neighbor
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor.

Multivariate 1-NN examples
[Figure: small example tables for multivariate 1-NN classification and regression.]

Notable distance metrics (and their level sets)
- Scaled Euclidean ($L_2$)
- $L_1$ norm (absolute)
- $L_\infty$ (max) norm
- Mahalanobis (here $\Sigma$ is not necessarily diagonal, but is symmetric)

Consistency of 1-NN
Consider an estimator $f_n$ trained on $n$ examples (e.g., 1-NN, neural nets, regression, ...). The estimator is consistent if the true error goes to zero as the amount of data increases; e.g., for noise-free data, it is consistent if, for any data distribution $p(x)$,
$$\mathrm{MSE}(f_n) = \int_x p(x)\,\lvert f_n(x) - y(x) \rvert^2\,dx \;\to\; 0 \quad \text{as } n \to \infty.$$
Linear regression is not consistent! (Representation bias.) 1-NN is consistent. What about noisy data? What about variance?
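As promised above, a minimal sketch of the univariate 1-nearest-neighbour predictor, assuming NumPy; the four-point dataset is illustrative.

```python
# Minimal univariate 1-NN prediction: return the output of the closest stored point.
import numpy as np

def one_nn_predict(x_query, xs, ys):
    """Return y_nn where nn = argmin_i |x_i - x_query|."""
    nn = np.argmin(np.abs(xs - x_query))
    return ys[nn]

xs = np.array([1.0, 2.0, 3.0, 5.0])
ys = np.array([2.1, 2.9, 3.8, 5.2])
print(one_nn_predict(2.4, xs, ys))   # closest point is x = 2.0, so predict 2.9
```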

1-NN overfits?

k-Nearest Neighbor
Instance-based learning, four things to specify:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): unused
4. How to fit with the local points? Return the average output: predict $\hat{y} = \frac{1}{k}\sum y_i$, summing over the k nearest neighbors.
[Figure: nearest-neighbor regions in input space for k = 1 and k = 9.]
What can we do about the discontinuities?

Weighted distance metrics
Suppose the input vectors $x_1, x_2, \dots, x_N$ are two-dimensional: $x_1 = (x_{11}, x_{12})$, $x_2 = (x_{21}, x_{22})$, ..., $x_N = (x_{N1}, x_{N2})$. Compare
$$\mathrm{Dist}(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2$$
with
$$\mathrm{Dist}(x_i, x_j) = (x_{i1} - x_{j1})^2 + (3x_{i2} - 3x_{j2})^2.$$
Which is better? The relative scaling of the distance metric affects region shapes.

Weighted Euclidean distance metric
$$D(x, x') = \sqrt{\sum_i \sigma_i^2 (x_i - x'_i)^2},$$
or equivalently
$$D(x, x') = \sqrt{(x - x')^T \Sigma (x - x')}, \quad \text{where } \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_D^2).$$
Other metrics: Mahalanobis, rank-based, correlation-based, ... (A k-NN regression sketch using such a weighted distance follows at the end of this section.)

Kernel Regression
Instance-based learning:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? All of them.
3. A weighting function: $w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$. Nearby points to the query are weighted strongly, far points weakly. The $K_w$ parameter is the kernel width; it is very important.
4. How to fit with the local points? Predict the weighted average of the outputs: $\text{predict} = \sum_i w_i y_i \,/\, \sum_i w_i$.
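As noted above, here is a minimal sketch combining two ideas from this section: k-NN regression (average of the k nearest outputs) under a weighted Euclidean distance with per-feature scales $\sigma_i$. NumPy is assumed; the data and scale values are illustrative.

```python
# Minimal k-NN regression with a weighted Euclidean distance metric.
import numpy as np

def weighted_euclidean(a, b, scales):
    """D(x, x') = sqrt(sum_i sigma_i^2 (x_i - x'_i)^2)."""
    return np.sqrt(np.sum((scales * (a - b)) ** 2))

def knn_regress(query, X, y, k=3, scales=None):
    scales = np.ones(X.shape[1]) if scales is None else scales
    dists = np.array([weighted_euclidean(query, x, scales) for x in X])
    nearest = np.argsort(dists)[:k]
    return y[nearest].mean()                  # (1/k) * sum of the k nearest outputs

X = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.2], [3.0, 0.3]])
y = np.array([0.0, 1.0, 2.0, 3.0])
# Up-weighting the first feature (sigma_1 = 3) changes which points count as "near".
print(knn_regress(np.array([1.5, 0.15]), X, y, k=2, scales=np.array([3.0, 1.0])))
```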

Many possible weighting functions
(Our examples use a Gaussian.)

Kernel regression predictions
$w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$. Typically: choose $D$ manually and optimize $K_w$ using gradient descent.
[Figure: kernel regression predictions for $K_w$ = 10, 20, and 80.]
Increasing the kernel width $K_w$ means further-away points get an opportunity to influence you. As $K_w \to \infty$, the prediction tends to the global average. (A short kernel-regression sketch follows at the end of this section.)

Kernel regression on our test cases
[Figure: nearest neighbor with k = 9 versus kernel regression with $K_w$ = 1/32, 1/32, and 1/16 of the x-axis width on the three test datasets.]
Choosing a good $K_w$ is important! Remind you of anything we have seen?

Kernel regression: problem solved?
[Figure: kernel regression with the best $K_w$ on each test case.]
Where are we having problems? Sometimes in the middle; generally, on the ends (extrapolation is hard!). Time to try something more powerful!

Locally weighted regression
Kernel regression: take a very, very conservative function approximator called AVERAGING and locally weight it.
Locally weighted regression: take a conservative function approximator called LINEAR REGRESSION and locally weight it.

Locally weighted regression
Instance-based learning, four things to specify:
1. A distance metric: any
2. How many nearby neighbors to look at? All of them.
3. A weighting function (optional): kernels, $w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$
4. How to fit with the local points? General weighted regression:
$$\hat{\beta} = \arg\min_{\beta} \sum_{k=1}^{N} w_k^2 \left(y_k - \beta^T x_k\right)^2.$$
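A minimal kernel-regression sketch following the recipe above: Gaussian-style weights $w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$ over all points, and a prediction that is the weighted average of the outputs. NumPy is assumed; the data and the $K_w$ values are illustrative.

```python
# Minimal kernel regression: weighted average of all outputs.
import numpy as np

def kernel_regress(query, X, y, Kw=0.5):
    d2 = np.sum((X - query) ** 2, axis=1)        # squared distances D(x_i, query)^2
    w = np.exp(-d2 / Kw ** 2)                    # nearby points weighted strongly, far points weakly
    return np.sum(w * y) / np.sum(w)             # predict = sum_i w_i y_i / sum_i w_i

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=50)
print(kernel_regress(np.array([0.25]), X, y, Kw=0.1))    # close to sin(pi/2) = 1
# As Kw grows, the weights flatten and the prediction tends to the global average:
print(kernel_regress(np.array([0.25]), X, y, Kw=100.0), y.mean())
```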

How LWR works
Solving a weighted regression for each query yields a complex surface. (A per-query LWR sketch follows at the end of this section.)
Linear regression uses the same parameters for all queries:
$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$
Locally weighted regression solves a weighted linear regression for each query:
$$\hat{\beta} = \left((WX)^T WX\right)^{-1} (WX)^T WY, \quad \text{where } W = \mathrm{diag}(w_1, w_2, \dots, w_n).$$

LWR on our test cases
[Figure: kernel regression ($K_w$ = 1/32, 1/32, 1/16 of the x-axis width) versus LWR ($K_w$ = 1/16, 1/32, 1/8 of the x-axis width) on the three test datasets.]

Locally weighted polynomial regression
[Figure: kernel regression with the kernel width $K_w$ at its optimal level: $K_w$ = 1/100, 1/40, and 1/15 of the x-axis.]
Local quadratic regression is easy: just add quadratic terms to the $(WX)^T WX$ matrix. As the regression degree increases, the kernel width can increase without introducing bias.

Challenges for instance-based learning
Must store and retrieve all data! Most real work is done during testing: for every test sample we must search through the entire dataset, which is very slow. (But there are fast methods for dealing with large datasets.)
Instance-based learning is often poor with noisy or irrelevant features. In high-dimensional spaces, all points will be very far from each other; we can need a number of examples that is exponential in the dimension of X. But sometimes you are OK if you are clever about features.

Curse of the irrelevant feature
[Figure: a two-dimensional example ($X_1$ versus $X_2$) where an irrelevant feature breaks nearest neighbor.]
This is a contrived example, but similar problems are common in practice. Need some form of feature selection!
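As mentioned above, here is a minimal locally weighted regression sketch matching the per-query solve $\hat{\beta} = ((WX)^T WX)^{-1} (WX)^T WY$ with $W = \mathrm{diag}(w_1, \dots, w_n)$. NumPy is assumed; the toy data, the kernel width, and the added intercept column are illustrative choices.

```python
# Minimal locally weighted regression: solve a weighted least-squares problem per query.
import numpy as np

def lwr_predict(query, X, y, Kw=0.3):
    Xa = np.hstack([np.ones((len(X), 1)), X])              # add an intercept column
    qa = np.concatenate([[1.0], query])
    d2 = np.sum((X - query) ** 2, axis=1)
    W = np.diag(np.exp(-d2 / Kw ** 2))                     # W = diag(w_1, ..., w_n)
    WX, WY = W @ Xa, W @ y
    beta = np.linalg.solve(WX.T @ WX, WX.T @ WY)           # ((WX)^T WX)^{-1} (WX)^T WY
    return qa @ beta                                       # evaluate the local line at the query

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=100)
print(lwr_predict(np.array([0.25]), X, y, Kw=0.2))         # close to 1.0
```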

Curse of Dimensionality
[Figure: the fraction of the volume of a sphere lying in the range $r \in [1 - \epsilon, 1]$, which grows rapidly with the dimension; Gaussian densities in higher dimensions.]
(A small numeric illustration of this shell effect appears at the end of these notes.)

Side-Stepping the Curse
Dimensionality reduction (e.g., PCA), then use NN.

In Summary: Instance-Based Learning
kNN: the simplest learning algorithm. With sufficient data, this strawman approach is very hard to beat. Picking k?
Kernel regression: set k to n (the number of data points) and optimize the weights by gradient descent. Smoother than kNN.
Locally weighted regression: generalizes kernel regression; not just a local average.
Curse of dimensionality: must remember the (very large) dataset for prediction, and irrelevant features are often killers for instance-based approaches.

Acknowledgment
This lecture contains some material from Andrew Moore's excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials
Carlos Guestrin 2005-2009
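A small numeric check of the shell effect behind the curse-of-dimensionality figure above: the fraction of a unit sphere's volume lying in the outer shell $r \in [1 - \epsilon, 1]$ is $1 - (1 - \epsilon)^D$, which approaches 1 as the dimension $D$ grows. Plain Python; $\epsilon = 0.05$ is an illustrative choice.

```python
# Fraction of a unit sphere's volume in the thin outer shell r in [1 - eps, 1].
# Volume scales as r**D, so the shell fraction is 1 - (1 - eps)**D.
eps = 0.05
for D in (1, 2, 10, 50, 200):
    frac = 1 - (1 - eps) ** D
    print(f"D = {D:4d}: fraction of volume in the outer 5% shell = {frac:.3f}")
```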