Machine Learning: Instance-Based Learning (aka Nonparametric Methods)

Machine Learning: Instance-Based Learning (aka nonparametric methods)
CSE 446 Machine Learning, Daniel Weld, March 7, 2012, with slides from Carlos Guestrin & Bishop.
[Title-slide diagram: supervised learning, unsupervised learning, reinforcement learning; parametric vs. non-parametric.]

What is Being Learned? Space of ML Problems
The space of ML problems is organized by what is being learned and by the type of supervision (e.g., experience, feedback):
- Discrete function, labeled examples: classification
- Continuous function, labeled examples: regression
- Policy, labeled examples: apprenticeship learning
- Policy, reward: reinforcement learning
- No supervision ("nothing"): clustering

Nonparametric Methods
Parametric models are restricted to specific forms and may not be a good fit. Nonparametric methods make few assumptions and often work extremely well in practice.

Todo
Histograms: the Bishop approach is too theoretical; don't do it!

Histograms
Histogram methods partition the data space into distinct bins with widths $\Delta_i$ and count the number of observations, $n_i$, in each bin. Often the same width is used for all bins, $\Delta_i = \Delta$; $\Delta$ acts as a smoothing parameter. In a $D$-dimensional space, using $M$ bins in each dimension will require $M^D$ bins! Why is this bad?
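To make the histogram estimate and the $M^D$ blow-up concrete, here is a minimal sketch, assuming only NumPy; the toy data, the bin count, and the normalization $p_i = n_i / (N\Delta)$ are illustrative choices rather than anything specified on the slides.

```python
# Minimal sketch of histogram density estimation (NumPy only; data and
# bin count are illustrative, not taken from the lecture).
import numpy as np

def histogram_density(data, num_bins=10):
    """Estimate p(x) by counting observations n_i in equal-width bins."""
    lo, hi = data.min(), data.max()
    delta = (hi - lo) / num_bins                      # common bin width Delta (the smoothing parameter)
    counts, edges = np.histogram(data, bins=num_bins, range=(lo, hi))
    densities = counts / (len(data) * delta)          # p_i = n_i / (N * Delta), integrates to 1
    return densities, edges

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
p_hat, edges = histogram_density(data, num_bins=20)

# In D dimensions with M bins per dimension the table has M**D cells,
# which is why histograms break down quickly as D grows:
M, D = 20, 10
print("bins needed in", D, "dimensions:", M ** D)
```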

Nonparametric Methods (3.5)
Assume observations are drawn from a density $p(\mathbf{x})$ and consider a small region $R$ containing $\mathbf{x}$ such that
$$P = \int_R p(\mathbf{x})\,d\mathbf{x}.$$
The probability that $K$ out of $N$ observations lie inside $R$ is $\mathrm{Bin}(K \mid N, P)$, and if $N$ is large,
$$K \simeq N P.$$
If $V$ (the volume of $R$) is sufficiently small, $p(\mathbf{x})$ is approximately constant over $R$ and
$$P \simeq p(\mathbf{x})\,V.$$
Thus
$$p(\mathbf{x}) \simeq \frac{K}{N V}.$$
Hold $K$ fixed and determine $V$ from the data: the K-nearest-neighbour approach. Hold $V$ fixed and determine $K$ from the data: kernel density estimation.

Kernel Density Estimation
Fix $V$, estimate $K$ from the data. Let $R$ be a hypercube centred on $\mathbf{x}$ and define the kernel function (Parzen window)
$$k(\mathbf{u}) = \begin{cases} 1 & \lvert u_i \rvert \le 1/2, \; i = 1, \dots, D \\ 0 & \text{otherwise.} \end{cases}$$
It follows that
$$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$$
and hence
$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right).$$
Any kernel such that $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\,d\mathbf{u} = 1$ will work. $h$ acts as a smoother.

Kernel Density Estimation
To avoid discontinuities in $p(\mathbf{x})$, use a smooth kernel, e.g. a Gaussian:
$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^2}{2h^2}\right).$$

Nearest Neighbour Density Estimation
Fix $K$, estimate $V$ from the data. Consider a hypersphere centred on $\mathbf{x}$ and let it grow to a volume $V^*$ that includes $K$ of the $N$ data points. Then
$$p(\mathbf{x}) \simeq \frac{K}{N V^*}.$$
$K$ acts as a smoother. (A numeric sketch of both estimators follows at the end of this section.)

Pros / Cons
Nonparametric models (besides histograms) require storing and computing with the entire data set. Parametric models, once fitted, are much more efficient in terms of storage and computation.
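Here is a minimal one-dimensional sketch of the two estimators above (a Gaussian kernel density estimate and the K-NN density estimate), assuming NumPy; the toy data, the bandwidth $h$, and the value of $K$ are illustrative choices.

```python
# Minimal 1-D sketches of kernel density estimation and K-NN density estimation.
import numpy as np

def gaussian_kde(x, data, h):
    """p(x) = (1/N) sum_n N(x | x_n, h^2) for 1-D data."""
    z = (x - data) / h
    return np.mean(np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi)))

def knn_density(x, data, K):
    """p(x) ~= K / (N * V), where V is the smallest interval centred on x
    that contains K of the N points (the 1-D 'hypersphere')."""
    dists = np.sort(np.abs(data - x))
    V = 2 * dists[K - 1]                 # length of the interval of radius r_K
    return K / (len(data) * V)

rng = np.random.default_rng(0)
data = rng.normal(size=500)
print(gaussian_kde(0.0, data, h=0.3))    # should be near N(0 | 0, 1) = 0.399
print(knn_density(0.0, data, K=25))
```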

K-Nearest-Neighbours for Classification
Given a data set with $N_k$ data points from class $C_k$, so that $\sum_k N_k = N$, consider a sphere of volume $V$ centred on $\mathbf{x}$ that contains $K$ points, $K_k$ of them from class $C_k$. We have
$$p(\mathbf{x} \mid C_k) = \frac{K_k}{N_k V}$$
and correspondingly
$$p(\mathbf{x}) = \frac{K}{N V}.$$
Since $p(C_k) = N_k / N$, Bayes' theorem gives
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,p(C_k)}{p(\mathbf{x})} = \frac{K_k}{K}.$$
[Figure: K-NN decision boundaries for K = 3 and K = 1.]
$K$ acts as a smoother. As $N \to \infty$, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions). (A short classification sketch follows at the end of this section.)

Linear Regression: What can go wrong?
What do we do if the bias is too strong? We might want the data to drive the complexity of the model! Try instance-based learning (a.k.a. nonparametric methods): using data to predict new data.

Nearest neighbor with lots of data
[Figure: a scatter plot of Y against X with many data points.]
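A minimal k-NN classification sketch matching the derivation above: the predicted class is the one with the largest $K_k$ among the $K$ nearest points, i.e. the majority vote. NumPy is assumed, and the tiny two-class dataset is purely illustrative.

```python
# Minimal k-NN classifier: majority class among the K nearest training points.
import numpy as np

def knn_classify(query, X, y, K=3):
    dists = np.linalg.norm(X - query, axis=1)           # Euclidean distances to all N points
    nearest = np.argsort(dists)[:K]                     # indices of the K nearest neighbours
    classes, counts = np.unique(y[nearest], return_counts=True)
    return classes[np.argmax(counts)]                   # argmax_k K_k, i.e. argmax_k p(C_k | x)

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 1, 1, 1])
print(knn_classify(np.array([0.1, 0.2]), X, y, K=3))    # -> 0
```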

Univariate 1-Nearest Neighbor
Given data points $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$, where we assume $y_i = f(x_i)$ for some unknown function $f$, and a query point $x_q$, your job is to predict $\hat{y} \simeq f(x_q)$.
Nearest Neighbor:
1. Find the closest $x_i$ in our set of data points: $nn = \arg\min_i \lvert x_i - x_q \rvert$.
2. Predict $\hat{y} = y_{nn}$.
[Figure: a dataset with one input, one output, and four data points; the closest data point to the query is highlighted.]
(A one-function predictor sketch follows at the end of this section.)

1-Nearest Neighbor is an example of instance-based learning, a function approximator that has been around since about 1910. The stored database is simply the pairs $(x_1, y_1), \dots, (x_N, y_N)$; to make a prediction, search the database for similar data points and fit with the local points. To define an instance-based learner, specify four things:
1. A distance metric
2. How many nearby neighbors to look at?
3. A weighting function (optional)
4. How to fit with the local points?

1-Nearest Neighbor
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor.

Multivariate 1-NN examples
[Figure: small example tables for multivariate 1-NN classification and regression.]

Notable distance metrics (and their level sets)
- Scaled Euclidean ($L_2$)
- $L_1$ norm (absolute)
- $L_\infty$ (max) norm
- Mahalanobis (here $\Sigma$ is not necessarily diagonal, but is symmetric)

Consistency of 1-NN
Consider an estimator $f_n$ trained on $n$ examples (e.g., 1-NN, neural nets, regression, ...). The estimator is consistent if the true error goes to zero as the amount of data increases; e.g., for noise-free data, it is consistent if, for any data distribution $p(x)$,
$$\mathrm{MSE}(f_n) = \int_x p(x)\,\lvert f_n(x) - y(x) \rvert^2\,dx \;\to\; 0 \quad \text{as } n \to \infty.$$
Linear regression is not consistent! (Representation bias.) 1-NN is consistent. What about noisy data? What about variance?
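As promised above, a minimal sketch of the univariate 1-nearest-neighbour predictor, assuming NumPy; the four-point dataset is illustrative.

```python
# Minimal univariate 1-NN prediction: return the output of the closest stored point.
import numpy as np

def one_nn_predict(x_query, xs, ys):
    """Return y_nn where nn = argmin_i |x_i - x_query|."""
    nn = np.argmin(np.abs(xs - x_query))
    return ys[nn]

xs = np.array([1.0, 2.0, 3.0, 5.0])
ys = np.array([2.1, 2.9, 3.8, 5.2])
print(one_nn_predict(2.4, xs, ys))   # closest point is x = 2.0, so predict 2.9
```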

1-NN overfits?

k-Nearest Neighbor
Instance-based learning, four things to specify:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): unused
4. How to fit with the local points? Return the average output: predict $\hat{y} = \frac{1}{k}\sum y_i$, summing over the k nearest neighbors.
[Figure: nearest-neighbor regions in input space for k = 1 and k = 9.]
What can we do about the discontinuities?

Weighted distance metrics
Suppose the input vectors $x_1, x_2, \dots, x_N$ are two-dimensional: $x_1 = (x_{11}, x_{12})$, $x_2 = (x_{21}, x_{22})$, ..., $x_N = (x_{N1}, x_{N2})$. Compare
$$\mathrm{Dist}(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2$$
with
$$\mathrm{Dist}(x_i, x_j) = (x_{i1} - x_{j1})^2 + (3x_{i2} - 3x_{j2})^2.$$
Which is better? The relative scaling of the distance metric affects region shapes.

Weighted Euclidean distance metric
$$D(x, x') = \sqrt{\sum_i \sigma_i^2 (x_i - x'_i)^2},$$
or equivalently
$$D(x, x') = \sqrt{(x - x')^T \Sigma (x - x')}, \quad \text{where } \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_D^2).$$
Other metrics: Mahalanobis, rank-based, correlation-based, ... (A k-NN regression sketch using such a weighted distance follows at the end of this section.)

Kernel Regression
Instance-based learning:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? All of them.
3. A weighting function: $w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$. Nearby points to the query are weighted strongly, far points weakly. The $K_w$ parameter is the kernel width; it is very important.
4. How to fit with the local points? Predict the weighted average of the outputs: $\text{predict} = \sum_i w_i y_i \,/\, \sum_i w_i$.
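As noted above, here is a minimal sketch combining two ideas from this section: k-NN regression (average of the k nearest outputs) under a weighted Euclidean distance with per-feature scales $\sigma_i$. NumPy is assumed; the data and scale values are illustrative.

```python
# Minimal k-NN regression with a weighted Euclidean distance metric.
import numpy as np

def weighted_euclidean(a, b, scales):
    """D(x, x') = sqrt(sum_i sigma_i^2 (x_i - x'_i)^2)."""
    return np.sqrt(np.sum((scales * (a - b)) ** 2))

def knn_regress(query, X, y, k=3, scales=None):
    scales = np.ones(X.shape[1]) if scales is None else scales
    dists = np.array([weighted_euclidean(query, x, scales) for x in X])
    nearest = np.argsort(dists)[:k]
    return y[nearest].mean()                  # (1/k) * sum of the k nearest outputs

X = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.2], [3.0, 0.3]])
y = np.array([0.0, 1.0, 2.0, 3.0])
# Up-weighting the first feature (sigma_1 = 3) changes which points count as "near".
print(knn_regress(np.array([1.5, 0.15]), X, y, k=2, scales=np.array([3.0, 1.0])))
```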

Many possible weighting functions
(Our examples use a Gaussian.)

Kernel regression predictions
$w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$. Typically: choose $D$ manually and optimize $K_w$ using gradient descent.
[Figure: kernel regression predictions for $K_w$ = 10, 20, and 80.]
Increasing the kernel width $K_w$ means further-away points get an opportunity to influence you. As $K_w \to \infty$, the prediction tends to the global average. (A short kernel-regression sketch follows at the end of this section.)

Kernel regression on our test cases
[Figure: nearest neighbor with k = 9 versus kernel regression with $K_w$ = 1/32, 1/32, and 1/16 of the x-axis width on the three test datasets.]
Choosing a good $K_w$ is important! Remind you of anything we have seen?

Kernel regression: problem solved?
[Figure: kernel regression with the best $K_w$ on each test case.]
Where are we having problems? Sometimes in the middle; generally, on the ends (extrapolation is hard!). Time to try something more powerful!

Locally weighted regression
Kernel regression: take a very, very conservative function approximator called AVERAGING and locally weight it.
Locally weighted regression: take a conservative function approximator called LINEAR REGRESSION and locally weight it.

Locally weighted regression
Instance-based learning, four things to specify:
1. A distance metric: any
2. How many nearby neighbors to look at? All of them.
3. A weighting function (optional): kernels, $w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$
4. How to fit with the local points? General weighted regression:
$$\hat{\beta} = \arg\min_{\beta} \sum_{k=1}^{N} w_k^2 \left(y_k - \beta^T x_k\right)^2.$$
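A minimal kernel-regression sketch following the recipe above: Gaussian-style weights $w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$ over all points, and a prediction that is the weighted average of the outputs. NumPy is assumed; the data and the $K_w$ values are illustrative.

```python
# Minimal kernel regression: weighted average of all outputs.
import numpy as np

def kernel_regress(query, X, y, Kw=0.5):
    d2 = np.sum((X - query) ** 2, axis=1)        # squared distances D(x_i, query)^2
    w = np.exp(-d2 / Kw ** 2)                    # nearby points weighted strongly, far points weakly
    return np.sum(w * y) / np.sum(w)             # predict = sum_i w_i y_i / sum_i w_i

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=50)
print(kernel_regress(np.array([0.25]), X, y, Kw=0.1))    # close to sin(pi/2) = 1
# As Kw grows, the weights flatten and the prediction tends to the global average:
print(kernel_regress(np.array([0.25]), X, y, Kw=100.0), y.mean())
```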

How LWR works
Solving a weighted regression for each query yields a complex surface. (A per-query LWR sketch follows at the end of this section.)
Linear regression uses the same parameters for all queries:
$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$
Locally weighted regression solves a weighted linear regression for each query:
$$\hat{\beta} = \left((WX)^T WX\right)^{-1} (WX)^T WY, \quad \text{where } W = \mathrm{diag}(w_1, w_2, \dots, w_n).$$

LWR on our test cases
[Figure: kernel regression ($K_w$ = 1/32, 1/32, 1/16 of the x-axis width) versus LWR ($K_w$ = 1/16, 1/32, 1/8 of the x-axis width) on the three test datasets.]

Locally weighted polynomial regression
[Figure: kernel regression with the kernel width $K_w$ at its optimal level: $K_w$ = 1/100, 1/40, and 1/15 of the x-axis.]
Local quadratic regression is easy: just add quadratic terms to the $(WX)^T WX$ matrix. As the regression degree increases, the kernel width can increase without introducing bias.

Challenges for instance-based learning
Must store and retrieve all data! Most real work is done during testing: for every test sample we must search through the entire dataset, which is very slow. (But there are fast methods for dealing with large datasets.)
Instance-based learning is often poor with noisy or irrelevant features. In high-dimensional spaces, all points will be very far from each other; we can need a number of examples that is exponential in the dimension of X. But sometimes you are OK if you are clever about features.

Curse of the irrelevant feature
[Figure: a two-dimensional example ($X_1$ versus $X_2$) where an irrelevant feature breaks nearest neighbor.]
This is a contrived example, but similar problems are common in practice. Need some form of feature selection!
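As mentioned above, here is a minimal locally weighted regression sketch matching the per-query solve $\hat{\beta} = ((WX)^T WX)^{-1} (WX)^T WY$ with $W = \mathrm{diag}(w_1, \dots, w_n)$. NumPy is assumed; the toy data, the kernel width, and the added intercept column are illustrative choices.

```python
# Minimal locally weighted regression: solve a weighted least-squares problem per query.
import numpy as np

def lwr_predict(query, X, y, Kw=0.3):
    Xa = np.hstack([np.ones((len(X), 1)), X])              # add an intercept column
    qa = np.concatenate([[1.0], query])
    d2 = np.sum((X - query) ** 2, axis=1)
    W = np.diag(np.exp(-d2 / Kw ** 2))                     # W = diag(w_1, ..., w_n)
    WX, WY = W @ Xa, W @ y
    beta = np.linalg.solve(WX.T @ WX, WX.T @ WY)           # ((WX)^T WX)^{-1} (WX)^T WY
    return qa @ beta                                       # evaluate the local line at the query

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=100)
print(lwr_predict(np.array([0.25]), X, y, Kw=0.2))         # close to 1.0
```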

Curse of Dimensionality
[Figure: the fraction of the volume of a sphere lying in the range $r \in [1 - \epsilon, 1]$, which grows rapidly with the dimension; Gaussian densities in higher dimensions.]
(A small numeric illustration of this shell effect appears at the end of these notes.)

Side-Stepping the Curse
Dimensionality reduction (e.g., PCA), then use NN.

In Summary: Instance-Based Learning
kNN: the simplest learning algorithm. With sufficient data, this strawman approach is very hard to beat. Picking k?
Kernel regression: set k to n (the number of data points) and optimize the weights by gradient descent. Smoother than kNN.
Locally weighted regression: generalizes kernel regression; not just a local average.
Curse of dimensionality: must remember the (very large) dataset for prediction, and irrelevant features are often killers for instance-based approaches.

Acknowledgment
This lecture contains some material from Andrew Moore's excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials
Carlos Guestrin 2005-2009
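A small numeric check of the shell effect behind the curse-of-dimensionality figure above: the fraction of a unit sphere's volume lying in the outer shell $r \in [1 - \epsilon, 1]$ is $1 - (1 - \epsilon)^D$, which approaches 1 as the dimension $D$ grows. Plain Python; $\epsilon = 0.05$ is an illustrative choice.

```python
# Fraction of a unit sphere's volume in the thin outer shell r in [1 - eps, 1].
# Volume scales as r**D, so the shell fraction is 1 - (1 - eps)**D.
eps = 0.05
for D in (1, 2, 10, 50, 200):
    frac = 1 - (1 - eps) ** D
    print(f"D = {D:4d}: fraction of volume in the outer 5% shell = {frac:.3f}")
```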