Introduction to Machine Learning

Introduction to Machine Learning, 3. Instance-Based Learning. Alex Smola, Carnegie Mellon University (10-701). http://alex.smola.org/teaching/cmu2013-10-701

Outline. Parzen windows: kernels, algorithm. Model selection: cross-validation, leave-one-out, bias-variance. Watson-Nadaraya estimator: classification, regression, novelty detection. Nearest-neighbor estimator: limit case of Parzen windows.

Parzen Windows (Emanuel Parzen)

Density Estimation. Observe some data x_i and want to estimate p(x). Uses: finding unusual observations (e.g. security), finding typical observations (e.g. prototypes), and building a classifier via Bayes rule,
$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}.$
We need a tool for computing p(x) easily.

Bin Counting. Discrete random variables, e.g. language (English, Chinese, German, French, ...) and gender (male, female). Bin counting (record the number of occurrences) over 25 observations:
         English  Chinese  German  French  Spanish
male        5        2        3       1       0
female      6        3        2       2       1

Bin Counting. The same table, normalized to relative frequencies (each count divided by the 25 observations):
         English  Chinese  German  French  Spanish
male      0.20     0.08     0.12    0.04    0.00
female    0.24     0.12     0.08    0.08    0.04

Bin Counting. Problem: not enough data. With only 25 observations most of these relative frequencies rest on a handful of samples (the Spanish/male cell is empty), so the estimates are unreliable.
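As a tiny Octave sketch (numbers copied from the table above, variable names made up), the joint probability estimates are just the normalized counts:
% rows = gender (male, female), columns = language (English, Chinese, German, French, Spanish)
counts = [5 2 3 1 0;
          6 3 2 2 1];
phat = counts / sum(counts(:));   % divide by the 25 observations to get the relative frequencies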

Curse of dimensionality (lite). Discrete random variables: language (English, Chinese, German, French, ...), gender, ZIP code, day of the week, operating system, ... Continuous random variables: income, bandwidth, time. The number of bins grows exponentially with the number of variables, and we need many bins per dimension.

Density Estimation. (Figures: a sample histogram and the underlying density.) A continuous domain means an infinite number of bins. Curse of dimensionality: 10 bins on [0, 1] is probably good, but 10^10 bins on [0, 1]^10 require a very accurate estimate, since the probability mass per cell also shrinks by a factor of 10^10.

Bin Counting

Bin Counting. Can't we just go and smooth this out?

What is happening? Hoeffding's theorem: for any average of iid random variables with values in [0, 1],
$\Pr\left( \left| \mathbf{E}[x] - \frac{1}{m} \sum_{i=1}^{m} x_i \right| > \epsilon \right) \le 2 e^{-2 m \epsilon^2}.$
Bin counting: the random variables x_i are the events falling into a bin. Apply Hoeffding's theorem to each bin, then take the union bound over all bins to guarantee that all estimates converge.

Density Estimation. Applying the union bound and Hoeffding's theorem (note that the bins are not independent, hence the union bound):
$\Pr\left( \sup_{a \in A} |\hat p(a) - p(a)| \ge \epsilon \right) \le \sum_{a \in A} \Pr\left( |\hat p(a) - p(a)| \ge \epsilon \right) \le 2 |A| \exp(-2 m \epsilon^2).$
Solving for the error probability $\delta$:
$2 |A| \exp(-2 m \epsilon^2) \le \delta \implies \epsilon \le \sqrt{\frac{\log 2|A| - \log \delta}{2m}}.$
Good news, but not good enough.
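To get a feel for the bound, a quick Octave sketch (the values of |A|, m and delta below are made up for illustration):
A = 10; m = 1000; delta = 0.05;   % number of bins, sample size, failure probability
eps_bound = sqrt((log(2*A) - log(delta)) / (2*m));
% with probability at least 1 - delta, every bin frequency is within eps_bound of its true value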

Parzen Windows. Naive approach: use the empirical density (delta distributions),
$p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x).$
This breaks if we see slightly different instances. Kernel density estimate: smear out the empirical density with a nonnegative smoothing kernel $k_x(x')$ satisfying $\int_{\mathcal{X}} k_x(x')\, dx' = 1$ for all $x$.

Parzen Windows. Density estimate: replace the empirical density $p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x)$ by
$\hat p(x) = \frac{1}{m} \sum_{i=1}^{m} k_{x_i}(x).$
Smoothing kernels (figures): Gauss $(2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2} x^2}$, Laplace $\frac{1}{2} e^{-|x|}$, Epanechnikov $\frac{3}{4} \max(0, 1 - x^2)$, Uniform $\frac{1}{2} \mathbf{1}_{[-1,1]}(x)$.
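As a minimal illustration (not from the original slides; data and variable names are made up), the four kernels and the resulting Parzen estimate at a single query point in Octave:
% the four smoothing kernels as functions of u = x - x_i
kgauss   = @(u) (2*pi)^(-1/2) * exp(-0.5 * u.^2);
klaplace = @(u) 0.5 * exp(-abs(u));
kepan    = @(u) 0.75 * max(0, 1 - u.^2);
kunif    = @(u) 0.5 * (abs(u) <= 1);
X  = [62 64 67 70 71 73 75 78 81 90];    % toy 1-D sample
xq = 72;                                  % query point
phat = mean(kepan(xq - X));               % \hat p(xq) = (1/m) sum_i k(xq - x_i), here with the Epanechnikov kernel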

Smoothing

Smoothing. Gaussian Parzen window estimate in Octave (assuming X is the d-by-m matrix of training points, x the d-by-1 query, and the kernel has unit bandwidth):
dist = norm(X - x * ones(1,m), 'columns');           % distances from x to all m training points
p = (1/m) * ((2 * pi)**(-d/2)) * sum(exp(-0.5 * dist.**2));

Size matters. (Figures: kernel density estimates of the same sample with bandwidths 0.3, 1, 3 and 10, alongside the sample histogram and the underlying density.)

Size matters; shape matters mostly in theory. Kernel width: too narrow overfits, too wide smoothes everything out towards a constant distribution. How to choose? Scale a base kernel h by a bandwidth r,
$k_{x_i}(x) = r^{-d}\, h\!\left(\frac{x - x_i}{r}\right).$
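A small Octave sketch of the bandwidth-scaled estimate in one dimension, evaluated on a grid for the four bandwidths shown above (toy data, assumed names):
X  = [62 64 67 70 71 73 75 78 81 90];          % toy sample
xs = linspace(40, 110, 200);                    % evaluation grid
m  = numel(X); d = 1;
h  = @(u) (2*pi)^(-d/2) * exp(-0.5 * u.^2);     % Gaussian base kernel
for r = [0.3 1 3 10]
  phat = zeros(size(xs));
  for i = 1:m
    phat = phat + r^(-d) * h((xs - X(i)) / r);  % k_{x_i}(x) = r^{-d} h((x - x_i)/r)
  end
  phat = phat / m;
  plot(xs, phat); hold on;                      % overlay the estimates for all four bandwidths
end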

Model Selection

Maximum Likelihood. We need to measure how well we do. For density estimation we care about
$\Pr\{X\} = \prod_{i=1}^{m} p(x_i).$
Finding the density that maximizes Pr{X} will peak at all data points, since x_i explains x_i best... the maxima are delta functions on the data. Overfitting!

Overfitting. (Figure.) The likelihood on the training set is much higher than the typical one.

Underfitting. (Figure.) The likelihood on the training set is very similar to the typical one. Too simple.

Model Selection. Validation: use some of the data to estimate the density, use the other part to evaluate how well it works, and pick the parameter that works best,
$L(X' \mid X) := \frac{1}{n'} \sum_{i=1}^{n'} \log \hat p(x'_i).$
This is easy, but wasteful. Learning theory: use the data to build the model, then measure its complexity and use that to bound
$\frac{1}{n} \sum_{i=1}^{n} \log \hat p(x_i) - \mathbf{E}_x\left[ \log \hat p(x) \right].$
This is difficult.

Model Selection. Leave-one-out cross-validation: use almost all the data to estimate the density and a single instance to estimate how well it works,
$\log p(x_i \mid X \setminus x_i) = \log \frac{1}{n-1} \sum_{j \ne i} k(x_j, x_i).$
This has huge variance, so average over the estimates for all training data and pick the parameter that works best. Simple implementation:
$\frac{1}{n} \sum_{i=1}^{n} \log\left[ \frac{n}{n-1}\, p(x_i) - \frac{1}{n-1}\, k(x_i, x_i) \right] \quad \text{where} \quad p(x) = \frac{1}{n} \sum_{i=1}^{n} k(x_i, x).$
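A minimal Octave sketch of this leave-one-out score for a 1-D Gaussian Parzen window, scanning a few candidate bandwidths (toy data; names and the candidate grid are assumptions):
X = [62 64 67 70 71 73 75 78 81 90];   % toy 1-D sample
n = numel(X);
best_r = NaN; best_score = -Inf;
for r = [0.3 1 3 10]
  k = @(a, b) r^(-1) * (2*pi)^(-1/2) * exp(-0.5 * ((a - b) / r).^2);
  score = 0;
  for i = 1:n
    p_i   = mean(k(X, X(i)));                              % p(x_i) using all n points
    loo   = (n/(n-1)) * p_i - (1/(n-1)) * k(X(i), X(i));   % leave x_i out
    score = score + log(loo) / n;
  end
  if score > best_score
    best_score = score; best_r = r;
  end
end
% best_r now holds the bandwidth with the highest leave-one-out log-likelihood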

Leave-one-out estimate

Optimal estimate

Model Selection. k-fold cross-validation: partition the data into k blocks (typically k = 10); use all but one block to compute the estimate and the remaining block as the validation set; average over all validation estimates,
$\frac{1}{k} \sum_{i=1}^{k} l\big(p(x_i \mid X \setminus X_i)\big).$
Almost unbiased (e.g. via Luntz and Brailovsky, 1969); the error estimate is that of a training set of size (k-1)/k times n. Pick the best parameter (why must we not check too many?)
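A compact Octave sketch of k-fold cross-validation over the bandwidth of a 1-D Gaussian Parzen window (toy data; block assignment, candidate grid and names are assumptions):
X = [62 64 67 70 71 73 75 78 81 90 55 85];   % toy sample
n = numel(X); k = 4;
idx = mod(0:n-1, k) + 1;                      % assign each point to one of k blocks
rs = [0.3 1 3 10]; scores = zeros(size(rs));
for j = 1:numel(rs)
  r = rs(j);
  kern = @(a, b) r^(-1) * (2*pi)^(-1/2) * exp(-0.5 * ((a - b) / r).^2);
  s = 0;
  for b = 1:k
    train = X(idx ~= b); val = X(idx == b);
    for x = val
      s = s + log(mean(kern(train, x)));      % held-out log-likelihood
    end
  end
  scores(j) = s / n;
end
[best_score, jbest] = max(scores);            % pick the bandwidth rs(jbest) with the best score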

Watson-Nadaraya Estimator (Geoffrey Watson)

From density estimation to classification. Binary classification: estimate $p(x \mid y = 1)$ and $p(x \mid y = -1)$ and use Bayes rule,
$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{\frac{1}{m_y} \sum_{y_i = y} k(x_i, x) \cdot \frac{m_y}{m}}{\frac{1}{m} \sum_{i} k(x_i, x)}.$
Decision boundary:
$p(y = 1 \mid x) - p(y = -1 \mid x) = \frac{\sum_j y_j k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j\, \frac{k(x_j, x)}{\sum_i k(x_i, x)},$
i.e. a vote with local weights.

Watson-Nadaraya Classifier

Watson-Nadaraya Classifier. In Octave (assuming X is the d-by-m matrix of training points, y the vector of +/-1 labels, and x the query; the sign of f gives the predicted class, and normalizing by the summed kernel weights would not change that sign):
dist = norm(X - x * ones(1,m), 'columns');    % distances from x to all m training points
f = sum(y .* exp(-0.5 * dist.**2));           % kernel-weighted vote of the labels

Watson-Nadaraya Regression. Binary classification used
$p(y = 1 \mid x) - p(y = -1 \mid x) = \sum_j y_j\, \frac{k(x_j, x)}{\sum_i k(x_i, x)}.$
Regression: use the same weighted expansion of the labels with local weights,
$\hat y(x) = \sum_j y_j\, \frac{k(x_j, x)}{\sum_i k(x_i, x)}.$
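A minimal Octave sketch of the Watson-Nadaraya regression estimate with a Gaussian kernel (toy data and names are made up):
Xtr = [1 2 3 4 5 6 7 8];                   % 1-D training inputs
ytr = [1.0 1.3 2.1 2.0 3.2 3.1 4.5 4.4];   % training targets
r   = 1.0;                                  % bandwidth
xq  = 4.5;                                  % query point
w    = exp(-0.5 * ((xq - Xtr) / r).^2);     % kernel weights k(x_j, xq)
yhat = sum(ytr .* w) / sum(w);              % locally weighted average of the labels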

Watson-Nadaraya regression estimate

Silverman's Rule (Bernard Silverman)

Silverman's rule. Chicken-and-egg problem: we want a wide kernel in low-density regions and a narrow kernel where we have much data, but we need a density estimate to estimate the density. Simple hack: use the average distance to the k nearest neighbors,
$r_i = \frac{r}{k} \sum_{x \in \mathrm{NN}(x_i, k)} \| x_i - x \|,$
which gives a nonuniform bandwidth for the smoother.
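A sketch of this adaptive-bandwidth rule in Octave (toy 1-D data; the values of r and k and all names are assumptions):
X = [62 64 67 70 71 73 75 78 81 90];     % toy sample
n = numel(X); k = 3; r = 0.5;            % neighborhood size k and scale factor r
ri = zeros(1, n);
for i = 1:n
  d = sort(abs(X - X(i)));               % sorted distances from x_i; d(1) = 0 is x_i itself
  ri(i) = (r / k) * sum(d(2:k+1));       % average distance to the k nearest neighbors, scaled by r
end
xq = 72;                                 % evaluate the adaptive estimate at a query point
phat = mean((1 ./ ri) .* exp(-0.5 * ((xq - X) ./ ri).^2) / sqrt(2*pi));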

Density (figure: true density)

Non-adaptive estimate

adaptive estimate

distance distribution

Nearest Neighbor

Nearest Neighbors. Table lookup: for a previously seen instance, remember its label. Nearest neighbor: pick the label of the most similar neighbor. Slight improvement: use the k nearest neighbors; for regression, average. A really useful baseline! Easy to implement for small amounts of data.
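A minimal Octave sketch of a k-nearest-neighbor classifier on toy 1-D data (all data and names are made up):
Xtr = [1.0 1.5 2.0 2.2 6.0 6.5 7.0 7.2];   % training inputs
ytr = [ -1  -1  -1  -1   1   1   1   1];   % +/-1 labels
k   = 3;
xq  = 5.0;                                  % query point
[d, idx] = sort(abs(Xtr - xq));             % sort training points by distance to the query
yhat = sign(sum(ytr(idx(1:k))));            % majority vote among the k nearest neighbors
% for regression, replace the vote by an average: mean(ytr(idx(1:k)))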

Relation to Watson-Nadaraya. Watson-Nadaraya estimator:
$\hat y(x) = \sum_j y_j\, \frac{k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j\, w_j(x).$
Nearest-neighbor estimator: the same weighted form $\hat y(x) = \sum_j y_j\, w_j(x)$, where the neighborhood function is a hard threshold (equal weight on the k nearest neighbors, zero elsewhere).

1-Nearest Neighbor

4-Nearest Neighbors

4-Nearest Neighbors (sign)

If we get more data. 1-nearest neighbor: converges to the perfect solution if the classes are separated; for noisy problems it converges to twice the minimal error rate, 2p(1-p). k-nearest neighbor: converges to the perfect solution if the classes are separated (but needs more data); for noisy problems it converges to the minimal error min(p, 1-p) (use an increasing k).

1-Nearest Neighbor. For a given point x take an ε-neighborhood N with probability mass > d/n. The probability that at least one of the n points falls in this neighborhood is at least 1 - e^{-d}, so we can make the chance of missing it small. Assume that the probability mass does not change much within the neighborhood. Then the probability that the labels of the query and its nearest neighbor do not match is 2p(1-p) (up to some approximation error in the neighborhood).
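For the last step (added here for completeness): if both the query's label and the neighbor's label are drawn with $p(y = 1 \mid x) \approx p$ inside the neighborhood, then
$\Pr(\text{mismatch}) = \Pr(y_q = 1)\Pr(y_{\mathrm{NN}} = -1) + \Pr(y_q = -1)\Pr(y_{\mathrm{NN}} = 1) \approx p(1-p) + (1-p)p = 2p(1-p).$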

k-Nearest Neighbor. For a given point x take an ε-neighborhood N with probability mass > dk/n. There is only a small probability that we do not have at least k points in the neighborhood. Assume that the probability mass does not change much within the neighborhood. Bound the probability that the majority label among the k points does not match the majority under p (e.g. via Hoeffding's theorem for the tail) and show that it vanishes. The error is therefore min(p, 1-p), i.e. the Bayes-optimal error.

Fast lookup. KD-trees (Moore et al.): partition the space one dimension at a time and only search the subsets that can contain the query point. Cover trees (Beygelzimer et al.): hierarchically partition the space with distance guarantees; no need for non-overlapping sets; bounded number of paths to follow (logarithmic-time lookup).

Summary. Parzen windows: kernels, algorithm. Model selection: cross-validation, leave-one-out, bias-variance. Watson-Nadaraya estimator: classification, regression, novelty detection. Nearest-neighbor estimator: limit case of Parzen windows.

Further Reading.
Cover tree homepage (paper & code): http://hunch.net/~jl/projects/cover_tree/cover_tree.html
kd-trees, original paper: http://doi.acm.org/10.1145/361002.361007
Andrew Moore's kd-tree tutorial (from his PhD thesis): http://www.autonlab.org/autonweb/14665/version/2/part/5/data/moore-tutorial.pdf
Nadaraya's regression estimator (1964): http://dx.doi.org/10.1137/1109020
Watson's regression estimator (1964): http://www.jstor.org/stable/25049340
Watson-Nadaraya regression package in R (np): http://cran.r-project.org/web/packages/np/index.html
Stone's k-NN regression consistency proof: http://projecteuclid.org/euclid.aos/1176343886
Cover and Hart's k-NN classification consistency proof: http://www-isl.stanford.edu/people/cover/papers/transit/0021cove.pdf
Tom Cover's rate analysis for k-NN: Rates of Convergence for Nearest Neighbor Procedures
Sanjoy Dasgupta's analysis of k-NN estimation with selective sampling: http://cseweb.ucsd.edu/~dasgupta/papers/nnactive.pdf
Multiedit & Condense (Dasarathy, Sanchez, Townsend): http://cgm.cs.mcgill.ca/~godfried/teaching/pr-notes/dasarathy.pdf
Geometric approximation via core sets: http://valis.cs.uiuc.edu/~sariel/papers/04/survey/survey.pdf