Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016


Outline. Non-parametric approach. Unsupervised: non-parametric density estimation (Parzen windows, k_n-nearest-neighbor density estimation). Supervised: instance-based learners. Classification: kNN classification, weighted (or kernel) kNN. Regression: kNN regression, locally weighted linear regression.

Introduction. Estimation of arbitrary density functions: parametric density functions often cannot fit the densities we encounter in practical problems (e.g., many common parametric densities are unimodal). Non-parametric methods do not assume that the form of the underlying densities is known in advance. Non-parametric methods (for classification) can be categorized into: Generative, which estimate $p(x \mid C_i)$ from $D_i$ using non-parametric density estimation; and Discriminative, which estimate $p(C_i \mid x)$ from $D$.

Parametric vs. nonparametric methods. Parametric methods find parameters from data and then use the inferred parameters to make decisions on new data points; learning means finding those parameters. Nonparametric methods use the training examples explicitly, and a separate training phase is not required. Both supervised and unsupervised learning methods can be categorized into parametric and non-parametric methods.

Histogram approximation idea. Histogram approximation of an unknown pdf: $P(b_l) \approx k_n(b_l)/n$ for $l = 1, \dots, L$, where $k_n(b_l)$ is the number of the $n$ samples that lie in bin $b_l$. The corresponding estimated pdf is $\hat{p}(x) = \frac{P(b_l)}{h}$ for $|x - \bar{x}_{b_l}| \le \frac{h}{2}$, where $\bar{x}_{b_l}$ is the mid-point of bin $b_l$ and $h$ is the bin width.
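A minimal NumPy sketch of the histogram estimate above (not from the slides): bins of width h aligned at the origin are an arbitrary choice, and the helper name is hypothetical.

```python
import numpy as np

def histogram_pdf(samples, x, h):
    """Histogram estimate of p(x): count samples in the bin containing x,
    then divide by (n * bin width). Bins are [l*h, (l+1)*h), aligned at 0."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    l = np.floor(x / h)                      # index of the bin containing x
    in_bin = (np.floor(samples / h) == l)    # which samples fall in that bin
    return in_bin.sum() / (n * h)            # P(b_l) / h

# toy usage: 1-D samples, bin width 0.5
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)
print(histogram_pdf(data, x=2.1, h=0.5))
```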

Non-parametric density estimation. Probability of falling in a region $R$: $P = \int_R p(x)\,dx$ (a smoothed version of $p(x)$). Let $D = \{x^{(i)}\}_{i=1}^n$ be a set of samples drawn i.i.d. according to $p(x)$. The probability that exactly $k$ of the $n$ samples fall in $R$ is $P_k = \binom{n}{k} P^k (1-P)^{n-k}$, with $E[k] = nP$. This binomial distribution peaks sharply about its mean, so $k \approx nP$ and $k/n$ can be used as an estimate of $P$; the estimate becomes more accurate as $n$ grows.

Non-parametric density estimation. We can estimate the smoothed $p(x)$ by estimating $P$. Assumptions: $p(x)$ is continuous, and the region $R$ enclosing $x$ is so small that $p$ is nearly constant inside it: $P = \int_R p(x')\,dx' \approx p(x) V$, where $V = \mathrm{Vol}(R)$ and $x \in R$. Hence $p(x) \approx \frac{P}{V} \approx \frac{k/n}{V}$. Let $V$ approach zero if we want to recover $p(x)$ itself instead of its averaged version.

Necessary conditions for convergence. $p_n(x)$ is the estimate of $p(x)$ using $n$ samples, where $V_n$ is the volume of the region around $x$ and $k_n$ is the number of samples falling in that region: $p_n(x) = \frac{k_n/n}{V_n}$. Necessary conditions for convergence of $p_n(x)$ to $p(x)$: $\lim_{n\to\infty} V_n = 0$, $\lim_{n\to\infty} k_n = \infty$, and $\lim_{n\to\infty} k_n/n = 0$.

Non-parametric density estimation: main approaches. Two approaches to satisfying these conditions: the k-nearest-neighbor density estimator fixes $k$ and determines the value of $V$ from the data (the volume grows until it contains the $k$ neighbors of $x$); the kernel density estimator (Parzen window) fixes $V$ and determines $k$ from the data (the number of points falling inside the volume can vary from point to point).

Parzen window. Extension of the histogram idea: hyper-cubes with side length $h$ (i.e., volume $h^d$) are centered on the samples. Hypercube as a simple window function: $\varphi(u) = 1$ if $|u_1| \le \frac{1}{2}, \dots, |u_d| \le \frac{1}{2}$, and $0$ otherwise. Then $p_n(x) = \frac{k_n/n}{V_n} = \frac{1}{n V_n} \sum_{i=1}^{n} \varphi\!\left(\frac{x - x^{(i)}}{h_n}\right)$, where $V_n = h_n^d$ and $k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x^{(i)}}{h_n}\right)$ is the number of samples in the hypercube around $x$.
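A short sketch of the hypercube Parzen estimator above, assuming the data is stored as an (n, d) array; the function name and the toy data are illustrative only.

```python
import numpy as np

def parzen_hypercube(x, samples, h):
    """Parzen-window density estimate with a hypercube window of side h:
    p(x) = (1 / (n * h^d)) * sum_i phi((x - x_i) / h)."""
    X = np.atleast_2d(samples)                   # shape (n, d)
    n, d = X.shape
    u = (x - X) / h                              # scaled offsets to each sample
    inside = np.all(np.abs(u) <= 0.5, axis=1)    # phi(u) = 1 inside the unit cube
    return inside.sum() / (n * h ** d)

# toy usage in 2-D
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
print(parzen_hypercube(np.array([0.0, 0.0]), data, h=0.5))
```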

Window function. Necessary conditions on the window function for the estimate to be a legitimate density function: $\varphi(x) \ge 0$ and $\int \varphi(x)\,dx = 1$. Windows are also called kernels or potential functions.

Density estimation: non-parametric. With a Gaussian window, $p_n(x) = \frac{1}{n} \sum_{i=1}^{n} N(x \mid x^{(i)}, h^2) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,h} e^{-\frac{(x - x^{(i)})^2}{2h^2}}$. Example 1-D samples: 1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5. Writing $\sigma = h$: $p(x) = \frac{1}{n} \sum_{i=1}^{n} N(x \mid x^{(i)}, \sigma^2) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - x^{(i)})^2}{2\sigma^2}}$. The choice of $\sigma$ is crucial.
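A possible implementation of the Gaussian Parzen estimator above, using the 1-D samples listed on the slide; the evaluation point and the bandwidth values in the loop are just for illustration.

```python
import numpy as np

def gaussian_kde_1d(x, samples, sigma):
    """Gaussian Parzen-window estimate in 1-D:
    p(x) = (1/n) * sum_i N(x | x_i, sigma^2)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    k = np.exp(-(x - samples) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return k.sum() / n

# the 1-D samples from the slide; sigma = h controls smoothness
data = [1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5]
for sigma in (0.02, 0.1, 0.5, 1.5):
    print(sigma, gaussian_kde_1d(1.5, data, sigma))
```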

Density estimation: non-parametric. [Figure: Gaussian kernel density estimates of the samples above for σ = 0.02, 0.1, 0.5, and 1.5.]

Window (or kernel) function: width parameter. $p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n^d}\, \varphi\!\left(\frac{x - x^{(i)}}{h_n}\right)$. Choosing $h_n$: too large gives low resolution; too small gives high variability. [Duda, Hart, and Stork] For unlimited $n$, by letting $V_n$ slowly approach zero as $n$ increases, $p_n(x)$ converges to $p(x)$.

Width parameter. For fixed $n$, a smaller $h$ results in higher variance while a larger $h$ leads to higher bias. For a fixed $h$, the variance decreases as the number of sample points $n$ tends to infinity; with a large enough number of samples, the smaller $h$, the better the accuracy of the resulting estimate. In practice, where only a finite number of samples is available, a compromise between $h$ and $n$ must be made. $h$ can be set using techniques such as cross-validation when the density estimate is used for learning tasks such as classification.

Practical issues: curse of dimensionality. A large $n$ is necessary to obtain an acceptable density estimate in high-dimensional feature spaces: $n$ must grow exponentially with the dimensionality $d$. If $n$ equidistant points are required to densely fill a one-dimensional interval, $n^d$ points are needed to fill the corresponding $d$-dimensional hypercube. We need an exponentially large quantity of training data to ensure that the cells are not empty; the computational requirements also grow. [Figure: grids of cells for d = 1, 2, 3.]

$k_n$-nearest neighbor estimation. The cell volume is a function of the point location: to estimate $p(x)$, let the cell around $x$ grow until it captures $k_n$ samples, called the $k_n$ nearest neighbors of $x$; $k_n$ is a function of $n$. Two possibilities can occur: if the density near $x$ is high, the cell will be small, which provides good resolution; if the density near $x$ is low, the cell will grow large, stopping only when higher-density regions are reached.

$k_n$-nearest neighbor estimation. Necessary and sufficient conditions for convergence: $\lim_{n\to\infty} k_n = \infty$ and $\lim_{n\to\infty} k_n/n = 0$. A family of estimates is obtained by setting $k_n = k_1 \sqrt{n}$ and choosing different values for $k_1$: $p_n(x) = \frac{k_n/n}{V_n}$, where $V_n \approx \frac{1}{p(x)\sqrt{n}}$ for $k_1 = 1$. Note that $V_n$ is a function of $x$.
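A rough 1-D sketch of the k-NN density estimator: the interval around x grows until it contains k samples, and the slide's rule $k_n = k_1\sqrt{n}$ with $k_1 = 1$ is used to pick k. The function name and the reuse of the slide's sample list are illustrative assumptions.

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """k-NN density estimate in 1-D: grow an interval around x until it
    contains k samples, then p(x) ~ (k/n) / V with V the interval length."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    r = np.sort(np.abs(samples - x))[k - 1]    # distance to the k-th nearest sample
    V = 2 * r                                   # length of the interval [x - r, x + r]
    return (k / n) / V

data = [1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5]
k = max(1, int(np.sqrt(len(data))))             # k_n = k_1 * sqrt(n) with k_1 = 1
print(knn_density_1d(1.5, data, k))
```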

$k_n$-nearest neighbor estimation: example. The resulting estimate has discontinuities in its slopes. [Figure from Bishop]

Non-parametric density estimation: summary. Generality: with enough samples, convergence to an arbitrarily complicated target density can be obtained. However, the number of samples required to assure convergence is very large and grows exponentially with the dimensionality of the feature space. These methods are also very sensitive to the choice of window width or number of nearest neighbors. There may be severe requirements for computation time and storage (all training samples must be kept): the training phase simply requires storing the training set, but the computational cost of evaluating $p(x)$ grows linearly with the size of the data set.

Nonparametric learners. Memory-based or instance-based learners perform lazy learning: (almost) all the work is done at test time. Generic description: memorize the training set $(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})$; given a test point $x$, predict $\hat{y} = f(x; x^{(1)}, y^{(1)}, \dots, x^{(n)}, y^{(n)})$. $f$ is typically expressed in terms of the similarity of the test sample $x$ to the training samples $x^{(1)}, \dots, x^{(n)}$.

Parzen window & generative classification. If $\frac{\frac{1}{n_1 h^d}\sum_{x^{(i)} \in D_1} \varphi\!\left(\frac{x - x^{(i)}}{h}\right)}{\frac{1}{n_2 h^d}\sum_{x^{(i)} \in D_2} \varphi\!\left(\frac{x - x^{(i)}}{h}\right)} > \frac{P(C_2)}{P(C_1)}$, decide $C_1$; otherwise decide $C_2$. Here $n_j = |D_j|$ ($j = 1, 2$) is the number of training samples in class $C_j$, and $D_j$ is the set of training samples labeled $C_j$. For large $n$, this classifier has high time and memory requirements.
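A small sketch of the generative Parzen classifier above; it swaps the hypercube window for a Gaussian window (a common choice, not mandated by the slide), and the function names, priors, and toy data are assumptions.

```python
import numpy as np

def parzen_gaussian(x, X, h):
    """Class-conditional density p(x | C_j) with a Gaussian Parzen window."""
    X = np.atleast_2d(X)
    n, d = X.shape
    sq = np.sum((x - X) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h ** 2)) / ((2 * np.pi) ** (d / 2) * h ** d))

def classify(x, X1, X2, prior1, prior2, h=0.5):
    """Decide C1 if p(x|C1) P(C1) > p(x|C2) P(C2), otherwise C2."""
    return 1 if parzen_gaussian(x, X1, h) * prior1 > parzen_gaussian(x, X2, h) * prior2 else 2

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))   # class C1 samples
X2 = rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))   # class C2 samples
print(classify(np.array([0.5, 0.2]), X1, X2, prior1=0.5, prior2=0.5))
```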

Parzen window & generative classification: example. [Figure: decision regions for a smaller h (left) and a larger h (right), from Duda, Hart, and Stork.]

$k_n$-nearest neighbor estimation & generative classification. If $\frac{k_1/(n_1 V_1)}{k_2/(n_2 V_2)} > \frac{P(C_2)}{P(C_1)}$, decide $C_1$; otherwise decide $C_2$. Here $n_j = |D_j|$ ($j = 1, 2$) is the number of training samples in class $C_j$, and $D_j$ is the set of training samples labeled $C_j$. $V_j$ denotes the volume of the hypersphere of radius $r_j$ centered at $x$ that contains $k_j$ samples of class $C_j$ ($j = 1, 2$); $k$ need not be the same for all classes.

k-nearest-neighbor (kNN) rule. The kNN classifier uses the $k > 1$ nearest neighbors; the label for $x$ is predicted by majority voting among its $k$ nearest neighbors. [Figure: a 2-D example with $x = [x_1, x_2]$, $k = 5$, and labels +1 / -1.] What is the effect of $k$?

kNN classifier. Given: training data $\{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$, which is simply stored, and a test sample $x$. To classify $x$: find the $k$ nearest training samples to $x$; out of these $k$ samples, identify the number $k_j$ belonging to class $C_j$ ($j = 1, \dots, C$); assign $x$ to the class $C_{\hat{j}}$ where $\hat{j} = \arg\max_{j=1,\dots,C} k_j$. It can be considered a discriminative method.
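A minimal kNN classifier along the lines described above; Euclidean distance and the toy dataset are illustrative choices.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Plain kNN: majority vote among the k nearest training samples."""
    X_train = np.atleast_2d(X_train)
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)      # count labels among them
    return votes.most_common(1)[0][0]

# toy usage
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.4, 0.3]), X, y, k=3))   # expected: class 0
```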

Probabilistic perspective of kNN. kNN is a discriminative nonparametric classifier: it performs non-parametric estimation of $P(C_j \mid x)$ as $P(C_j \mid x) \approx \frac{k_j}{k}$, where $k_j$ is the number of training samples among the $k$ nearest neighbors of $x$ that are labeled $C_j$; labels are then assigned by the Bayes decision rule.

Nearest-neighbor classifier: example. Voronoi tessellation: each cell consists of all points closer to a given training point than to any other training point; all points in a cell are labeled with the category of the corresponding training point. [Figure from Duda, Hart, and Stork's book]

kNN classifier: effect of k. [Figure from Bishop]

Nearest neighbor classifier: error bound. Nearest-neighbor is kNN with $k = 1$; decision rule: $\hat{y} = y^{(NN(x))}$ where $NN(x) = \arg\min_{i=1,\dots,n} \|x - x^{(i)}\|$. Cover & Hart (1967): the asymptotic risk of the NN classifier satisfies $R^* \le R_{NN} \le 2R^*(1 - R^*) \le 2R^*$, where $R_n$ is the expected risk of the NN classifier with $n$ training examples drawn from $p(x, y)$, $R_{NN} = \lim_{n\to\infty} R_n$, and $R^*$ is the optimal Bayes risk.

kNN classifier: error bound. Devroye et al. (1996): the asymptotic risk of the kNN classifier, $R_{kNN} = \lim_{n\to\infty} R_n$, satisfies $R^* \le R_{kNN} \le R^* + \sqrt{\frac{2 R_{NN}}{k}}$, where $R^*$ is the optimal Bayes risk.

Instance-based learner. Main components when constructing an instance-based learner: a distance metric; the number of nearest neighbors of the test point to look at; a weighting function (optional); and how to compute the output from the neighbors.

Distance measures. Euclidean distance: $d(x, x') = \|x - x'\|_2^2 = (x_1 - x'_1)^2 + \dots + (x_d - x'_d)^2$ (sensitive to irrelevant features). Weighted Euclidean distance: $d_w(x, x') = w_1 (x_1 - x'_1)^2 + \dots + w_d (x_d - x'_d)^2$. Mahalanobis distance: $d_A(x, x') = (x - x')^T A (x - x')$; distance learning methods can be used for choosing the weights or $A$. Minkowski distance: $L_p(x, x') = \left(\sum_{i=1}^{d} |x_i - x'_i|^p\right)^{1/p}$. Other distances: Hamming, angle, etc.
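Sketches of three of the distance measures above, assuming a per-feature weight vector w and a positive semi-definite matrix A supplied by the user; all function names are hypothetical.

```python
import numpy as np

def euclidean_sq(x, z):
    """Squared Euclidean distance."""
    return float(np.sum((x - z) ** 2))

def weighted_euclidean_sq(x, z, w):
    """Weighted squared Euclidean distance with per-feature weights w."""
    return float(np.sum(w * (x - z) ** 2))

def mahalanobis_sq(x, z, A):
    """Squared Mahalanobis distance (x - z)^T A (x - z) for a PSD matrix A."""
    d = x - z
    return float(d @ A @ d)

x = np.array([1.0, 2.0]); z = np.array([2.0, 0.0])
print(euclidean_sq(x, z))
print(weighted_euclidean_sq(x, z, w=np.array([1.0, 3.0])))
print(mahalanobis_sq(x, z, A=np.array([[2.0, 0.0], [0.0, 1.0]])))
```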

Distance measure: example. $d(x, x') = (x_1 - x'_1)^2 + (x_2 - x'_2)^2$ vs. $d(x, x') = (x_1 - x'_1)^2 + 3(x_2 - x'_2)^2$: weighting the second feature more heavily changes which training points are nearest.

Weighted kNN classification. Weight nearer neighbors more heavily: $\hat{y} = f(x) = \arg\max_{c=1,\dots,C} \sum_{j \in N_k(x)} w_j(x)\, I(c = y^{(j)})$, where, for example, $w_j(x) = \frac{1}{\|x - x^{(j)}\|^2}$ is a possible weighting function. In weighted kNN we can also use all training examples instead of just $k$ (Shepard's method): $\hat{y} = f(x) = \arg\max_{c=1,\dots,C} \sum_{j=1}^{n} w_j(x)\, I(c = y^{(j)})$. Weights can be given by a kernel function $w_j(x) = K(x, x^{(j)})$, e.g., $K(x, x^{(j)}) = e^{-\frac{d(x, x^{(j)})}{\sigma^2}}$.
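A possible implementation of weighted kNN classification with the inverse-squared-distance weighting mentioned above; the small epsilon guarding against division by zero is an added assumption, and the toy data is illustrative.

```python
import numpy as np

def weighted_knn_classify(x, X_train, y_train, k=5, eps=1e-12):
    """Weighted kNN: each of the k nearest neighbors votes with weight
    1 / squared distance; the class with the largest total weight wins."""
    X_train = np.atleast_2d(X_train)
    d2 = np.sum((X_train - x) ** 2, axis=1)
    nearest = np.argsort(d2)[:k]
    scores = {}
    for i in nearest:
        w = 1.0 / (d2[i] + eps)                  # inverse-squared-distance weight
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    return max(scores, key=scores.get)

X = np.array([[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(weighted_knn_classify(np.array([0.8, 0.6]), X, y, k=3))
```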

Weighting functions. [Figure: examples of weighting functions as a function of $d = d(x, x')$, adopted from Andrew Moore's tutorial on instance-based learning.]

kNN regression. Simplest kNN regression: let $x^{(1)}, \dots, x^{(k)}$ be the $k$ nearest neighbors of $x$ and $y^{(1)}, \dots, y^{(k)}$ their labels; predict $\hat{y} = \frac{1}{k} \sum_{j=1}^{k} y^{(j)}$. Problems of kNN regression for fitting functions: discontinuities in the estimated function (solution: weighted or kernel regression); 1NN suffers from noise fitting; kNN with $k > 1$ smooths away noise, but has other deficiencies, e.g., it flattens the ends.

kNN regression: examples. [Figures: kNN regression fits on three datasets for k = 1 and k = 9, adopted from Andrew Moore's tutorial on instance-based learning.]

Weighted (or kernel) kNN regression. Give higher weights to nearer neighbors: $\hat{y} = f(x) = \frac{\sum_{j \in N_k(x)} w_j(x)\, y^{(j)}}{\sum_{j \in N_k(x)} w_j(x)}$. In weighted kNN regression we can also use all training examples instead of just $k$: $\hat{y} = f(x) = \frac{\sum_{j=1}^{n} w_j(x)\, y^{(j)}}{\sum_{j=1}^{n} w_j(x)}$.
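A sketch of weighted (kernel) kNN regression with a Gaussian kernel; passing k=None falls back to using all training examples, mirroring the second formula above. The toy sine data is illustrative.

```python
import numpy as np

def kernel_knn_regress(x, X_train, y_train, k=None, sigma=0.5):
    """Weighted kNN regression: a Gaussian-kernel weighted average of the
    neighbors' targets. With k=None, all training points are used."""
    X_train = np.atleast_2d(X_train)
    d2 = np.sum((X_train - x) ** 2, axis=1)
    idx = np.arange(len(d2)) if k is None else np.argsort(d2)[:k]
    w = np.exp(-d2[idx] / (sigma ** 2))           # kernel weights K(x, x_j)
    return float(np.sum(w * y_train[idx]) / np.sum(w))

# toy 1-D usage: noisy samples of y = sin(x)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
print(kernel_knn_regress(np.array([3.0]), X, y, k=9, sigma=0.5))
```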

Kernel kNN regression. [Figures: kernel regression fits with σ = 1/32, 1/32, and 1/16 of the x-axis width.] Choosing a good parameter (kernel width) is important. [This slide has been adapted from Andrew Moore's tutorial on instance-based learning.]

Kernel kNN regression. In these datasets, some regions contain no samples, and the best kernel widths have been used. Disadvantages: the fit does not capture the simple structure of the data, and it fails to extrapolate at the edges. [Figures adopted from Andrew Moore's tutorial on instance-based learning.]

Locally weighted linear regression. For each test sample, it produces a linear approximation to the target function in a local region. Instead of computing the output by weighted averaging (as in kernel regression), we fit a parametric function locally: $\hat{y} = f(x; x^{(1)}, y^{(1)}, \dots, x^{(n)}, y^{(n)})$, with $\hat{y} = f(x; w) = w_0 + w_1 x_1 + \dots + w_d x_d$ and (unweighted local linear) cost $J(w) = \sum_{i \in N_k(x)} \left(y^{(i)} - w^T x^{(i)}\right)^2$; $w$ is found separately for each test sample.

Locally weighted linear regression. $\hat{y} = f(x; x^{(1)}, y^{(1)}, \dots, x^{(n)}, y^{(n)})$. Unweighted (on the $k$ neighbors): $J_w(x) = \sum_{i \in N_k(x)} \left(y^{(i)} - w^T x^{(i)}\right)^2$. Weighted (on the $k$ neighbors): $J_w(x) = \sum_{i \in N_k(x)} K(x, x^{(i)}) \left(y^{(i)} - w^T x^{(i)}\right)^2$, with e.g. $K(x, x^{(i)}) = e^{-\frac{\|x - x^{(i)}\|^2}{2\sigma^2}}$. Weighted on all training examples: $J_w(x) = \sum_{i=1}^{n} K(x, x^{(i)}) \left(y^{(i)} - w^T x^{(i)}\right)^2$.
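A compact sketch of locally weighted linear regression that solves the weighted least-squares problem above in closed form via the normal equations; using all training points with Gaussian weights (rather than only the k neighbors) is one of the variants listed on the slide, and the helper name and toy data are assumptions.

```python
import numpy as np

def lwlr_predict(x, X_train, y_train, sigma=0.5):
    """Locally weighted linear regression: solve the weighted least-squares
    problem min_w sum_i K(x, x_i) (y_i - w^T x_i)^2 and predict w^T x."""
    X = np.atleast_2d(X_train)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])          # add bias column
    xb = np.concatenate([[1.0], np.atleast_1d(x)])
    d2 = np.sum((X - x) ** 2, axis=1)
    K = np.exp(-d2 / (2 * sigma ** 2))                     # Gaussian weights
    W = np.diag(K)
    # normal equations for weighted least squares: (X^T W X) w = X^T W y
    w = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y_train)
    return float(xb @ w)

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
print(lwlr_predict(np.array([3.0]), X, y, sigma=0.5))
```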

Locally weighted linear regression: example. [Figures: fits with σ = 1/16, 1/32, and 1/8 of the x-axis width, adopted from Andrew Moore's tutorial on instance-based learning.] The results are more appropriate than those of weighted kNN regression.

Locally weighted regression: summary. Idea 1: weighted kNN regression, using a weighted average of the outputs of the neighbors of $x$ (or of all training data): $\hat{y} = \frac{\sum_{i=1}^{k} y^{(i)} K(x, x^{(i)})}{\sum_{j=1}^{k} K(x, x^{(j)})}$. Idea 2: locally weighted parametric regression, fitting a parametric model (e.g., a linear function) to the neighbors of $x$ (or to all training data). Implicit assumption: the target function is reasonably smooth. [Figure: a locally fitted curve in the (x, y) plane.]

Parametric vs. nonparametric methods. Is the SVM classifier parametric? $\hat{y} = \mathrm{sign}\!\left(w_0 + \sum_{i:\, \alpha_i > 0} \alpha_i y^{(i)} K(x, x^{(i)})\right)$. In general, we cannot summarize it in a simple parametric form: we need to keep the support vectors around (possibly all of the training data). However, the $\alpha_i$ are a kind of parameter found in the training phase.

Instance-based learning: summary. Learning is just storing the training data; prediction on a new data point is based on the training data themselves. An instance-based learner does not rely on assumptions about the structure of the underlying density function. With large datasets, instance-based methods are slow at prediction time; kd-trees, Locality-Sensitive Hashing (LSH), and other kNN approximations can help.

Reference. T. Mitchell, Machine Learning, McGraw-Hill, 1997. [Chapter 8]