Supervised Learning: Non-parametric Estimation

Edmondo Trentin, March 18, 2018

Non-parametric Estimates

No assumptions are made on the form of the pdfs.¹ There are three major instances of non-parametric estimates:

1. We want $P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$ and we try to estimate the pdf $p(x \mid \omega_i)$ relying on the available data.
2. We want $P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$ and we try to estimate $P(\omega_i \mid x)$ directly, that is, the discriminant function (posterior probability).
3. The feature space $X$ is first projected onto a sub-space $Y$ (of reduced dimensionality) via $\phi : X \to Y$ such that, given $x$ and letting $y = \phi(x)$, computing $P(\omega_i \mid y) = \frac{p(y \mid \omega_i)\,P(\omega_i)}{p(y)}$ turns out to be easier than computing $P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$.

¹ This is the fundamental point: we recognize that we do not know anything!

Non-parametric pdf Estimation

Let us elaborate on the relation between pdf and probability: what is the probability $P$ that a generic pattern $x$, drawn from a certain pdf $p(\cdot)$, belongs to an arbitrary region $R$ of the feature space? We have:

$$P = \int_R p(x)\,dx \qquad (1)$$

Hence, $P$ is an averaged (over $R$) version of $p(x)$. If we can estimate $P$, then we can come up with an estimate of this averaged version of $p(x)$. Assume we have collected a random sample of $n$ data, $x_1, \ldots, x_n$, independently and identically distributed (iid) according to $p(x)$. If $k$ out of the $n$ data fall in $R$, we can estimate $P$ via the relative frequency:

$$P \simeq \frac{k}{n} \qquad (2)$$

If $p(x)$ is continuous and has little variation over $R$ (e.g., if $R$ is small), then:

$$\int_R p(x)\,dx \simeq p(x_0)\,V \qquad (3)$$

where $x_0 \in R$ and $V$ is the volume of $R$. Thus far we have $P = \int_R p(x)\,dx$ and $P \simeq k/n$, hence:

$$p(x_0) \simeq \frac{k/n}{V} \qquad (4)$$

Practical problem: to retrieve $p(x_0)$ instead of its version averaged over $R$, $V$ should shrink to 0. Since $n$ is fixed in real-world cases, this would drive $k$ to 0 as well, making the estimate $p(x_0) \simeq \frac{0}{V}$ useless.
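As a quick numerical check of Eq. (4), here is a minimal Python sketch (mine, not from the slides; the function name estimate_density_at and the numeric choices are illustrative). It draws an iid sample from a standard normal, counts the $k$ points falling in a small interval around $x_0$, and compares $(k/n)/V$ with the true density.

```python
import numpy as np

def estimate_density_at(x0, sample, half_width):
    """Estimate p(x0) as (k/n)/V, with R = [x0 - h, x0 + h] and V = 2h (Eq. 4)."""
    n = len(sample)
    k = np.sum(np.abs(sample - x0) <= half_width)   # k = number of points falling in R
    V = 2.0 * half_width                            # "volume" (length) of R
    return (k / n) / V

rng = np.random.default_rng(0)
sample = rng.standard_normal(10_000)                # iid sample from p(x) = N(x; 0, 1)

x0 = 0.5
p_hat = estimate_density_at(x0, sample, half_width=0.1)
p_true = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)    # true N(0, 1) density at x0
print(p_hat, p_true)                                # the two should be close for large n
```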

Some Theoretical Issues

Let $x_0$ be any pattern of interest, and let us imagine that the number $n$ of available data may grow unbounded. We can then build a sequence of regions $R_1, R_2, \ldots, R_n, \ldots$ such that $x_0 \in R_i$ for all $i$, as $n$ increases. We use $R_i$ to estimate $p(x_0)$ from $i$ data. Let:

- $V_n$ be the volume of $R_n$
- $k_n$ be the number of data (out of $n$) that fall in $R_n$
- $p_n(x_0)$ be the $n$-th estimate of $p(x_0)$, i.e., $p_n(x_0) = \frac{k_n/n}{V_n}$

The asymptotic necessary and sufficient conditions that ensure $p_n(x_0) \to p(x_0)$ are:

1. $\lim_{n \to \infty} V_n = 0$
2. $\lim_{n \to \infty} k_n = \infty$
3. $\lim_{n \to \infty} k_n/n = 0$ (to guarantee convergence)

How do we satisfy conditions 1, 2 and 3? There are two complementary approaches that make sure $p_n(x_0) \to p(x_0)$ in probability:

1. Fix a proper volume, say $V_n = 1/\sqrt{n}$, and determine $k_n/n$ consequently (Parzen Window).
2. Fix $k_n$, e.g. $k_n = \sqrt{n}$, and determine $V_n$ consequently, in such a way that exactly $k_n$ patterns fall in $R_n$ ($k_n$-nearest neighbor).

Parzen Window

To fix ideas, let us assume that $R_n$ is a hypercube with edge $h_n$ (thus, $V_n = h_n^d$). We define a window function (or kernel):

$$\varphi(u) = \begin{cases} 1 & |u_j| \le 1/2, \;\; j = 1, 2, \ldots, d \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

It is seen that $\varphi\!\left(\frac{x_0 - x_i}{h_n}\right)$ is a kernel having value 1 only within the hypercube centered at $x_0$ with edge $h_n$. The number $k_n$ of data in this hypercube is:

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x_0 - x_i}{h_n}\right) \qquad (6)$$

Bearing in mind that $p_n(x_0) = \frac{k_n/n}{V_n}$, we have:

$$p_n(x_0) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x_0 - x_i}{h_n}\right) \qquad (7)$$
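A direct implementation of Eqs. (5)-(7) might look like the following sketch (my own code, not from the slides; hypercube_kernel and parzen_hypercube are hypothetical names):

```python
import numpy as np

def hypercube_kernel(u):
    """phi(u) of Eq. (5): 1 if every component satisfies |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_hypercube(x0, data, h):
    """Eq. (7): p_n(x0) = (1/n) sum_i (1/V_n) phi((x0 - x_i)/h), with V_n = h^d."""
    n, d = data.shape
    V = h ** d
    u = (x0 - data) / h                       # one row per pattern x_i
    return hypercube_kernel(u).sum() / (n * V)

# usage: estimate a 2-d standard normal density at the origin
rng = np.random.default_rng(0)
data = rng.standard_normal((5000, 2))
print(parzen_hypercube(np.zeros(2), data, h=0.5))   # true value is 1/(2*pi) ~ 0.159
```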

The approach can be extended to window functions $\varphi(\cdot)$ of different form. The equation:

$$p_n(x_0) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x_i - x_0}{h_n}\right) \qquad (8)$$

tells us that, using the $n$ available data, the estimate $p_n(\cdot)$ of the unknown pdf $p(\cdot)$ is obtained by averaging the kernel function $\varphi(\cdot)$ over $x_1, \ldots, x_n$, i.e., by interpolation.² We need a $\varphi(\cdot)$ such that $p_n(\cdot)$ is a pdf, that is, $p_n(\cdot) \ge 0$ and $\int p_n(x)\,dx = 1$. If $V_n = h_n^d$, this is guaranteed by the (sufficient) conditions:

1. $\varphi(u) \ge 0 \;\; \forall u \in \mathbb{R}^d$
2. $\int \varphi(u)\,du = 1$

Exercise: check it out. A popular and effective choice for $\varphi(\cdot)$ is the Gaussian kernel, namely:

$$\varphi(u) = N(u; 0, I) \qquad (9)$$

² In fact, each kernel can be thought of as a window centered at $x_i$ and evaluated at $x_0$, since $\varphi(\cdot)$ is a symmetric function.
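Assuming the Gaussian kernel of Eq. (9) with identity covariance, a sketch of the corresponding Parzen estimate could read as follows (again illustrative code of mine; gaussian_parzen is a hypothetical name):

```python
import numpy as np

def gaussian_parzen(x0, data, h):
    """Parzen estimate with phi(u) = N(u; 0, I):
    p_n(x0) = (1/n) sum_i (1/h^d) N((x0 - x_i)/h; 0, I)."""
    n, d = data.shape
    u = (x0 - data) / h
    phi = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return phi.sum() / (n * h**d)

# usage: same 2-d standard normal sample, estimated at the origin
rng = np.random.default_rng(0)
data = rng.standard_normal((5000, 2))
print(gaussian_parzen(np.zeros(2), data, h=0.5))    # true value is 1/(2*pi) ~ 0.159
```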

Q: how deep is the impact of the window width $h_n$ on the quality of the estimate $p_n(x)$? Let us define:

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right) \qquad (10)$$

Since

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right) \qquad (11)$$

we can write:

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(x - x_i) \qquad (12)$$

Hence:

- if $h_n$ is large, then $\delta_n(x)$ is very smooth (having small variation), and $p_n(x)$ is yielded by the superposition of wide, smooth functions;
- if $h_n$ is small, then $\delta_n(x)$ is very peaked around $x = x_i$. Since $\int \delta_n(x - x_i)\,dx = \frac{1}{V_n} \int \varphi\!\left(\frac{x - x_i}{h_n}\right) dx = \int \varphi(u)\,du = 1$, if $h_n \to 0$ then $\delta_n(x - x_i)$ converges to a Dirac delta centered at $x_i$.
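To see this smoothing effect in practice, the short sketch below (an illustration of mine, not part of the slides) evaluates a 1-d Gaussian-kernel Parzen estimate on a grid for a few bandwidths: a large $h$ yields an over-smoothed curve, a small $h$ a spiky one.

```python
import numpy as np

def parzen_1d(grid, data, h):
    """1-d Gaussian-kernel Parzen estimate evaluated at every point of `grid`."""
    u = (grid[:, None] - data[None, :]) / h          # shape (len(grid), n)
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=1) / h

rng = np.random.default_rng(0)
data = rng.standard_normal(200)                      # sample from N(0, 1)
grid = np.linspace(-4, 4, 401)

for h in (2.0, 0.5, 0.05):                           # over-smoothed, reasonable, spiky
    p_hat = parzen_1d(grid, data, h)
    print(f"h = {h:4.2f}: max of estimate = {p_hat.max():.3f}")   # true max is ~0.399
```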

[Figure: Parzen window estimates $p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$ of the true pdf $p(x) = N(x; 0, 1)$, with Gaussian kernel $\varphi(u) = N(u; 0, 1)$ and window width $h_n = h_1/\sqrt{n}$, for increasing sample size $n$.]

[Figure: Parzen window estimates, with Gaussian kernel $\varphi(u) = N(u; 0, 1)$ and $h_n = h_1/\sqrt{n}$, of the bimodal density $p(x) = 1$ for $-2.5 < x < -2$, $p(x) = 0.25$ for $0 < x < 2$, and $p(x) = 0$ otherwise, for increasing sample size $n$.]

$k_n$-Nearest Neighbor

In the Parzen Window approach, the asymptotic conditions are guaranteed by taking $h_n = h_1/\sqrt{n}$. Unfortunately, in the finite-sample case the estimate is affected by the choice of the initial edge (bandwidth) $h_1$:

- if $h_1$ is too small, then $p_n(\cdot) \simeq 0$;
- if $h_1$ is too big, then $p_n(\cdot) \simeq E[p(\cdot)]$ (an over-smoothed average of the true pdf).

This inconvenience can be overcome by choosing the volume according to the nature of the data:

1. define $k_n$ as a function of $n$;
2. to estimate $p(x_0)$, consider a small ball around $x_0$ and let it grow until it embraces $k_n$ data (the $k_n$ nearest neighbors);
3. eventually, set

$$p_n(x_0) = \frac{k_n/n}{V_n}$$

as in the generic non-parametric pdf estimation setup.
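A possible sketch of this $k_n$-NN density estimate (my own code, not from the slides; knn_density is a hypothetical name, and the volume of a $d$-dimensional ball of radius $r$ is the standard $\pi^{d/2} r^d / \Gamma(d/2 + 1)$):

```python
import numpy as np
from math import gamma, pi

def knn_density(x0, data, k):
    """k_n-NN estimate p_n(x0) = (k/n)/V_n, with V_n the volume of the smallest
    ball centered at x0 that embraces the k nearest data points."""
    n, d = data.shape
    dists = np.linalg.norm(data - x0, axis=1)
    r = np.sort(dists)[k - 1]                        # radius reaching the k-th neighbor
    V = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d    # volume of a d-dimensional ball
    return (k / n) / V

rng = np.random.default_rng(0)
data = rng.standard_normal((5000, 2))
k = int(np.sqrt(len(data)))                          # k_n = sqrt(n)
print(knn_density(np.zeros(2), data, k))             # true value is 1/(2*pi) ~ 0.159
```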

Good Asymptotic Behavior

Convergence $p_n(x_0) \to p(x_0)$ is guaranteed once these two necessary and sufficient conditions hold:

1. $\lim_{n \to \infty} k_n = \infty$
2. $\lim_{n \to \infty} \frac{k_n}{n} = 0$

For instance, letting $k_n = \sqrt{n}$ satisfies both conditions. Moreover:

$$V_n = \frac{1}{p(x_0)\,\sqrt{n}} \qquad (13)$$

that is, exactly as in the Parzen Window case we have $V_n = V_1/\sqrt{n}$, but now $V_1 = 1/p(x_0)$, i.e., $V_1$ is not chosen arbitrarily: it is uniquely determined by the nature of the data. In practice, nonetheless, we usually set $k_n = k_1 \sqrt{n}$, where $k_1$ is determined empirically.

Non-parametric Decision Rule

Let $\{x_1, x_2, \ldots, x_n\}$ be a supervised sample. Take a ball (of volume $V$) around $x_0$, such that it embraces $k$ patterns, and among these $k$ patterns let $k_i$ be the number of patterns of class $\omega_i$. Since $p_n(x_0) = \frac{k/n}{V}$, we have:

$$p_n(x_0, \omega_i) = \frac{k_i/n}{V}$$

Hence, an estimate $P_n(\omega_i \mid x_0)$ of $P(\omega_i \mid x_0)$ is given by:

$$P_n(\omega_i \mid x_0) = \frac{p_n(x_0, \omega_i)}{\sum_{j=1}^{c} p_n(x_0, \omega_j)} = \frac{(k_i/n)/V}{\sum_{j=1}^{c} (k_j/n)/V} = \frac{k_i}{k}$$

i.e., the fraction of data of class $\omega_i$ that falls within the ball under consideration. The ball (that is, its volume $V$) may be fixed following either the Parzen Window or the $k_n$-nearest-neighbor philosophy.
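The rule $P_n(\omega_i \mid x_0) = k_i/k$ can be sketched as follows (illustrative code of mine; posterior_estimate and the two Gaussian classes are hypothetical):

```python
import numpy as np
from collections import Counter

def posterior_estimate(x0, data, labels, k):
    """Estimate P(omega_i | x0) as k_i / k, where k_i counts the patterns of class
    omega_i among the k patterns embraced by the ball around x0."""
    dists = np.linalg.norm(data - x0, axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest patterns
    counts = Counter(labels[i] for i in nearest)
    return {c: counts.get(c, 0) / k for c in np.unique(labels)}

# usage with two hypothetical Gaussian classes
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
print(posterior_estimate(np.zeros(2), data, labels, k=15))
```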

Nearest Neighbor Algorithm

The aforementioned decision rule, when applied along with the $k_n$-NN philosophy and $k_n = 1$, takes the (simple, yet effective) form of a popular classification algorithm. Let $Y_n = \{x_1, x_2, \ldots, x_n\}$ be the supervised sample, and let $x_0$ be a pattern to be classified.

Nearest-neighbor decision rule: assign $x_0$ to class $\omega_i$ if and only if (i) $x'_n \in Y_n$ is the nearest³ pattern to $x_0$ among all those in $Y_n$, and (ii) $x'_n$ belongs to class $\omega_i$.

It is possible to show that, asymptotically (for $n \to \infty$), the performance of NN is as good as that of $k_n$-NN. Intuitively, this is due to the fact that the probability that $x'_n$ belongs to $\omega_i$ is $P(\omega_i \mid x'_n)$; if we had many data (say, $n \to \infty$), then $P(\omega_i \mid x'_n) \simeq P(\omega_i \mid x_0)$. Using NN, indeed, we assign $x_0$ to $\omega_i$ because $\omega_i$ has taken place at $x'_n$, hence (asymptotically) at $x_0$, with the highest probability, i.e., $P(\omega_i \mid x_0)$.

³ According to the Euclidean distance.
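A minimal sketch of the nearest-neighbor rule (my own illustrative code; the two Gaussian classes are the same hypothetical example as above):

```python
import numpy as np

def nearest_neighbor_classify(x0, data, labels):
    """1-NN rule: assign x0 the label of its nearest pattern (Euclidean distance)."""
    dists = np.linalg.norm(data - x0, axis=1)
    return labels[np.argmin(dists)]

# usage with the two hypothetical Gaussian classes
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
print(nearest_neighbor_classify(np.array([1.5, 1.5]), data, labels))   # expected: class 1
```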

K-Nearest Neighbor

Given the sample $Y_n = \{x_1, x_2, \ldots, x_n\}$ and the pattern $x_0$ to be classified, we can apply the K-NN decision rule: consider the $k$ patterns in $Y_n$ that are nearest to $x_0$ (in terms of Euclidean distance), and assign $x_0$ to $\omega_i$ if and only if the latter is the class having the highest relative frequency within this sub-sample of $k$ patterns.

Remarks:

- K-NN is in the spirit of $k_n$-NN: while $k_n$-NN estimates a pdf, K-NN estimates $P(\omega_i \mid x)$;
- for $n \to \infty$, the asymptotic behavior of K-NN tends to be optimal (i.e., Bayesian);
- in 2-class cases, an odd value of $k$ is used in order to break ties;
- more accurate decisions would be taken as $k \to \infty$, but in the finite-sample case we cannot move too far away from $x_0$ (a trade-off).
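A corresponding sketch of the K-NN rule with majority vote (illustrative code of mine; with $k = 1$ it reduces to the nearest-neighbor rule above):

```python
import numpy as np
from collections import Counter

def knn_classify(x0, data, labels, k):
    """K-NN rule: majority vote among the k patterns nearest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(data - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# usage with the same two hypothetical Gaussian classes
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
print(knn_classify(np.array([1.5, 1.5]), data, labels, k=15))          # expected: class 1
```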