Nonparametric Techniques for Density Estimation (DHS Ch. 4)
- Introduction
- Estimation Procedure
- Parzen Window Estimation
- Parzen Window Example
- k_n-Nearest Neighbor Estimation

Introduction

Suppose you do not know the form of the densities. Try to estimate p(x|S_k) (the likelihood) or P(S_k|x) (the a posteriori probability) => estimation of probability density functions.

Consider sample vectors x_1, x_2, ..., x_J drawn independently from a class with probability density p(x). The probability that a vector x lies in a region R is

  P = \int_R p(x') \, dx'

P is a smoothed or averaged version of the density function p(x).

[Binomial case] The probability that exactly k of the J vectors lie in R is binomial (since the samples are drawn i.i.d.):

  P_k = \binom{J}{k} P^k (1 - P)^{J-k}

i.e., the probability that k samples out of J fall in R, where P is the probability that a single sample lies in R and 1 - P the probability that it does not. The mean is E{k} = JP, so k/J is a reasonable estimate for P.

Assume the region R is small and has a volume V (if we can find one). Then

  P = \int_R p(x') \, dx' \approx p(x) V,   so   p(x) \approx P/V = \int_R p(x') \, dx' \Big/ \int_R dx'.

This leads to the estimate p(x) \approx (k/J)/V. We would like to take the limit V -> 0 to reduce the smoothing of p(x), but the number of samples is finite.
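To make the relative-frequency estimate p(x) ≈ (k/J)/V concrete, here is a minimal Python/NumPy sketch (my own illustration, not code from the lecture): it draws J samples from a known standard normal, counts how many fall in a small region R of length V around a chosen point x, and compares the estimate with the true density. The specific values of J, V, and x are arbitrary.

    # Sketch of the relative-frequency estimate p(x) ~ (k/J)/V on a 1-D example.
    import numpy as np

    rng = np.random.default_rng(0)
    J = 10_000
    samples = rng.normal(loc=0.0, scale=1.0, size=J)   # x_1, ..., x_J ~ N(0, 1)

    x = 0.5          # point where we want p(x)
    V = 0.2          # "volume" (length) of the region R centered at x
    k = np.sum(np.abs(samples - x) < V / 2)            # samples falling in R

    p_hat = (k / J) / V
    p_true = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(f"estimate {p_hat:.3f}  vs  true {p_true:.3f}")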

Some Problems

If we fix the volume V and take more training samples => the ratio k/J will converge, but what we obtain is only the space-averaged value P/V of p(x), not p(x) itself. If instead we shrink V to zero while keeping the number of samples fixed, R eventually becomes too small to enclose any samples, and the estimate of p(x) collapses to zero.

Estimation Procedure to Estimate the Density at x

1. Form a sequence of regions R_1, R_2, ...
2. Region R_n is employed with the first n samples.
3. Let V_n be the volume of R_n.
4. Let k_n be the number of samples falling in R_n.
5. The n-th estimate of p(x) is

  p_n(x) = \frac{k_n / n}{V_n}

p_n(x) will converge to p(x) if:
(1) \lim_{n \to \infty} V_n = 0
(2) \lim_{n \to \infty} k_n = \infty
(3) \lim_{n \to \infty} k_n / n = 0
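A small numerical sketch of these three conditions (again my own illustration, assuming the choice V_n = 1/sqrt(n) introduced below): as n grows, V_n shrinks, k_n still grows without bound, k_n/n vanishes, and p_n(x) settles near the true density.

    # With V_n = 1/sqrt(n): V_n -> 0, k_n -> infinity, k_n/n -> 0,
    # and p_n(x) = (k_n/n)/V_n approaches the true density at x.
    import numpy as np

    rng = np.random.default_rng(1)
    x = 0.0
    p_true = 1.0 / np.sqrt(2.0 * np.pi)       # N(0,1) density at x = 0

    for n in (100, 10_000, 1_000_000):
        samples = rng.normal(size=n)
        V_n = 1.0 / np.sqrt(n)                # shrinking region (condition 1)
        k_n = np.sum(np.abs(samples - x) < V_n / 2.0)
        print(f"n={n:>8d}  k_n={k_n:>5d}  p_n(x)={(k_n / n) / V_n:.3f}  true={p_true:.3f}")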

There are many ways of satisfying these conditions:

1. Shrink the regions, say V_n = 1/\sqrt{n} (Parzen window).
2. Let k_n = \sqrt{n}, and let the volume V_n grow until it encloses the k_n nearest neighbors of x (nearest neighbor).

Concepts of General Techniques

Estimate p(x) from the samples x_1, ..., x_n.

Technique (1): Histogram; fixed bin size and location. Count the number of samples that fall into each bin (a crude estimate).
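As a quick sketch of Technique (1) in Python/NumPy (not from the lecture): np.histogram with density=True divides each bin count by n times the bin width, which is exactly the (k/n)/V estimate per bin. The number of bins here is an arbitrary choice.

    # Technique (1): fixed-bin histogram density estimate.
    import numpy as np

    rng = np.random.default_rng(2)
    samples = rng.normal(size=1_000)

    counts, edges = np.histogram(samples, bins=20, density=True)  # counts / (n * bin_width)
    centers = 0.5 * (edges[:-1] + edges[1:])
    for c, p_hat in zip(centers, counts):
        print(f"bin center {c:+.2f}: estimated density {p_hat:.3f}")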

Technique (2): Fixed bin size, variable bin location (i.e., sliding bins). For each x, count the number of samples that fall into a region centered at x.

Technique (3): Bin locations are set by the samples; the bin shape is a parameter. Each sample x_i gives rise to a window function centered about x_i. Estimate p(x) by summing over the window functions. With window function δ(x - x_i),

  p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)

(2) and (3) are equivalent for certain choices of δ.
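The following sketch (my own, illustrative only) implements Technique (3) with a box-shaped window: each sample contributes δ(x - x_i), and averaging those contributions reproduces the sliding-bin count of Technique (2), which is the equivalence noted above. The window width h = 0.5 is arbitrary.

    # Technique (3) with a box window; coincides with the sliding-bin estimate.
    import numpy as np

    def box_window(u, h=0.5):
        """delta(u): 1/h inside [-h/2, h/2), 0 outside (integrates to 1)."""
        return (np.abs(u) < h / 2) / h

    rng = np.random.default_rng(3)
    samples = rng.normal(size=2_000)

    x_grid = np.linspace(-3, 3, 7)
    # p_n(x) = (1/n) * sum_i delta(x - x_i), evaluated on a grid of x values
    p_n = box_window(x_grid[:, None] - samples[None, :]).mean(axis=1)
    for x, p in zip(x_grid, p_n):
        print(f"p_n({x:+.1f}) ~ {p:.3f}")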

Two Popular Nonparametric Techniques

1) Parzen Window Estimation
2) Nearest Neighbor

1) Parzen Window Estimation (DHS 4.3)

Define a window function δ(u) = δ(x - x_i) and use it to estimate p(x). Given a sample x_i, p(x_i) is nonzero, and if p(x) is continuous, p(x) is also nonzero for x close to x_i. So use a window function δ(x - x_i) centered at x_i; δ should be non-increasing in the distance from its center. The estimate of p(x) is

  p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)     (Parzen window estimate)

Note: if we choose \delta(x - x_i) = \frac{1}{V_n} \varphi\!\left(\frac{x - x_i}{h_n}\right), then the Parzen window estimate

  p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\!\left(\frac{x - x_i}{h_n}\right)

becomes a bar-graph estimate (for φ a unit hypercube window). δ is an interpolation function, and this formulation gives a more general approach to estimating density functions. To ensure that p_n(x) represents a density, require:

  (*)  \delta(u) \ge 0   and   \int \delta(u)\,du = 1
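A brief numerical check of requirement (*), as a sketch rather than anything from the slides: with a Gaussian window φ that is nonnegative and integrates to one, the resulting p_n(x) also integrates (numerically) to about one, so it is a legitimate density.

    # Check that a nonnegative, unit-integral window yields a unit-integral p_n.
    import numpy as np

    def gaussian_window(u):
        return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

    rng = np.random.default_rng(4)
    samples = rng.normal(size=500)
    h = 0.3

    x = np.linspace(-6, 6, 2001)
    p_n = gaussian_window((x[:, None] - samples[None, :]) / h).mean(axis=1) / h

    dx = x[1] - x[0]
    print("integral of phi :", gaussian_window(x).sum() * dx)   # ~ 1
    print("integral of p_n :", p_n.sum() * dx)                  # ~ 1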

Let \delta_n(x) = \frac{1}{V_n} \varphi\!\left(\frac{x}{h_n}\right), with V_n = h_n^d (d = dimensionality of x).

The choice of the scale or width of δ_n(x) is important: h_n affects both the amplitude and the width of the window.
- Small width => high resolution in p_n(x), but a noisy estimate.
- Large width => p_n(x) will be over-smoothed.

The choice of h_n (or V_n) affects p_n(x). If V_n is too large, the estimate suffers from too little resolution; if V_n is too small, it suffers from too much statistical variability. With a limited number of samples, the best we can do is accept a compromise. With an unlimited number of samples, we can let V_n slowly approach zero as n increases, and p_n(x) converges to the unknown density p(x).
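To illustrate this trade-off, a short sketch (mine, with arbitrary widths) evaluates the same Gaussian-window Parzen estimate with a very small, a moderate, and a very large h on a fixed sample set; the error against the true N(0,1) density tends to be larger at both extremes than at the moderate width.

    # Width trade-off: small h is noisy, large h over-smooths.
    import numpy as np

    def parzen(x, samples, h):
        u = (x[:, None] - samples[None, :]) / h
        return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

    rng = np.random.default_rng(5)
    samples = rng.normal(size=200)
    x = np.linspace(-3, 3, 13)
    p_true = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

    for h in (0.01, 0.2, 2.0):
        err = np.max(np.abs(parzen(x, samples, h) - p_true))
        print(f"h = {h:4.2f}: max |p_n - p| on the grid = {err:.3f}")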

Parzen Window Example (DHS 4.3.3)

The unknown density p(x) is normal: p(x) = N(x; \mu = 0, \sigma^2 = 1), the zero-mean, unit-variance, univariate normal density.

Choose a window function:

  \varphi(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left\{ -\tfrac{1}{2} u^2 \right\}

  \delta_n(x - x_i) = \frac{1}{h_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right) = \frac{1}{\sqrt{2\pi}\, h_n} \exp\!\left\{ -\frac{1}{2} \left(\frac{x - x_i}{h_n}\right)^2 \right\}

Window width: h_n = h_1 / \sqrt{n}, where h_1 is a parameter at our disposal.

  p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(x - x_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\, h_n} \exp\!\left\{ -\frac{1}{2} \left(\frac{x - x_i}{h_n}\right)^2 \right\}

(x_i is an observed sample.)
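This example translates directly into a short Python/NumPy sketch (my own code, with h_1 = 1 chosen arbitrarily); as n grows, the estimate on a small grid of x values approaches the true N(0,1) density.

    # Gaussian Parzen estimate with shrinking width h_n = h_1 / sqrt(n).
    import numpy as np

    def parzen_normal(x_grid, samples, h1=1.0):
        """p_n(x) = (1/n) sum_i 1/(sqrt(2*pi)*h_n) * exp(-0.5*((x - x_i)/h_n)**2)."""
        n = samples.size
        h_n = h1 / np.sqrt(n)
        u = (x_grid[:, None] - samples[None, :]) / h_n
        return np.exp(-0.5 * u**2).mean(axis=1) / (np.sqrt(2.0 * np.pi) * h_n)

    rng = np.random.default_rng(6)
    x_grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    p_true = np.exp(-0.5 * x_grid**2) / np.sqrt(2.0 * np.pi)

    for n in (16, 256, 4096):
        est = parzen_normal(x_grid, rng.normal(size=n), h1=1.0)
        print(f"n={n:5d}:", np.round(est, 3), " true:", np.round(p_true, 3))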

Classification Example

A classifier can be based on Parzen-window estimation: estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior.

Figure 4.8: Decision regions for a Parzen-window classifier depend upon the choice of window function. In general, the training error can be made arbitrarily low by making the window width sufficiently small, but a low training error does not guarantee a small test error.

Curse of dimensionality: the demand for a larger number of samples grows exponentially with the dimensionality of the feature space.
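A hypothetical two-class sketch of such a classifier (the data, priors, and window width below are invented for illustration): estimate p(x | class) for each class with a Gaussian Parzen window and assign a test point to the class with the larger prior times estimated density, which is proportional to the posterior.

    # Two-class Parzen-window classifier sketch.
    import numpy as np

    def parzen_density(x, samples, h):
        u = (x - samples) / h
        return np.exp(-0.5 * u**2).mean() / (h * np.sqrt(2.0 * np.pi))

    rng = np.random.default_rng(7)
    class0 = rng.normal(loc=-1.0, scale=1.0, size=300)   # training samples, class 0
    class1 = rng.normal(loc=+1.5, scale=0.5, size=300)   # training samples, class 1
    priors = (0.5, 0.5)
    h = 0.3

    def classify(x):
        scores = [priors[0] * parzen_density(x, class0, h),
                  priors[1] * parzen_density(x, class1, h)]
        return int(np.argmax(scores))            # maximum (prior * density) ~ max posterior

    for x in (-2.0, 0.0, 1.0, 2.0):
        print(f"x = {x:+.1f} -> class {classify(x)}")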

Parzen window techniques: advantages and disadvantages

Advantage
- Generality: no a priori assumptions (except continuity of p(x)). Given enough samples, the estimate is guaranteed to converge to the correct density p(x).

Disadvantages
- The number of samples required is generally quite large.
- The number of samples required grows exponentially with the number of dimensions in the feature space.
- The choice of the sizes of the regions V_n is important.

Choosing the window function (DHS 4.3.6)

One of the problems in the Parzen-window approach is the choice of the sequence of cell-volume sizes V_1, V_2, ..., or, equivalently, of the overall window size. If V_n = V_1 / \sqrt{n}, the results for any finite n will be sensitive to the choice of the initial volume V_1. If V_1 is too small, most of the volume will be empty; if V_1 is too large, important spatial variations in p(x) could be lost due to averaging.