Introduction to local (nonparametric) density estimation methods

A slecture by Yu Lu for ECE 662, Spring 2014

1. Introduction

This slecture introduces two local density estimation methods: Parzen density estimation and k-nearest neighbor density estimation. Local density estimation is also referred to as non-parametric density estimation. To make things clear, let's first look at parametric density estimation. In parametric density estimation, we assume that there exists a density function which can be determined by a set of parameters. The set of parameters is estimated from the sample data and is later used in designing the classifier. However, in some practical situations the assumption that there exists a parametric form of the density function does not hold. For example, it is very hard to fit a multimodal probability distribution with a simple function. In this case, we need to estimate the density function in a nonparametric way, meaning that the density function is estimated locally, based on a small set of neighboring samples. Because of this locality, local (nonparametric) density estimation is less accurate than parametric density estimation. In the following text the word "local" is preferred over "nonparametric".

It is noteworthy that it is very difficult to obtain an accurate local density estimate, especially when the dimension of the feature space is high. So why do we bother using local density estimation? Because our goal is not to get an accurate estimate, but rather to use the estimate to design a classifier that performs well. An inaccurate local density estimate does not necessarily lead to a poor decision rule.

2. General Principle

In local density estimation the density function p(x) can be approximated by

$$p_n(x) \simeq \frac{k_n}{n\,v_n} \qquad (1)$$

where $v_n$ is the volume of a small region R around the point x, $n$ is the total number of samples $x_i$ ($i = 1, \dots, n$) drawn according to p(x), and $k_n$ is the number of $x_i$'s which fall into region R. The reason why p(x) can be approximated this way is that p(x) does not vary much within a relatively small region, so the probability mass of region R can be approximated by $p(x)\,v_n$, which in turn is approximately $k_n/n$.

Some examples of region R in different dimensions: i) a line segment in one dimension, ii) a circle or rectangle in two dimensions, iii) a sphere or cube in three dimensions, iv) a hypersphere or hypercube in d dimensions (d > 3).

Three conditions we need to pay attention to when using formula (1) are:

i) $\lim_{n\to\infty} v_n = 0$. This is because if $v_n$ is fixed, then $p_n(x)$ only represents the average probability density as n grows larger, but what we need is the point probability density, so we should have $v_n \to 0$ as $n \to \infty$.

ii) $\lim_{n\to\infty} k_n = \infty$. This makes sure that we do not get zero probability density.

iii) $\lim_{n\to\infty} k_n/n = 0$. This makes sure that $p_n(x)$ does not diverge.

3. Parzen Density Estimation

In Parzen density estimation $v_n$ is directly determined by n, while $k_n$ is a random variable which denotes the number of samples that fall into $v_n$. Assume that the region R is a d-dimensional hypercube with edge length $h_n$, so that

$$v_n = (h_n)^d$$

Conditions equivalent to the three conditions above are

$$\lim_{n\to\infty} v_n = 0 \quad \text{and} \quad \lim_{n\to\infty} n\,v_n = \infty$$

Therefore $v_n$ can be chosen as $v_n = h/\sqrt{n}$ or $v_n = h/\ln n$, where h is an adjustable constant. Now that the relationship between $v_n$ and n is defined, the next step is to determine $k_n$. To do so, we define a window function as follows:

$$\varphi\left(\frac{x - x_i}{h_n}\right) = \begin{cases} 1, & \lVert x - x_i \rVert_\infty \le h_n/2 \\ 0, & \text{otherwise} \end{cases}$$

where the $x_i$ ($i = 1, 2, \dots, n$) are the given samples and x is the point where the density is to be estimated. Thus we have

$$k_n = \sum_{i=1}^{n} \varphi\left(\frac{x - x_i}{h_n}\right)$$

and

$$p_n(x) = \frac{k_n}{n\,v_n} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{v_n}\,\varphi\left(\frac{x - x_i}{h_n}\right)$$

The function $\varphi$ is called a Parzen window function; it enables us to count the number of sample points in the hypercube with edge length $h_n$. According to [2], using the hypercube as the window function may lead to discontinuities in the estimate. This is due to the superposition of sharp pulses centered at the given sample points when $h_n$ is small. To overcome this shortcoming, we can consider a more general form of window function than the hypercube. Note that if the following two conditions are met, the estimated $p_n(x)$ is guaranteed to be a proper density:

$$\varphi(x) \ge 0 \quad \text{and} \quad \int \varphi(x)\,dx = 1$$

Therefore a better choice of window function, which removes the discontinuity, is the Gaussian window:

$$\varphi\left(\frac{x - x_i}{h_n}\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - x_i}{h_n}\right)^2\right)$$

The estimated density is then given by

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n} \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - x_i}{h_n}\right)^2\right) \qquad (2)$$

Consider the one-dimensional case and assume that $v_n = h/\sqrt{n}$, so that $h_n = v_n = h/\sqrt{n}$, where h is an adjustable constant. Substituting into formula (2) we have

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{\sqrt{n}}{h} \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{(x - x_i)\sqrt{n}}{h}\right)^2\right)$$

We can see that if n equals one, $p_n(x)$ is just the window function. As n approaches infinity, $p_n(x)$ can converge to any complex form. If n is relatively small, $p_n(x)$ is very sensitive to the value of h. In general a small h leads to noise error while a large h leads to over-smoothing error, which can be illustrated by the following example.

In this experiment the samples are 5000 points on a 2-D plane with a Gaussian distribution. The mean vector is [1 2] and the covariance matrix is [1 0; 0 1]. Choose a rectangle Parzen window with $h_n = 4/\sqrt[4]{n}$, so that $v_n = (h_n)^2 = 16/\sqrt{n}$. Fig. 1 shows the sample distribution, Fig. 2 shows the ideal probability density distribution, and Fig. 3 shows the result of Parzen density estimation (a small code sketch of this setup follows Figure 1 below).

Figure 1. 5000 sample points on a 2-D plane with a Gaussian distribution
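For readers who want to reproduce something like this experiment, below is a minimal Python sketch. It is not the author's original code (which is not given in the text); the random seed and the evaluation grid are assumptions made for illustration.

import numpy as np

# Draw 5000 samples from the 2-D Gaussian described above.
rng = np.random.default_rng(0)
n = 5000
mean = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.0], [0.0, 1.0]])
samples = rng.multivariate_normal(mean, cov, n)

h_n = 4.0 / n ** 0.25      # h_n = 4 / n^(1/4)
v_n = h_n ** 2             # v_n = (h_n)^2 = 16 / sqrt(n)

def parzen_estimate(x):
    """p_n(x) = k_n / (n * v_n) with a rectangle (hypercube) window."""
    # k_n counts samples whose max-coordinate distance to x is at most h_n / 2.
    inside = np.all(np.abs(samples - x) <= h_n / 2, axis=1)
    return inside.sum() / (n * v_n)

# Evaluate the estimate on a coarse grid around the mean; Figs. 1-3 are plots
# of the samples, the true density, and an estimate like this one.
xs = np.linspace(-2.0, 4.0, 25)
ys = np.linspace(-1.0, 5.0, 25)
p = np.array([[parzen_estimate(np.array([x, y])) for x in xs] for y in ys])
print("peak of the Parzen estimate:", p.max())   # true peak is 1/(2*pi), about 0.159

Doubling or halving h_n in this sketch should reproduce the over-smoothing and noise effects shown in Figs. 4 and 5 below.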

Fgure. The deal probablty desty dstrbuto Fgure 3. The result of Parze desty estmato Next we chage the value of h ad see how t affects the estmato. Fg. 4 shows the result of Parze desty estmato whe h s twce ts tal value. Fg. 5 shows the result of Parze desty estmato whe h s ts tal value dvded by two. We ca see that the results agree wth the aforesad property of h.

Figure 4. The result of Parzen density estimation when h is twice its initial value

Figure 5. The result of Parzen density estimation when h is half its initial value

To design a classifier using the Parzen window method [3], we estimate the densities for each class and classify a test point by the label corresponding to the maximum posterior. Some advantages and disadvantages of Parzen density estimation:

Advantages:
i) $p_n(x)$ can converge to any complex form as n approaches infinity;
ii) it is applicable to data with any distribution.

Disadvantages:

i) it needs a large number of samples to obtain an accurate estimate;
ii) it is computationally expensive and not suitable for feature spaces with very high dimensions;
iii) the adjustable constant h has a relatively heavy influence on the decision boundaries when n is small, and it is not easy to choose in practice.

4. K-Nearest Neighbor Density Estimation

In k-nearest neighbor density estimation (the acronym k-NN is used in the following text), k is directly determined by n, while v is a random variable: the volume that encompasses just k sample points, inside v or on its boundary. If v is a sphere, it can be given by

$$v_k(x) = \lim_{h \to h_k} \frac{\pi^{d/2}\,h^d}{\Gamma(d/2 + 1)} = \frac{\pi^{d/2}\,(h_k)^d}{\Gamma(d/2 + 1)}$$

where $h_k$ is the radius of the sphere with center x; $h_k$ equals $\lVert x_{(k)} - x \rVert$, where $x_{(k)}$ is the k-th closest sample point to x. The probability density at x is then approximated by

$$p(x) \simeq \frac{k - k_1}{n\,v_k(x)} \qquad (3)$$

where $k_1$ is the number of sample points on the boundary of $v_k(x)$. Most of the time formula (3) can be rewritten as

$$p(x) = \frac{k - 1}{n\,v_k(x)}$$

It can be proved that $E[p(x)] = p(x)$, i.e. this estimate is unbiased.

In Parzen density estimation v only depends on n and is the same for all test points, while in k-NN v is smaller in high-density areas and larger in low-density areas. This strategy seems more reasonable than the one used to determine v in Parzen density estimation, since now v is adaptive to the local density.

In practice, when we want to classify data using k-NN estimation, it turns out that we can get the posterior $p(w_i \mid x)$ directly without worrying about p(x). If k samples fall into the volume v around point x, and among those k samples there are $k_i$ samples belonging to class $w_i$, then we have

$$p(w_i, x) \simeq \frac{k_i}{n\,v}$$

The posterior $p(w_i \mid x)$ is given by

$$p(w_i \mid x) = \frac{p(w_i, x)}{p(x)} = \frac{p(w_i, x)}{\sum_{j=1}^{m} p(w_j, x)} = \frac{k_i}{k} \qquad (4)$$

where m is the number of classes. Formula (4) gives us one simple decision rule: the class of a test point x is the same as the most frequent class among the k nearest points of x. Simple and intuitive, isn't it?
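To make the decision rule concrete, here is a minimal Python sketch; the function name and the toy data are assumptions for illustration, not code from the original experiments. It computes the posteriors $k_i/k$ of formula (4) by brute-force distance computation and predicts the majority class.

import numpy as np

def knn_classify(x, train_x, train_y, k):
    """Return the predicted class and the posteriors k_i / k at point x."""
    dists = np.linalg.norm(train_x - x, axis=1)   # Euclidean distance to every sample
    nearest = np.argsort(dists)[:k]               # indices of the k nearest samples
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    posteriors = dict(zip(labels, counts / k))    # k_i / k for each class present
    return labels[np.argmax(counts)], posteriors

# Two illustrative 2-D classes (not the 200-point experiment shown below).
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal([0.0, 0.0], 1.0, (100, 2)),
                     rng.normal([3.0, 3.0], 1.0, (100, 2))])
train_y = np.array([0] * 100 + [1] * 100)
print(knn_classify(np.array([2.0, 2.0]), train_x, train_y, k=5))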

Having said that, choosing k in k-NN is still a nontrivial problem, just as choosing h is in Parzen density estimation. A small k leads to noisy decision boundaries while a large k leads to over-smoothed boundaries, which is illustrated by the following example. In this experiment the samples are 200 pre-labeled (red or blue) points. The task is to find the classification boundaries under different values of k. Figs. 6-9 show the results.

Figure 6. k-NN decision boundaries experiment (k=2)

Figure 7. k-NN decision boundaries experiment (k=3)

Fgure 8. k-nn decso boudares expermet (k=5) Fgure 9. k-nn decso boudares expermet (k=8) I practce we ca use cross-valdato to choose the best k. Below lsts some advatages ad dsadvatages of k-nn: Advatages: ) decso performace s good f s large eough; ) applcable to data wth ay dstrbuto; ) smple ad tutve. Dsadvatages: ) eed a large umber of samples to obta a accurate estmato, whch s evtable local desty estmato; ) computatoally expesve, low effcecy for feature space wth very hgh dmesos; ) choosg the best k s otrval.

5. References

[1] Mireille Boutin, ECE662: Statistical Pattern Recognition and Decision Making Processes, Purdue University, Spring 2014.
[2] http://www.cse.buffalo.edu/~jcorso/t/cse555/fles/aote_8feb_oprm.pdf
[3] http://www.csd.uwo.ca/~olga/courses/cs434a_541a/lecture6.pdf