Nonparametric Techniques for Density Estimation (DHS Ch. 4)

- Introduction
- Estimation Procedure
- Parzen Window Estimation
- Parzen Window Example
- k_n-Nearest Neighbor Estimation

Introduction

Suppose you don't know the form of the densities. Try to estimate p(x | S_k) (the likelihood) or P(S_k | x) (the a posteriori probability) => estimation of probability density functions.

Consider sample vectors x_1, x_2, ..., x_J, drawn independently from a class with probability density p(x). The probability that a vector x lies in a region R is

    P = \int_R p(x') \, dx'

P is a smoothed or averaged version of the density function p(x).

[Consider the binomial case] The probability that exactly k of the J vectors lie in R is binomial (if the samples are drawn i.i.d.):

    P_k = \binom{J}{k} P^k (1 - P)^{J-k}

where P is the probability that a sample lies in R and 1 - P the probability that it does not. The mean is E{k} = JP, so k/J is a reasonable estimate for P.

Assume the region R is small and has a volume V (if we can find one). Then

    P = \int_R p(x') \, dx' \approx p(x) V,
    \qquad
    p(x) \approx \frac{P}{V} = \frac{\int_R p(x') \, dx'}{\int_R dx'}

This leads to the estimate p(x) ≈ (k/J) / V. We would like to take the limit V -> 0 to reduce the smoothing of p(x), but the number of samples is finite.
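As a concrete illustration of the estimate p(x) ≈ (k/J)/V, here is a minimal sketch (not from DHS; the density, query point, and region size are arbitrary choices) that counts samples from a standard normal falling in a small interval around a query point:

```python
import numpy as np

rng = np.random.default_rng(0)

J = 10_000                          # number of i.i.d. samples
samples = rng.standard_normal(J)    # draw from the "unknown" density p = N(0, 1)

x = 0.5                             # query point
V = 0.2                             # volume (length) of the region R around x

# k = number of samples falling in R = [x - V/2, x + V/2]
k = np.sum(np.abs(samples - x) < V / 2)

p_hat = (k / J) / V                 # the basic estimate p(x) ~ (k/J)/V
p_true = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(f"estimate {p_hat:.3f} vs true {p_true:.3f}")
```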
Some Problems

- If we fix the volume and take more training samples, the ratio k/J will converge, but the estimate is only a space-averaged value of p(x).
- If we shrink V to zero and fix the number of samples, R eventually becomes too small to enclose any samples, making the estimate of p(x) close to zero.

Estimation Procedure to Estimate the Density of x

1. Form a sequence of regions R_1, R_2, ...
2. Region R_n is employed for n samples.
3. Let V_n be the volume of R_n.
4. Let k_n be the number of samples falling in R_n.
5. The n-th estimate of p(x) is p_n(x) = (k_n / n) / V_n.

p_n(x) will converge to p(x) if (see the sketch of the argument below):

    (1) \lim_{n \to \infty} V_n = 0
    (2) \lim_{n \to \infty} k_n = \infty
    (3) \lim_{n \to \infty} k_n / n = 0
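A short way to see why the estimate is sensible (a sketch of the standard heuristic, reusing the binomial mean from the previous slide):

```latex
E\{k_n\} = n P_n \approx n\,p(x)\,V_n
\quad\Longrightarrow\quad
E\{p_n(x)\} = \frac{E\{k_n\}}{n\,V_n} \approx p(x)
```

Roughly: condition (1) removes the spatial averaging, condition (2) lets k_n/n concentrate around P_n, and condition (3) is needed for p_n(x) = (k_n/n)/V_n to converge at all as V_n -> 0.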
There are many ways of satisfying these conditions:

1. Shrink the regions, say V_n = V_1 / sqrt(n)  (Parzen window)
2. Let k_n = sqrt(n), and let the volume grow until it encloses the k_n nearest neighbors of x  (nearest neighbor)

Concepts of General Techniques

Estimate p(x) from samples x.

Technique (1): Histogram, fixed bin size and location. Count the number of samples that fall into each bin (a crude estimate; see the sketch below).
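A minimal sketch of technique (1) using numpy (the bin count and range are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)

# Fixed bins: count samples per bin, normalize so the estimate integrates to ~1.
counts, edges = np.histogram(samples, bins=20, range=(-4.0, 4.0))
bin_width = edges[1] - edges[0]
p_hist = counts / (len(samples) * bin_width)   # crude, piecewise-constant estimate

# p_hist[j] estimates p(x) for x in [edges[j], edges[j+1])
print(p_hist.max(), p_hist.sum() * bin_width)  # peak height; total mass ~ 1
```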
Technique (2): Fixed bin size, variable bin location (i.e., sliding bins). Count the number of samples that fall into a region centered at x, for each x.

Technique (3): Bin locations set by the samples; the bin shape is a parameter. Each sample x_i gives rise to a window function δ(x - x_i) centered about x_i. Estimate p(x) by summing over the window functions:

    p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)

(2) and (3) are equivalent for certain choices of δ, as the sketch below illustrates.
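The following sketch (a hypothetical example, not from the text) implements technique (2) as a sliding count and technique (3) as a sum of rectangular window functions, and confirms numerically that they coincide for this choice of δ:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.standard_normal(500)
h = 0.5                                   # bin (window) width, arbitrary choice
xs = np.linspace(-3, 3, 121)              # query points

# Technique (2): slide a bin of width h, centered at each x, and count.
p_slide = np.array([np.sum(np.abs(samples - x) < h / 2) for x in xs])
p_slide = p_slide / (len(samples) * h)

# Technique (3): sum rectangular window functions delta(x - x_i) of unit area.
def delta(u, h=h):
    return (np.abs(u) < h / 2) / h        # 1/h inside the window, 0 outside

p_window = np.mean(delta(xs[:, None] - samples[None, :]), axis=1)

print(np.allclose(p_slide, p_window))     # True: (2) and (3) agree here
```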
Two Popular Nonparametric Techniques

1) Parzen Window Estimation
2) Nearest Neighbor

1) Parzen Window Estimation (DHS 4.3)

Define a window function δ(u) = δ(x - x_i) and estimate p(x). Given a sample x = x_i, p(x_i) is nonzero, and if p(x) is continuous, p(x) is nonzero for x close to x_i. So use a window function δ(x - x_i) centered at x_i; δ should be non-increasing away from its center. The estimate of p(x) is

    p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)    (Parzen window estimate)

Note: if we choose

    \delta(x - x_i) = \frac{1}{V_n} \varphi\!\left(\frac{x - x_i}{h_n}\right)

with φ a unit hypercube window, then the Parzen window estimate

    p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\!\left(\frac{x - x_i}{h_n}\right)

reduces to the bar-graph estimate. δ is an interpolation function; this gives a more general approach to estimating density functions.

To ensure that p_n(x) represents a density, require:

    (*)  \delta(u) \geq 0, \qquad \int \delta(u) \, du = 1
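A minimal sketch of the Parzen estimate (a Gaussian window is assumed here; any δ satisfying (*) would do), with a numerical check that p_n is itself a density:

```python
import numpy as np

def parzen(x, samples, h):
    """Parzen estimate p_n(x) = (1/n) sum_i delta(x - x_i),
    with delta(u) = (1/h) * phi(u/h) and phi a unit Gaussian window."""
    u = (x[:, None] - samples[None, :]) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.mean(phi / h, axis=1)

rng = np.random.default_rng(2)
samples = rng.standard_normal(200)
xs = np.linspace(-6, 6, 1201)
p_n = parzen(xs, samples, h=0.3)

# Since delta(u) >= 0 and integrates to 1, p_n integrates to 1 as well:
print(p_n.sum() * (xs[1] - xs[0]))   # ~ 1.0
```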
Let

    \delta_n(x) = \frac{1}{V_n} \varphi\!\left(\frac{x}{h_n}\right),
    \qquad V_n = h_n^d

The choice of the scale or width of δ_n(x) is important: h_n affects both the amplitude and the width.

- Small width => high resolution in p_n(x), but noisy
- Large width => p_n(x) will be over-smoothed
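The normalization makes the trade-off explicit (a standard observation, cf. DHS 4.3): shrinking h_n raises the peak while narrowing the support,

```latex
\delta_n(0) = \frac{\varphi(0)}{h_n^{d}} \longrightarrow \infty
\quad (h_n \to 0),
\qquad
\int \delta_n(x)\,dx = 1 \ \text{ for every } h_n
```

so in the limit δ_n approaches a Dirac delta at each sample and p_n(x) degenerates into a sum of spikes.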
The choice of h_n or V_n affects p_n(x):

- If V_n is too large, the estimate will suffer from too little resolution.
- If V_n is too small, the estimate will suffer from too much statistical variability.
- With a limited number of samples, the best we can do is accept a compromise (illustrated in the sketch below).
- With an unlimited number of samples, we can let V_n slowly approach zero as n increases, and p_n(x) converges to the unknown density p(x).
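To make the compromise concrete, this sketch (a hypothetical setup) measures the mean absolute error of a Gaussian-window estimate for several widths; typically both very small and very large h do worse than an intermediate value:

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.standard_normal(100)          # limited number of samples
xs = np.linspace(-4, 4, 401)
p_true = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)

for h in (0.01, 0.1, 0.5, 2.0):             # candidate window widths
    u = (xs[:, None] - samples[None, :]) / h
    p_n = np.mean(np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h), axis=1)
    print(f"h = {h:4.2f}  mean abs error = {np.mean(np.abs(p_n - p_true)):.4f}")
```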
Parzen Window Example (DHS 4.3.3)

The unknown density p(x) is normal: p(x) ~ N(\mu, \sigma^2) with \mu = 0 and \sigma^2 = 1, i.e., the zero-mean, unit-variance, univariate normal density.

Choose a Gaussian window function:

    \varphi(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\tfrac{1}{2} u^2 \right)

    \delta_n(x - x_i) = \frac{1}{h_n} \varphi\!\left( \frac{x - x_i}{h_n} \right)
                      = \frac{1}{\sqrt{2\pi}\, h_n} \exp\!\left\{ -\frac{1}{2} \left( \frac{x - x_i}{h_n} \right)^{\!2} \right\}

Window width: h_n = h_1 / sqrt(n), where h_1 is a parameter at our disposal. Then

    p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(x - x_i)
           = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\, h_n}
             \exp\!\left\{ -\frac{1}{2} \left( \frac{x - x_i}{h_n} \right)^{\!2} \right\}

(x_i is an observed sample.)
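A sketch of this example (the values of h_1 and n are arbitrary choices; DHS Figure 4.5 plots the analogous curves). As n grows and h_n = h_1/sqrt(n) shrinks, p_n(0) should approach the true value p(0) ≈ 0.399:

```python
import numpy as np

rng = np.random.default_rng(4)
h1 = 1.0                                    # free parameter h_1
xs = np.linspace(-3, 3, 301)

for n in (1, 10, 100, 10_000):              # growing sample sizes
    samples = rng.standard_normal(n)        # from the unknown N(0, 1)
    hn = h1 / np.sqrt(n)                    # window width h_n = h_1 / sqrt(n)
    u = (xs[:, None] - samples[None, :]) / hn
    p_n = np.mean(np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * hn), axis=1)
    print(n, f"p_n(0) = {p_n[np.argmin(np.abs(xs))]:.3f}")
```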
Classification Example

A classifier based on Parzen-window estimation: estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior (a minimal sketch follows below).

Figure 4.8: Decision regions for a Parzen-window classifier depend upon the choice of window function.

In general, the training error can be made arbitrarily low by making the window width sufficiently small. But a low training error does not guarantee a small test error.

Curse of dimensionality: the demand for samples grows exponentially with the dimensionality of the feature space.
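A minimal sketch of such a classifier (synthetic two-class 2-D data and equal priors are assumed for illustration; this is not the setup of the DHS figure):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two synthetic 2-D classes (Gaussian clouds, chosen only for illustration).
X0 = rng.normal(loc=[-1.0, 0.0], scale=0.8, size=(100, 2))
X1 = rng.normal(loc=[+1.0, 0.0], scale=0.8, size=(100, 2))

def parzen_density(x, X, h):
    """Parzen estimate with a d-dimensional Gaussian window of width h."""
    d = X.shape[1]
    u = (x - X) / h                                    # (n, d)
    phi = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return np.mean(phi) / h**d

def classify(x, h=0.5, priors=(0.5, 0.5)):
    # Pick the class with the maximum estimated posterior ~ p(x|c) P(c).
    scores = [parzen_density(x, X, P_c * 0 + h) * P_c   # h shared across classes
              for X, P_c in zip((X0, X1), priors)]
    return int(np.argmax(scores))

print(classify(np.array([-1.0, 0.0])))   # expect class 0
print(classify(np.array([+1.0, 0.5])))   # expect class 1
```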
Parzen Window Techniques: Advantages and Disadvantages

Advantage
- Generality: no a priori assumptions (except continuity of p(x)). Given enough samples, the estimate is guaranteed to converge to the correct density p(x).

Disadvantages
- The number of samples required is generally quite large.
- The number of samples required grows exponentially with the number of dimensions in the feature space.
- The choice of the sizes of the regions V_n is important.

Choosing the Window Function (DHS 4.3.6)

One of the problems in the Parzen-window approach is the choice of the sequence of cell-volume sizes V_1, V_2, ..., or the overall window size. If V_n = V_1 / sqrt(n), the results for any finite n will be sensitive to the choice of the initial volume V_1 (see the sketch below):

- If V_1 is too small, most of the volume will be empty.
- If V_1 is too large, important spatial variations in p(x) could be lost due to averaging.
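The sensitivity to V_1 (here, to the initial width h_1) can be seen directly: with n fixed and finite, different initial widths give visibly different estimates (a hypothetical illustration; the true p(0) is about 0.399):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50                                     # finite sample size
samples = rng.standard_normal(n)
xs = np.linspace(-3, 3, 301)

for h1 in (0.05, 1.0, 20.0):               # too small / moderate / too large
    hn = h1 / np.sqrt(n)                   # h_n = h_1 / sqrt(n)
    u = (xs[:, None] - samples[None, :]) / hn
    p_n = np.mean(np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * hn), axis=1)
    print(f"h1 = {h1:5.2f}  p_n(0) = {p_n[150]:.3f}  max p_n = {p_n.max():.3f}")
```

With h1 too small the estimate is a field of spikes (mostly empty volume); with h1 too large the spatial variation of p(x) is averaged away.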