12 - Nonparametric Density Estimation


1 Nonparametric Density Estimation
ST 697, Fall 2017, University of Alabama

2 Density Review

3 Continuous Random Variables
[Figure: the CDF F(x) and density f(x) of a continuous random variable, plotted against x]

4 Discrete Random Variables
[Figure: the CDF F(x) and probability mass function p(x) of a discrete random variable, plotted against x]

5 Empirical CDF and PDF
[Figure: the ECDF F_n(x) and empirical PDF f_n(x) of n = 50 draws from X ~ N(0, 1), overlaid on the Gaussian CDF (µ = 0, σ = 1)]

6 Review of Parametric Density Estimation
1. Choose a parametric distribution family
2. Estimate its parameters:
   - Method of Moments
   - Maximum Likelihood
   - Bayesian (also choose a prior)
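To make step 2 concrete, here is a minimal R sketch of parametric estimation for a Gaussian family; the simulated data and grid are illustrative choices, and the MLE of sigma uses the 1/n (not 1/(n-1)) divisor.

# Simulated data (illustrative)
set.seed(1)
x = rnorm(100, mean = 2, sd = 1.5)

# Maximum likelihood estimates for a N(mu, sigma^2) family
mu.hat    = mean(x)                      # MLE of mu
sigma.hat = sqrt(mean((x - mu.hat)^2))   # MLE of sigma (1/n divisor)

# Estimated parametric density, evaluated on a grid
xg = seq(min(x), max(x), length = 200)
f.hat = dnorm(xg, mean = mu.hat, sd = sigma.hat)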

7 Density Histograms

8 Density Histograms
Histograms estimate the density as a piecewise constant function:
\hat{f}(x) = \sum_{j=1}^J b_j(x) \hat{\theta}_j
where b_j(x) = \mathbf{1}(x \in \text{bin}_j)/h_j and \text{bin}_j = [t_j, t_{j+1})
- t_1 < t_2 < ... < t_{J+1} are the break points for the bins
- h_j = |[t_j, t_{j+1})| = t_{j+1} - t_j is the bin width of bin j
- bin widths do not have to be equal
- \text{bin}_j \cap \text{bin}_k = \emptyset for j \neq k
Some slight adjustments need to be made for the multivariate case.

9 Estimating Density Histograms
Observe data D = {X_1, X_2, ..., X_n}, and denote n_j as the number of observations in bin j.
- \hat{\theta}_j = \hat{p}_j = n_j / n is the usual (MLE) estimate
- Shrinkage estimator:
\hat{\theta}_j = \left(\frac{n}{n+A}\right)\hat{p}_j + \left(\frac{A}{n+A}\right)u_j = \pi \hat{p}_j + (1 - \pi) u_j
where
- A is the number of pseudo observations (A u_j in bin j)
- u_j = h_j / \sum_k h_k (uniform prior)
- 0 \le \pi \le 1 (alternative representation)
What are the constraints on {\theta_j} that ensure \hat{f} is a proper density? (A sketch of both estimators follows.)
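A minimal R sketch of both estimators on equal-width bins; the break points, A, and the simulated data are illustrative choices.

set.seed(1)
x = rnorm(100)
t = seq(-4, 4, by = 0.5)           # break points t_1 < ... < t_{J+1}
J = length(t) - 1
h = diff(t)                        # bin widths h_j
n = length(x)

nj = table(cut(x, breaks = t))     # counts n_j per bin
p.hat = as.numeric(nj) / n         # MLE of theta_j

A = 5                              # number of pseudo observations (illustrative)
u = h / sum(h)                     # uniform prior
theta.shrink = (n / (n + A)) * p.hat + (A / (n + A)) * u

# Piecewise-constant density heights: theta_j / h_j
f.mle    = p.hat / h
f.shrink = theta.shrink / h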

10 Regular Histogram
[Figure: a regular (equal bin width) histogram density estimate, Density vs. x]

11 Histogram with Shrinkage
[Figure: histogram density estimates comparing the MLE, the prior, and shrinkage estimates with A = 1 and a larger value of A]

12 German Credit Data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
data = read.delim(url, sep = " ", header = FALSE)
germancredit = data[, -21]        # the 20 predictor columns
Y = data[, 21]                    # class label (1 = good, 2 = bad)
G = ifelse(Y == 1, "good", "bad")
good = (G == "good")

13 Histogram Density Ratio: German Credit Data
[Figure: shrinkage (A = 1) histogram density estimates of x2 for each class, and the resulting log density ratio log(f1/f0) against x2]

14 Binning
The bins of a histogram can be of fixed width (h_j = h) or variable width.
- Smaller bins have less bias, but more variance
- Shrinkage lowers variance, but increases bias
- Percentile binning sets the bin widths according to the number of observations
Creating J bins from break points {t_j : j = 1, 2, ..., J + 1} (a sketch of all three follows):
- {t_1, h, J}: equal bins of width h
- {t_1, t_{J+1}, J}: equal bins of width h = (t_{J+1} - t_1)/J
- {t_j : t_j = F_n^{-1}(j/J)}: percentile binning
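A minimal R sketch of the three binning schemes; the data, J, t_1, and h are illustrative choices.

set.seed(1)
x = rexp(200)
J = 10

# (a) equal bins of width h starting at t_1
t1 = min(x); h = 0.5
breaks.a = t1 + h * (0:J)

# (b) J equal bins spanning [t_1, t_{J+1}]
breaks.b = seq(min(x), max(x), length = J + 1)

# (c) percentile binning: break points from the empirical quantile function
breaks.c = quantile(x, probs = (0:J) / J)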

15 Binning Example
[Figure: histogram density estimates with regular binning and with percentile binning, each overlaid on the true density]

16 Binning Example: Basis and Parameters
[Figure: the basis functions b(x) and estimated parameters theta for fixed binning and for percentile binning]

17 Optimal Bin Width
The optimal bin widths for density estimation may not be the same as the optimal bin widths for estimating the log density ratio (as we need for naive Bayes or anomaly detection).
- If we are interested in the log density ratio, should we use the same binning?
- What if the classes are unbalanced?

18 Histogram Bin Width: German Credit Data (J = 10 Bins)
[Figure: density estimates of x and log(f1/f0) under fixed bin widths vs. percentile bin widths]

19 Histogram Bin Width: German Credit Data (J = 20 Bins)
[Figure: density estimates of x and log(f1/f0) under fixed bin widths vs. percentile bin widths]

20 Histogram Bin Width: German Credit Data (J = 5 Bins)
[Figure: density estimates of x and log(f1/f0) under fixed bin widths vs. percentile bin widths]

21 Review of Histograms
We have considered regular and percentile histograms for nonparametric density estimation.
Histograms can be thought of as a local method of density estimation:
- Points are local to each other if they fall in the same bin
- "Local" is determined by the bin breaks
But this has some issues:
- Some observations are "closer" to observations in a neighboring bin than to observations in their own bin
- The estimate is not smooth (but the true density can often be assumed smooth)
- Bin shifts can have a big influence on the resulting density
Do parametric approaches estimate a density locally?

22 Histogram Neighborhood
Consider estimating the density at a location x_0. For a regular histogram (with bin width h), the MLE density is
\hat{f}(x_0) = \frac{n_j}{n h} for x_0 \in \text{bin}_j,
which is a function of the number of observations in bin j. But how do you feel about this if x_0 is close to the bin boundary?
[Figure: histogram of x, Density vs. x]

23 Buffalo Snowfall: Bin Width
[Figure: histogram density estimates of the Buffalo snowfall data at several bin widths]
Scott (1992) Multivariate Density Estimation, Chapter 4

24 Buffalo Snowfall: Sensitivity to Bin Shifts
[Figure: histogram density estimates of the Buffalo snowfall data under shifted bin breaks]
Scott (1992) Multivariate Density Estimation, Chapter 4

25 Kernel Density Estimation

26 Local Density Estimation - Moving Window
Consider again estimating the density at a location x_0.
Regular histogram (with midpoints m_j and bin width h):
\hat{f}(x_0) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} \mathbf{1}\!\left( \frac{|x_i - m_j|}{h} \le \frac{1}{2} \right) for x_0 \in B_j
Consider a moving window approach:
\hat{f}(x_0) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} \mathbf{1}\!\left( \frac{|x_i - x_0|}{h} \le \frac{1}{2} \right)
This gives a more pleasing definition of "local" by centering a bin at x_0. Equivalently, this estimates the derivative of the ECDF:
\hat{f}(x_0) = \frac{F_n(x_0 + h/2) - F_n(x_0 - h/2)}{h}
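A minimal R sketch of the moving-window estimator via the ECDF difference; the data, bandwidth, and grid are illustrative choices.

set.seed(1)
x = rnorm(200)
h = 0.5
Fn = ecdf(x)                       # empirical CDF F_n

# Moving-window density estimate at a grid of locations x_0
x0 = seq(-3, 3, length = 100)
f.hat = (Fn(x0 + h/2) - Fn(x0 - h/2)) / h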

27 Uniform Kernel
[Figure: the uniform-kernel (moving window) density estimate overlaid on a histogram and the true density]

28 Kernel Density Estimation
The moving window approach looks better than a histogram with the same bin width, but it is still not smooth. Instead of giving every observation in the window the same weight, we can assign a weight according to its distance from x_0.
[Figure: uniform and Gaussian kernel weights as a function of distance from x_0]

29 Gaussian Kernel
[Figure: the Gaussian-kernel density estimate overlaid on a histogram, the uniform-kernel estimate, and the true density]

30 Kernel Density Estimation
More generally, the weights K_h(u) = h^{-1} K(u/h) are called kernel functions. Thus, a kernel density estimator is of the form
\hat{f}(x_0) = \frac{1}{n} \sum_{i=1}^n K_h(x_i - x_0)
where the smoothing parameter h is called the bandwidth and controls how fast the weights decay as a function of the distance from x_0.
In R, the density() function uses a bandwidth (the bw argument) that is the standard deviation of the kernel. (A usage sketch follows.)
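A minimal sketch of density() in R; since bw is on the standard-deviation scale, the same bw gives comparably smooth estimates across kernels. The data and bandwidth are illustrative.

set.seed(1)
x = rnorm(200)

# Gaussian kernel; bw = standard deviation of the kernel
d1 = density(x, kernel = "gaussian", bw = 0.3)

# Epanechnikov kernel with the same (sd-scale) bandwidth
d2 = density(x, kernel = "epanechnikov", bw = 0.3)

plot(d1)
lines(d2, col = "red")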

31 Kernel Properties
A kernel is usually considered to be a symmetric probability density function:
- K_h(u) \ge 0 (non-negative)
- \int K_h(u) du = 1 (integrates to one)
- K_h(u) = K_h(-u) (symmetric about 0)
Notice that if the kernel has compact support, so does the resulting density estimate.
The Gaussian kernel is the most popular, but has infinite support:
- This is good when the true density has infinite support
- However, it requires more computation
- But its properties are easier to calculate (e.g., for bandwidth selection)

32 Popular Kernels
[Figure: kernel weight vs. distance from x_0 for the gaussian, epanechnikov, rectangular, triangular, biweight, and optcosine kernels at a common bw]

33 Bandwidth
The bandwidth parameter h controls the amount of smoothing.
- What happens when h → ∞?
- What happens when h → 0?
The choice of bandwidth is much more important than the choice of kernel. There is no standard representation of the bandwidth parameter h. In R, density() uses two arguments to set the bandwidth: bw is the standard deviation of the kernel, and width is the length of the support of the kernel. The two are very different, so check carefully what an author is using in their definition of bandwidth.

34 Bandwidth Effects
[Figure: density estimates with the gaussian and biweight kernels at bw = 0.1, bw = 0.3, and a larger bw]

35 Another Perspective
We have described KDE as taking a local density around location x_0. An alternative perspective is to view KDE as an n-component mixture model with mixture weights of 1/n:
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n K_h(x_i - x) = \sum_{i=1}^n \frac{1}{n} f_i(x), where f_i(x) = K_h(x_i - x)
Or, in a basis function representation:
\hat{f}(x) = \sum_{i=1}^n \theta_i b_i(x), where \theta_i = \frac{1}{n} and b_i(x) = K_h(x_i - x)

36 Component Kernels
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n K_h(x_i - x)
[Figure: the individual component kernels and their sum, the KDE, plotted against x]

37 Kernel Based Density Ratio
Using kernel density estimation (KDE), the density ratio (useful for naive Bayes, etc.) becomes
\frac{f_1(x)}{f_0(x)} \approx \frac{\hat{f}_1(x)}{\hat{f}_0(x)} = \frac{\frac{1}{n_1} \sum_{i=1}^{n_1} K_{h_1}(x_i - x)}{\frac{1}{n_0} \sum_{j=1}^{n_0} K_{h_0}(x_j - x)}
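A minimal R sketch of the KDE log density ratio, evaluating both class densities on a common grid; the data, bandwidths, and grid are illustrative choices.

set.seed(1)
x1 = rnorm(100, mean = 1)    # class 1 observations
x0 = rnorm(300, mean = 0)    # class 0 observations

grid = seq(-4, 5, length = 200)
f1 = density(x1, bw = 0.4, from = min(grid), to = max(grid), n = length(grid))
f0 = density(x0, bw = 0.4, from = min(grid), to = max(grid), n = length(grid))

log.ratio = log(f1$y) - log(f0$y)   # log(f1/f0) on the common grid
plot(grid, log.ratio, type = "l")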

38 KDE: German Credit Data
[Figure: class-conditional KDEs of x and log(f1/f0) under three bandwidth settings: bw1 = bw0 = 3; bw1 = bw0 = 10; and bw1 = 1 with a small bw0]

39 Edge Effects
Sometimes there are known boundaries in the data (e.g., the amount of rainfall cannot be negative). Here are some options:
1. Do nothing. As long as not many events are near the boundary and the bandwidth is small, this may not be too problematic; however, it will lead to increased bias around the boundaries.
2. Transform the data (e.g., x → log(x)), estimate the density in the transformed space, then transform back (sketched below).
3. Use an edge correction technique.
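A minimal R sketch of option 2 for data supported on (0, ∞): estimate the density of y = log(x) by KDE, then map back with the change-of-variables Jacobian, f_X(x) = f_Y(log x)/x. The data and bandwidth are illustrative choices.

set.seed(1)
x = rexp(300)                 # positive data with a boundary at 0
y = log(x)                    # transformed data, supported on the whole line

dy = density(y, bw = 0.3)     # KDE in the transformed space

# Back-transform: f_X(x) = f_Y(log x) / x
xg = exp(dy$x)
fx = dy$y / xg

plot(xg, fx, type = "l", xlim = c(0, 4))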

40 Edge Correction (1)
The simplest approach requires a modification of the kernels near the boundary. Let S = [a, b]. Recall that \int_a^b K_h(x_i - x)\,dx should be 1 for every i, but for x_i near a boundary, \int_a^b K_h(x_i - x)\,dx < 1.
Denote w_h(x_i) = \int_a^b K_h(x_i - x)\,dx. The resulting edge-corrected KDE becomes
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n w_h(x_i)^{-1} K_h(x_i - x)

41 Edge Correction (2)
Another approach corrects the kernel for each particular x. Denote w_h(x) = \int_a^b K_h(u - x)\,du. The resulting edge-corrected KDE becomes
\hat{f}(x) = \frac{1}{n} w_h(x)^{-1} \sum_{i=1}^n K_h(x_i - x)
This approach is not guaranteed to integrate to 1, but for some problems this is not a major concern.
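A minimal R sketch of correction (1) with a Gaussian kernel on S = [0, ∞), where the normalizer w_h(x_i) = \int_0^\infty K_h(x_i - x)\,dx has the closed form pnorm(x_i/h). The data and bandwidth are illustrative choices.

set.seed(1)
x = rexp(300)        # data supported on [0, Inf)
h = 0.3

# w_h(x_i): mass of the Gaussian kernel at x_i that falls inside [0, Inf)
w = pnorm(x / h)

# Edge-corrected KDE evaluated on a grid
xg = seq(0, 4, length = 200)
f.hat = sapply(xg, function(x0) mean(dnorm(x0 - x, sd = h) / w))
plot(xg, f.hat, type = "l")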

42 Adaptive Kernels
Up to this point, we have considered fixed bandwidths. But what if we let the bandwidth vary? There are two main approaches:
Balloon estimator (bandwidth varies with the evaluation point x):
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n K_{h(x)}(x_i - x) = \frac{1}{n\,h(x)} \sum_{i=1}^n K\!\left( \frac{x_i - x}{h(x)} \right)
Sample point estimator (bandwidth varies with each observation x_i):
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n K_{h(x_i)}(x_i - x) = \frac{1}{n} \sum_{i=1}^n h(x_i)^{-1} K\!\left( \frac{x_i - x}{h(x_i)} \right)

43 k-nearest Neighbor Density Approach ST 697 Fall /49 Like what we discussed for percentile binning, we can estimate the density from the size of the window containing the k nearest observations. k ˆf k (x) = nv k (x) where V k (x) is the volume of a neighborhood that contains the k-nearest neighbors. This is an adaptive version of the moving window (uniform kernel) approach Probably won t integrate to 1 It is also possible to use the k-nn distance as a way to select an adaptive bandwidth h(x).

44 Multivariate Density Estimation
1. Multivariate kernels (e.g., K(u) = N(0, \Sigma)):
\hat{f}(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}\, n} \sum_{i=1}^n \exp\left( -\frac{1}{2} (x - x_i)^T \Sigma^{-1} (x - x_i) \right)
Let \Sigma = h^2 A where |A| = 1, so |\Sigma|^{1/2} = h^d:
\hat{f}(x) = \frac{1}{(2\pi)^{d/2} h^d\, n} \sum_{i=1}^n \exp\left( -\frac{1}{2 h^2} (x - x_i)^T A^{-1} (x - x_i) \right)
2. Product kernels (A = I_d):
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \prod_{j=1}^d K_{h_j}(x_j - x_{ij})
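A minimal R sketch of a two-dimensional Gaussian product-kernel estimate with per-dimension bandwidths; the data, bandwidths, and evaluation point are illustrative choices.

set.seed(1)
n = 200
X = cbind(rnorm(n), rnorm(n, sd = 2))   # n x 2 data matrix
h = c(0.3, 0.6)                          # per-dimension bandwidths h_j

# Product-kernel KDE evaluated at a single point x0
x0 = c(0, 0)
f.hat = mean(dnorm(X[, 1] - x0[1], sd = h[1]) *
             dnorm(X[, 2] - x0[2], sd = h[2]))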

45 Mixture Models
Mixture models offer a flexible compromise between kernel density and parametric methods. A mixture model is a mixture of densities
f(x) = \sum_{j=1}^p \pi_j\, g_j(x \mid \xi_j)
where
- 0 \le \pi_j \le 1 and \sum_{j=1}^p \pi_j = 1 are the mixing proportions
- g_j(x \mid \xi_j) are the component densities
This idea is behind model-based clustering (ST 640), radial basis functions (ESL 6.7), etc. Usually the component parameters \xi_j and weights \pi_j have to be estimated (the EM algorithm shows up here; a sketch follows).
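A minimal EM sketch in R for a two-component Gaussian mixture; the data, component count, starting values, and iteration count are all illustrative choices.

set.seed(1)
x = c(rnorm(150, 0, 1), rnorm(50, 4, 0.7))   # data from a 2-component mixture

# Starting values
pi1 = 0.5; mu = c(-1, 1); s = c(1, 1)

for (iter in 1:100) {
  # E-step: posterior responsibility of component 1 for each x_i
  d1 = pi1 * dnorm(x, mu[1], s[1])
  d2 = (1 - pi1) * dnorm(x, mu[2], s[2])
  r = d1 / (d1 + d2)

  # M-step: weighted MLE updates
  pi1   = mean(r)
  mu[1] = sum(r * x) / sum(r)
  mu[2] = sum((1 - r) * x) / sum(1 - r)
  s[1]  = sqrt(sum(r * (x - mu[1])^2) / sum(r))
  s[2]  = sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r))
}

# Fitted mixture density on a grid
xg = seq(min(x), max(x), length = 200)
f.hat = pi1 * dnorm(xg, mu[1], s[1]) + (1 - pi1) * dnorm(xg, mu[2], s[2])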

46 Kernels, Mixtures, and Splines
All of these methods can be written as
f(x) = \sum_{j=1}^J b_j(x)\,\theta_j
- For KDE: J = n, b_j(x) = K_h(x - x_j), \theta_j = 1/n (bandwidth h estimated)
- For mixture models: b_j(x) = g_j(x \mid \xi_j), \theta_j = \pi_j (J, \xi_j, \pi_j estimated)
- For B-splines: b_j(x) is a B-spline (\theta_j, and maybe J and the knots, estimated)
Note: For density estimation, \log f(x) = \sum_{j=1}^J b_j(x)\,\theta_j may be easier to estimate.

47 Kernel Regression

48 Other Issues
- One major problem with KDE and kernel regression (and k-NN) is that all of the training data must be stored in memory. For large data, this is unreasonable.
- Multi-dimensional kernels are not very good for high dimensions (unless simplified by using product kernels).
- But temporal kernels are good for adaptive procedures (e.g., only remember the most recent observations). Think about what an EWMA is doing.

49 Smoothing for Categorical Data
