Density Estimation
Frank Porter, SLUO Lectures on Statistics, August 2006
Density Estimation

Density estimation deals with the problem of estimating probability density functions (PDFs) based on data sampled from the PDF. We may:
- use an assumed form of the distribution, parameterized in some way (parametric statistics); or
- avoid making assumptions about the form of the PDF (nonparametric statistics).

We are concerned more here with the non-parametric case (see Roger Barlow's lectures for parametric statistics).
Some References (I)

- Richard A. Tapia & James R. Thompson, Nonparametric Density Estimation, Johns Hopkins University Press, Baltimore (1978).
- David W. Scott, Multivariate Density Estimation, John Wiley & Sons, Inc., New York (1992).
- Adrian W. Bowman and Adelchi Azzalini, Applied Smoothing Techniques for Data Analysis, Clarendon Press, Oxford (1997).
- B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman and Hall (1986).
- K. S. Cranmer, Kernel Estimation in High Energy Physics, Comput. Phys. Commun. 136, 198 (2001) [arXiv:hep-ex/0011057].
Some References (II)

- M. Pivk & F. R. Le Diberder, sPlot: a statistical tool to unfold data distributions, Nucl. Instrum. Meth. A 555, 356 (2005).
- R. Cahn, How sPlots are Best (2005).
- BaBar Statistics Working Group, Recommendations for Display of Projections in Multi-Dimensional Analyses (MDgraphRec.pdf).

Additional specific references will be noted in the course of the lectures.
Preliminaries

We'll couch the discussion in terms of observations (a dataset) from some experiment. Our dataset consists of the values x_i, i = 1, 2, ..., n, obtained by repeated sampling from a (presumed unknown) probability distribution: IID, for Independent, Identically Distributed. We'll note generalizations here and there. Order is not important; if we are discussing a time series, we could introduce ordered pairs {(x_i, t_i), i = 1, ..., n} and call it two-dimensional [but beware the correlations then; probably not IID!]. In general, our quantities can be multi-dimensional; no special notation will be used to distinguish one- from multi-variate cases. We'll discuss where issues enter with dimensionality.
Notation

At our convenience we may use E(x), ⟨x⟩, and x̄ all to mean "expectation":
$$E(x) \equiv \langle x \rangle \equiv \bar{x} \equiv \int x\,p(x)\,dx,$$
where p(x) is the probability density function (PDF) for x (or, more generally, p(x)dx ≡ μ(dx) is the probability measure). Estimators are denoted with a "hat". In these lectures we'll be concerned with estimators for the density function itself, hence $\hat{p}(x)$ is a random variable giving our estimate for p(x). We will not be especially rigorous; for example, we won't make a notational distinction between the random variable and an instance.
Motivation

Why do we want to estimate densities? Well, that is the whole point... Harder question: why non-parametric estimates?
- Comparison with models (which may be parametric)
- May be easier/better than parametric modeling for efficiency corrections and background subtraction
- Visualization
- Unfolding
- Comparing samples
R, A Toolkit, er, Language, You Might be Interested In...

The S language was developed with statistical analysis of data in mind.

> x <- rnorm(100,10,1)
> hist(x,xlim=range(5,15))

[Figure: the resulting histogram of x, with frequency on the vertical axis.]

The free, open-source version is R, from the R Project; downloads are available for Linux/Mac OS X/Windows. The commercial version is S-Plus.
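As a quick illustration (my own sketch, not from the original slides), R's built-in density() function already produces a kernel density estimate of the kind discussed below; here it is overlaid on a normalized histogram of the same sample:

# Sketch: compare a normalized histogram with R's kernel density estimate.
set.seed(1)
x <- rnorm(100, 10, 1)                   # 100 points from N(mean=10, sd=1)
hist(x, xlim = c(5, 15), freq = FALSE)   # freq=FALSE: area normalized to 1
lines(density(x), col = "red")           # Gaussian kernel, default bandwidth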
Empirical Probability Density Function

Place a delta function at each data point. The estimator (EPDF, for "Empirical Probability Density Function") is
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i).$$
Note that x could be multi-dimensional here. This is the sampling density for the bootstrap (more later; also see Ilya Narsky's lectures).
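A minimal sketch (my own illustration, with an assumed dataset): drawing from the EPDF just means drawing the observed points with replacement, which is exactly the bootstrap resampling step:

# Sampling from the EPDF = sampling the data with replacement.
x <- rnorm(100, 10, 1)                                  # the observed dataset
boot_sample <- sample(x, size = length(x), replace = TRUE)
# e.g., a bootstrap estimate of the uncertainty on the sample mean:
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(boot_means)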
The Histogram

Perhaps our most ubiquitous density estimator is the histogram:
$$h(x) = \sum_{i=1}^{n} B(x - \tilde{x}_i; w),$$
where $\tilde{x}_i$ is the center of the bin in which observation x_i lies, w is the bin width, and
$$B(x; w) = \begin{cases} 1 & x \in (-w/2,\, w/2) \\ 0 & \text{otherwise} \end{cases}$$
(called the indicator function in probability). [Figure: the box function B(x - x̃_i; w), of unit height and width w, centered on x̃_i.] This is written for uniform bin widths, but may be generalized to differing widths with appropriate relative normalization factors. The estimator for the probability density function (PDF) is:
$$\hat{p}(x) = \frac{1}{nw}\, h(x).$$
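A hedged sketch of the same estimator in R (the sample and bin width are my own choices, not from the slides):

# Histogram density estimator computed by hand: p-hat = h(x)/(n*w).
x <- rnorm(100, 10, 1)
w <- 0.5
breaks <- seq(floor(min(x)), ceiling(max(x)), by = w)
counts <- table(cut(x, breaks))                 # bin populations
p_hat  <- as.numeric(counts) / (length(x) * w)  # density estimate per bin
sum(p_hat * w)                                  # check: integrates to 1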
Histogram Example

[Figure: two panels of the same sample plotted against m(pπ) − m(p) − m(π). Left: EPDF; Right: histogram with w = 10 MeV, in events/10 MeV.] The actual sampling is 100 points from a Δ(1232) Breit-Wigner (Cauchy) on a second-order polynomial background. The background probability is 50%.
Criticisms of Histogram as Density Estimator

- Discontinuous even if the PDF is continuous.
- Dependence on bin size and bin origin.
- Information from the location of a datum within a bin is ignored.
Kernel Estimation

Take the histogram, but replace the bin function B with something else:
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} k(x - x_i; w),$$
where k(x; w) is the kernel function, normalized to unity: ∫ k(x; w) dx = 1. Usually we are interested in kernels of the form
$$k(x - x_i; w) = \frac{1}{w} K\!\left(\frac{x - x_i}{w}\right);$$
indeed this may be used as the definition of "kernel". The kernel estimator for the PDF is then:
$$\hat{p}(x) = \frac{1}{nw} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w}\right).$$
The role of the parameter w as a smoothing parameter is now clearer.
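A minimal sketch of this estimator written out explicitly in R (the Gaussian choice of K and the bandwidth value are my assumptions):

# Fixed-kernel estimator: p-hat(x0) = (1/(n*w)) * sum_i K((x0 - x_i)/w).
kde <- function(xgrid, x, w) {
  sapply(xgrid, function(x0) mean(dnorm((x0 - x) / w)) / w)
}
x     <- rnorm(100, 10, 1)
xgrid <- seq(5, 15, length.out = 200)
plot(xgrid, kde(xgrid, x, w = 0.5), type = "l")  # compare: density(x, bw = 0.5)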
Multi-Variate Kernel Estimation

The explicit multi-variate case, in d = 2 dimensions:
$$\hat{p}(x, y) = \frac{1}{n\, w_x w_y} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w_x}\right) K\!\left(\frac{y - y_i}{w_y}\right).$$
This is a product kernel form, with the same kernel in each dimension, except for possibly different smoothing parameters. It does not have correlations. The kernels we have introduced are classified more explicitly as "fixed kernels": the smoothing parameter is independent of x.
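For concreteness, a sketch of the product-kernel form in R (again with an assumed Gaussian K and made-up bandwidths):

# Product-kernel estimate in d = 2, evaluated at a single point (x0, y0).
kde2 <- function(x0, y0, x, y, wx, wy) {
  mean(dnorm((x0 - x) / wx) * dnorm((y0 - y) / wy)) / (wx * wy)
}
x <- rnorm(200); y <- rnorm(200)
kde2(0, 0, x, y, wx = 0.4, wy = 0.4)   # estimate of p(0, 0)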
Ideogram

A simple variant on the kernel idea is to permit the kernel to depend on additional knowledge in the data. Physicists call this an "ideogram". Most common is the Gaussian ideogram, in which each data point is entered as a Gaussian of unit area and standard deviation appropriate to that datum. This addresses a way in which the IID assumption might be broken. [Aside: be careful to get your likelihood function right if you are incorporating variable resolution information in your fits; see, e.g., Punzi.]
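A sketch of a Gaussian ideogram in R (the measurements and per-point uncertainties below are invented purely for illustration):

# Gaussian ideogram: each x_i enters as a unit-area Gaussian with its own sigma_i.
ideogram <- function(xgrid, x, sigma) {
  sapply(xgrid, function(x0) sum(dnorm(x0, mean = x, sd = sigma)) / length(x))
}
x     <- c(493.64, 493.68, 493.71)    # hypothetical measurements
sigma <- c(0.02, 0.01, 0.03)          # their quoted uncertainties
xgrid <- seq(493.5, 493.9, length.out = 400)
plot(xgrid, ideogram(xgrid, x, sigma), type = "l")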
Sample Ideograms (I)

[Figure: the ideogram of charged-kaon mass measurements, m_{K±} (MeV), from RPP 2006. The weighted average, ±0.011 with the error scaled by 2.5, combines measurements labeled DENISOV, GALL 88 (K Pb, K W), LUM, BARKOV, CHENG 75 (K Pb), and BACKENSTO (confidence level 0.001). The figure caption notes that the values of the weighted average, error, and scale factor are based upon the data in the ideogram only; they are not necessarily the same as the "best" values, obtained from a least-squares constrained fit utilizing measurements of other (related) quantities as additional information.]
Sample Ideograms (II)

Note the detailed comparison that is possible. [Figure 1: a histogram of magnetic field values (black), compared with a smoothed frequency distribution constructed using a Gaussian ideogram technique (red).] (From J. S. Halekas et al., Magnetic Properties of Lunar Geologic Terranes: New Statistical Results, Lunar and Planetary Science XXXIII (2002).)
Parametric vs Non-parametric Density Estimation (I)

The distinction is fuzzy. A histogram is non-parametric, in the sense that no assumption about the form of the sampling distribution is made. There is often an implicit assumption that the distribution is smooth on a scale smaller than the bin size; for example, we know something about the resolution of our apparatus. But the estimator of the parent distribution made with a histogram is parametric: the parameters are the populations (or frequencies) in each bin, and the estimators for those parameters are the observed histogram populations. Even more parameters than a typical parametric fit!
Parametric vs Non-parametric Density Estimation (II)

The essence of the difference may be captured in the notions of "local" and "non-local": if a datum at x_i influences the density estimator at some other point x, this is non-local. A non-parametric estimator is one in which the influence of a point at x_i on the estimate at any x with d(x_i, x) > ε vanishes, asymptotically. Notice that for a kernel estimator,
$$\hat{p}(x) = \frac{1}{nw} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{w}\right),$$
the bigger the smoothing parameter w, the more non-local the estimator. As we'll discuss, the optimal choice of smoothing parameter depends on n.
Optimization

We would like to make an optimal density estimate from our data. What does that mean? We need a criterion for "optimal". The choice of criterion is subjective; it depends on what you want to achieve. We may compare the estimator for a quantity (here, the value of the density at x) with the true value f(x):
$$\Delta(x) = \hat{f}(x) - f(x).$$
[Figure: sketch of the estimate $\hat{f}(x)$, the true f(x), and the difference Δ(x).]
Mean Squared Error (I)

A common choice in parametric estimation is to minimize the sum of the squares. We may take this idea over here and form the Mean Squared Error (MSE):
$$\mathrm{MSE}[\hat{f}(x)] \equiv E\!\left\{\left[\hat{f}(x) - f(x)\right]^2\right\} = \mathrm{Var}[\hat{f}(x)] + \mathrm{Bias}^2[\hat{f}(x)],$$
where
$$\mathrm{Var}[\hat{f}(x)] \equiv E\!\left\{\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right\}, \qquad \mathrm{Bias}[\hat{f}(x)] \equiv E[\hat{f}(x)] - f(x).$$
Mean Squared Error (II)

Since this isn't quite our familiar parameter estimation, let's take a little time to make sure it is understood. Suppose $\hat{p}(x)$ is an estimator for the PDF p(x), based on data {x_i; i = 1, ..., n}, IID from p(x). Then
$$E[\hat{p}(x)] = \int \hat{p}(x; \{x_i\})\, \mathrm{Prob}(\{x_i\})\, d^n(\{x_i\}) = \int \hat{p}(x; \{x_i\}) \prod_{i=1}^{n} \left[p(x_i)\, dx_i\right].$$
Exercise: Proof of the formula for the MSE

$$\begin{aligned}
\mathrm{MSE}[\hat{f}(x)] &= E\!\left\{\left[\hat{f}(x) - f(x)\right]^2\right\} \\
&= \int \left[\hat{f}(x; \{x_i\}) - f(x)\right]^2 \prod_{i=1}^{n} \left[p(x_i)\, dx_i\right] \\
&= \int \left[\hat{f}(x; \{x_i\}) - E(\hat{f}) + E(\hat{f}) - f(x)\right]^2 \prod_{i=1}^{n} \left[p(x_i)\, dx_i\right] \\
&= \int \left\{\left[\hat{f}(x; \{x_i\}) - E(\hat{f})\right]^2 + \left[E(\hat{f}) - f(x)\right]^2 + 2\left[\hat{f}(x; \{x_i\}) - E(\hat{f})\right]\left[E(\hat{f}) - f(x)\right]\right\} \prod_{i=1}^{n} \left[p(x_i)\, dx_i\right] \\
&= \mathrm{Var}[\hat{f}(x)] + \mathrm{Bias}^2[\hat{f}(x)] + 0.
\end{aligned}$$
[In typical treatments of parametric statistics, we assume unbiased estimators, hence the Bias term is zero. That isn't a good assumption here.]
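A quick Monte Carlo check of the decomposition in R (the true PDF, evaluation point, sample size, and bandwidth are all my own choices for illustration):

# Verify MSE = Var + Bias^2 for a Gaussian-kernel estimate at one point x0.
set.seed(2)
x0 <- 0; n <- 100; w <- 0.5
est <- replicate(2000, {
  x <- rnorm(n)                          # true PDF: standard normal
  mean(dnorm((x0 - x) / w)) / w          # p-hat(x0) for one pseudo-experiment
})
true  <- dnorm(x0)
mse   <- mean((est - true)^2)
vpop  <- mean((est - mean(est))^2)       # (population) variance of the estimator
bias  <- mean(est) - true
c(mse = mse, var_plus_bias2 = vpop + bias^2)   # the two numbers agree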
The Problem With Smoothing (I)

Theorem [Rosenblatt (1956)]: A uniform minimum variance unbiased estimator for p(x) does not exist.
- Unbiased: E[p̂(x)] = p(x) for all x.
- Uniform minimum variance: Var_p[p̂(x)] ≤ Var_p[q̂(x)] for all x and for all p(x), where q̂(x) is any other estimator of p(x).
The Problem With Smoothing (II)

For example, suppose we have a kernel estimator:
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} k(x - x_i; w).$$
Its expectation is:
$$E[\hat{p}(x)] = \frac{1}{n} \sum_{i=1}^{n} \int k(x - x_i; w)\, p(x_i)\, dx_i = \int k(x - y; w)\, p(y)\, dy.$$
Unless k(x - y) = δ(x - y), p̂(x) will be biased for some p(x). But δ(x - y) has infinite variance.
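Numerically, this expectation is just the convolution of the kernel with the true PDF, so a peak gets pulled down by smoothing. A small R sketch (the Gaussian kernel and standard-normal p(x) are my assumed example):

# E[p-hat(x0)] = integral of k(x0 - y; w) p(y) dy, evaluated at the peak x0 = 0.
w  <- 0.5
x0 <- 0
conv <- integrate(function(y) dnorm((x0 - y) / w) / w * dnorm(y),
                  lower = -Inf, upper = Inf)$value
c(true_peak = dnorm(0), expected_estimate = conv)  # conv < dnorm(0): biased low at the peak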
The Problem with Smoothing (III)

So the nice properties we strive for in parameter estimation (and sometimes achieve) are beyond reach. Intuition: smoothing lowers peaks and fills in valleys. [Figure: red curve: the PDF; histogram: a sampling from the PDF; black curve: a Gaussian kernel estimator for the PDF.]
Comment on Number of Bins in Histogram

Note: Sturges' rule, based on optimizing the MSE, was used in deciding how many bins, k, to make in the histogram:
$$k = 1 + \log_2 n.$$
The argument behind this rule has been criticized (Hyndman, 1995). Indeed we see in our example that we would by hand have selected more bins; our histogram is over-smoothed. There are other rules for optimizing the number of bins. For example, Scott's rule for the bin width is:
$$w = 3.5\, s\, n^{-1/3},$$
where s is the sample standard deviation. [More later]
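Both rules are available in base R, so one way to check them (a sketch, with an assumed sample):

# Bin-count rules (helper functions in base R's grDevices package):
x <- rnorm(100, 10, 1)
nclass.Sturges(x)                        # k = 1 + log2(n), rounded up
nclass.scott(x)                          # bins implied by w = 3.5 * s * n^(-1/3)
# Scott's bin width written out directly:
w <- 3.5 * sd(x) * length(x)^(-1/3)
ceiling(diff(range(x)) / w)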
Dependence on Smoothing Parameter

Plot showing the effect of the choice of smoothing parameter. [Figure: red: sampling PDF; black: default smoothing (w); blue: w/2 smoothing; turquoise: w/4 smoothing; green: 2w smoothing.]
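A comparable plot can be produced with R's density() by scaling its default bandwidth (a sketch; the sample and colors are my own choices):

# Effect of the smoothing parameter on a Gaussian kernel estimate.
x  <- rnorm(100, 10, 1)
d0 <- density(x)                                 # default bandwidth w
plot(d0, col = "black")
lines(density(x, bw = d0$bw / 2), col = "blue")
lines(density(x, bw = d0$bw / 4), col = "turquoise")
lines(density(x, bw = d0$bw * 2), col = "green")
curve(dnorm(x, 10, 1), add = TRUE, col = "red")  # the sampling PDF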
The Curse of Dimensionality

Roger Barlow gave a nice example of the impact of the "curse of dimensionality" in parametric statistics. It is a significant affliction in density estimation as well.
- Difficult to display and visualize as the number of dimensions increases.
- All the volume (of a bounded region) goes to the boundary (exponentially!) as the dimension increases; i.e., the data become sparse (see the sketch after this list). [Figure: the central half of each coordinate contains a fraction 1/2, 1/4, 1/8, ... of the volume as d increases.]
- Tendency for exponentially growing computational requirements with dimension. Even worse than parametric statistics.
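The sketch below (my own, with an arbitrary 5% shell) shows how quickly the volume of a unit hypercube concentrates near its boundary:

# Fraction of a unit hypercube's volume within 5% of the boundary, vs. dimension d.
d <- c(1, 2, 5, 10, 20, 50)
frac_near_boundary <- 1 - 0.9^d     # the interior is a cube of side 0.9
round(frac_near_boundary, 3)        # roughly 0.10 0.19 0.41 0.65 0.88 0.99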
Summary

We have introduced:
- Basic notions in (non-parametric) density estimation
- Some simple variations on the theme
- A foundation towards optimization
- An idea of where and how things will fail

Next: further sophistication on these ideas, and introduction of other variations in approach and application.