Density Estimation (III)
Frank Porter, SLUO Lectures on Statistics, August 2006

Yesterday:
- Cross-validation
- Adaptive kernels
- Variance (bootstrap)
- Bias (jackknife)
- Multivariate kernel estimation

Today:
- Series estimation
- Monte Carlo weighting
- Unfolding
- Non-parametric regression
- sPlots
Some References (I)

- Richard A. Tapia & James R. Thompson, Nonparametric Density Estimation, Johns Hopkins University Press, Baltimore (1978).
- David W. Scott, Multivariate Density Estimation, John Wiley & Sons, Inc., New York (1992).
- Adrian W. Bowman and Adelchi Azzalini, Applied Smoothing Techniques for Data Analysis, Clarendon Press, Oxford (1997).
- B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman and Hall (1986).
- K. S. Cranmer, Kernel Estimation in High Energy Physics, Comp. Phys. Comm. 136, 198 (2001) [hep-ex/ v1].
Some References (II)

- M. Pivk & F. R. Le Diberder, sPlot: a statistical tool to unfold data distributions, Nucl. Instr. Meth. A 555, 356 (2005).
- R. Cahn, How sPlots are Best (2005), rev splots best.pdf
- BaBar Statistics Working Group, Recommendations for Display of Projections in Multi-Dimensional Analyses, Statistics/Documents/MDgraphRec.pdf

Additional specific references will be noted in the course of the lectures.
Estimation Using Orthogonal Series (I)

We may take an alternative approach, and imagine expanding the PDF in a series of orthogonal functions:

    p(x) = \sum_{k=0}^{\infty} a_k \psi_k(x),

where

    a_k = \int \psi_k(x) p(x) \rho(x) \, dx = E[\psi_k(x) \rho(x)],

and

    \int \psi_k(x) \psi_l(x) \rho(x) \, dx = \delta_{kl}.

[\rho(x) is a weight function.]
Estimation Using Orthogonal Series (II)

Since the expansion coefficients are expectation values of functions, it is natural to substitute sample averages as estimators for them:

    \hat a_k = \frac{1}{n} \sum_{i=1}^{n} \psi_k(x_i) \rho(x_i),

and thus:

    \hat p(x) = \sum_{k=0}^{m} \hat a_k \psi_k(x),

where the number of terms m is chosen by some optimization criterion. Note the analogy between choosing m and choosing the smoothing parameter w in kernel estimators; and between choosing the kernel K and choosing the basis {\psi_k}.
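A minimal sketch of this estimator in Python with NumPy, assuming a cosine basis orthonormal on [0, 1] with weight \rho(x) = 1; the Beta(2, 2) toy sample, sample size, and cutoff m = 6 are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sample from a Beta(2, 2) density on [0, 1].
x = rng.beta(2.0, 2.0, size=2000)

def psi(k, x):
    """Orthonormal cosine basis on [0, 1] with weight rho(x) = 1."""
    if k == 0:
        return np.ones_like(x)
    return np.sqrt(2.0) * np.cos(k * np.pi * x)

m = 6  # number of terms kept; plays the role of the smoothing parameter w
# Expansion coefficients estimated by sample averages: a_k = (1/n) sum_i psi_k(x_i).
a_hat = np.array([psi(k, x).mean() for k in range(m + 1)])

def p_hat(t):
    """Truncated-series density estimate at points t."""
    return sum(a_hat[k] * psi(k, t) for k in range(m + 1))

grid = np.linspace(0.0, 1.0, 501)
est = p_hat(grid)
area = est.mean()  # Riemann approximation of the integral over [0, 1]
```

The estimate integrates to one almost exactly (only the k = 0 coefficient contributes to the normalization with this basis), and peaks near x = 0.5 as the Beta(2, 2) density does; increasing m trades bias for variance, just as shrinking w does for a kernel estimator.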
Estimation Using Orthogonal Series (III)

We are actually rather familiar with estimation using orthogonal series: it is the method of moments!

[Figure: moments analysis from G. Brandenburg et al., Determination of the K*(1800) Spin Parity, SLAC-PUB-1670 (1975).]
Using Monte Carlo Models (I)

We often build up a data model using Monte Carlo computations of different processes, which are added together to get the complete model. This may involve weighting of events, if more integrated luminosity is simulated for some processes than for others. The overall simulated empirical density is then:

    \hat p(x) = \sum_{i=1}^{n} \rho_i \delta(x - x_i),

where the weights \rho_i are unity for unweighted events, or are scaled so that each process corresponds to an event sample of some desired integrated luminosity.
Using Monte Carlo Models (II)

The weights must be included in computing the sample covariance matrix (x_i has components x_i^{(k)}, k = 1, ..., d):

    V_{kl} = \frac{ \sum_{i=1}^{n} \rho_i (x_i^{(k)} - \mu_k)(x_i^{(l)} - \mu_l) }{ \sum_j \rho_j },

where \mu_k = \sum_i \rho_i x_i^{(k)} / \sum_j \rho_j is the sample mean in dimension k. Assuming we have transformed to a diagonal system using this covariance matrix, our product kernel density based on this simulation is then:

    \hat p_0(x) = \frac{1}{\sum_j \rho_j} \sum_{i=1}^{n} \rho_i \prod_{k=1}^{d} \frac{1}{w_k} K\!\left( \frac{x^{(k)} - x_i^{(k)}}{w_k} \right).

This may be iterated to obtain an adaptive kernel estimator, as discussed earlier.
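A one-dimensional sketch of the weighted estimators above, in Python with NumPy. The two-process mixture, the relative weight 0.5, and the Gaussian kernel with fixed width are all hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical MC sample: two processes; twice the luminosity was simulated
# for the second, so its events carry weight 0.5 relative to the first.
x1 = rng.normal(0.0, 1.0, size=1000)   # process 1, weight 1.0
x2 = rng.normal(3.0, 1.0, size=2000)   # process 2, weight 0.5
x = np.concatenate([x1, x2])
rho = np.concatenate([np.ones(1000), np.full(2000, 0.5)])

# Weighted sample mean and variance (the 1-d case of the weighted covariance).
mu = np.sum(rho * x) / np.sum(rho)
var = np.sum(rho * (x - mu) ** 2) / np.sum(rho)

w = 0.3  # kernel width; in practice chosen by an optimization criterion

def p0(t):
    """Weighted Gaussian kernel density estimate at points t."""
    t = np.atleast_1d(t)[:, None]
    K = np.exp(-0.5 * ((t - x[None, :]) / w) ** 2) / np.sqrt(2.0 * np.pi)
    return (K * rho[None, :]).sum(axis=1) / (w * rho.sum())
```

With these weights the two processes contribute equally to the model, so the estimate is an even mixture of the two components, normalized to unit area.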
Unfolding

[Another big subject; our treatment will be cursory. Glen Cowan, Statistical Data Analysis, Oxford University Press (1998), devotes a chapter to unfolding.]

We may not be satisfied with merely estimating the density from which our sample {x_i} was drawn. The interesting physics may be obscured by convolution with uninteresting functions, for example efficiency dependences or radiative corrections. We assume the convolution function is known; often it is also estimated via auxiliary measurements. Because data fluctuate, unfolding usually also necessitates smoothing to control the fluctuations, referred to as regularization in this context.
Unfolding: Measurement of R

    R \equiv \frac{\sigma(e^+e^- \to \mathrm{hadrons})}{\sigma(e^+e^- \to \mu^+\mu^-) \ [\text{lowest order QED}]}

[Figure: R versus \sqrt{s} (GeV), showing the \rho, \omega, \phi, J/\psi, \psi(2S), \Upsilon, and Z resonances; R corrected for initial state radiation (from RPP 2006).]
Unfolding Formalism

Typical problem: we sample from a distribution smeared by some kernel function R(x, y):

    o(x) = \int R(x, y) p(y) \, dy.

We are given a sampling \hat o, and wish to estimate p. In principle, the solution is easy:

    p(y) = \int R^{-1}(y, x) \hat o(x) \, dx,

where

    \int R^{-1}(x, y) R(y, x') \, dy = \delta(x - x').

In practice, our observations are discrete, and we need to interpolate/smooth.
Unfolding: Iterative Approach (I)

If we don't know how (or are too lazy) to invert R, we may try an iterative solution. For example, consider the problem of unfolding radiative corrections. The observed cross section \sigma_E(s) is related to the interesting cross section \sigma according to:

    \sigma_E(s) = \sigma(s) + \delta\sigma(s), \quad \text{where} \quad \delta\sigma(s) = \int R(s, s') \sigma(s') \, ds'.

We form an iterative estimate for \sigma(s) according to:

    \hat\sigma_0(s) = \sigma_E(s),
    \hat\sigma_i(s) = \sigma_E(s) - \int R(s, s') \hat\sigma_{i-1}(s') \, ds', \quad i = 1, 2, \ldots

This is just the Neumann series solution to an integral equation!
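A discretized sketch of this iteration in Python with NumPy. The grid, the Gaussian "radiative" kernel, and its overall scale 0.3 (chosen so the kernel norm is below one and the Neumann series converges) are hypothetical:

```python
import numpy as np

# Toy version of the radiative-correction problem on a grid:
# sigma_E = sigma + R @ sigma, with a kernel of norm < 1.
n = 50
s = np.linspace(0.0, 1.0, n)
sigma_true = np.exp(-0.5 * ((s - 0.5) / 0.1) ** 2)

R = np.exp(-((s[:, None] - s[None, :]) / 0.05) ** 2)
R *= 0.3 / R.sum(axis=1, keepdims=True)   # each row sums to 0.3, so ||R|| < 1

sigma_E = sigma_true + R @ sigma_true     # "observed" cross section (no noise)

est = sigma_E.copy()                      # sigma_hat_0 = sigma_E
for _ in range(30):
    est = sigma_E - R @ est               # sigma_hat_i = sigma_E - R sigma_hat_{i-1}

max_err = np.max(np.abs(est - sigma_true))
```

Each pass shrinks the error by the norm of R, so thirty iterations recover the true spectrum to machine precision here; with real, noisy \sigma_E the iteration must be combined with smoothing, as the next slide notes.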
Unfolding: Iterative Approach (II)

Since \sigma_E(s) is measured at discrete s values and with some statistical precision, some smoothing/interpolation is still required.

[Figure: R in the region around charm threshold (from SLAC-PUB-4160).]
Unfolding: Regularization (I)

If we know R^{-1} we can incorporate the smoothing/interpolation more directly. We could use the techniques already described to form a smoothed estimate \hat o, and then use the transformation R^{-1} to obtain the estimator \hat p. For simplicity, consider here the problem of unfolding a histogram. Then we restate the earlier integral formula as:

    o_i = \sum_{j=1}^{k} R_{ij} p_j,

where R is a square matrix, assumed invertible.
Unfolding: Regularization (II)

A popular procedure is to form a likelihood (or \chi^2), but add an extra term, a regulator, to impose smoothing. The modified likelihood is maximized to obtain the estimate for {p_j}:

    \ln L \to \ln L' = \ln L + w S(\hat o).

The regulator function S(\hat o) as usual gets its smoothing effect by being somewhat non-local. A popular choice is a curvature term (with the sign chosen so that maximizing \ln L' minimizes the curvature, hence smooths):

    S(\hat o) = - \sum_{i=2}^{k-1} \left[ (\hat o_{i+1} - \hat o_i) - (\hat o_i - \hat o_{i-1}) \right]^2.
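A least-squares sketch of curvature regularization in Python with NumPy, swapping the likelihood for a \chi^2-style data term so the minimization has a closed form (Tikhonov-type). The smearing matrix, noise level, and regularization weight are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 40
t = np.linspace(0.0, 1.0, k)
p_true = np.exp(-0.5 * ((t - 0.4) / 0.12) ** 2)

# Hypothetical Gaussian smearing matrix R, rows normalized.
R = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.05) ** 2)
R /= R.sum(axis=1, keepdims=True)

o = R @ p_true + rng.normal(0.0, 0.01, size=k)   # smeared, noisy observation

# Second-difference (curvature) operator: (D p)_i = p_{i+1} - 2 p_i + p_{i-1}.
D = np.zeros((k - 2, k))
for i in range(k - 2):
    D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0

def unfold(w):
    """Minimize ||R p - o||^2 + w ||D p||^2, the least-squares analogue of
    maximizing ln L + w S: closed form (R^T R + w D^T D)^-1 R^T o."""
    return np.linalg.solve(R.T @ R + w * D.T @ D, R.T @ o)

p_raw = unfold(0.0)     # pure inversion: noise amplified into wild oscillations
p_reg = unfold(1e-3)    # curvature-regularized: smooth, close to p_true

rough_raw = np.sum((D @ p_raw) ** 2)
rough_reg = np.sum((D @ p_reg) ** 2)
```

Comparing the two solutions demonstrates the point of the next slide: unfolding a smoother without regularization produces huge variances, while even a small curvature penalty tames them at the cost of a little bias.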
Unfolding: Regularization (III)

This is implemented, for example, in the RUN package; see V. Blobel: blobel/wwwrunf.html. Also, GURU: A. Höcker & V. Kartvelishvili, NIM A 372, 469 (1996). For more, see Glen Cowan's Durham conference paper, which has a nice demonstration of the importance of smoothing.

Note that the transformation R itself corresponds to a sort of smoother, as it acts non-locally. The act of unfolding a smoother can produce large variances.
Non-parametric Regression (I)

Regression is the problem of estimating the dependence of some response variable on a predictor variable. Given a dataset of predictor-response pairs {(x_i, y_i), i = 1, ..., n}, we write the relationship as:

    y_i = r(x_i) + \epsilon_i,

where the error \epsilon_i might also depend on x through the parameters of the sampling distribution it represents. We are used to solving this problem with parametric statistics: for example, for the dependence of accelerator background on beam current, we might try a power-law form. We may also bring our non-parametric methods to bear on this problem.
Non-parametric Regression (II)

The sampling of the response-predictor pairs may be a fixed design, in which the x_i values are deliberately selected, or a random design, in which (x_i, y_i) is drawn from some joint PDF. We'll work in the context of the random design here, and also will work in two dimensions. The regression function r may be expressed as:

    r(x) = E[y|x] = \int y \, p(y|x) \, dy = \frac{\int y \, p(x, y) \, dy}{\int p(x, y) \, dy}.
Non-parametric Regression: Local Mean Estimator

Let us construct an estimator for r by substituting a bivariate product kernel estimator for the unknown PDF p(x, y):

    \hat p(x, y) = \frac{1}{n w_x w_y} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{w_x} \right) K\!\left( \frac{y - y_i}{w_y} \right).

Assuming a symmetric kernel, after a little algebra we find:

    \hat r(x) = \sum_{i=1}^{n} y_i K\!\left( \frac{x - x_i}{w_x} \right) \Big/ \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{w_x} \right).

This is known as the local mean estimator. Note the absence of dependence on w_y, and the linearity in the y_i.
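A sketch of the local mean (Nadaraya-Watson) estimator in Python with NumPy; the sinusoidal toy data, Gaussian kernel, and width w_x = 0.05 are hypothetical illustration choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical regression data: y = sin(2 pi x) + Gaussian noise.
n = 500
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=n)

wx = 0.05  # smoothing parameter in the predictor variable

def r_hat(t):
    """Local mean estimate: kernel-weighted average of the y_i.
    Note w_y never appears; it cancels in the ratio."""
    t = np.atleast_1d(t)[:, None]
    K = np.exp(-0.5 * ((t - x[None, :]) / wx) ** 2)
    return (K * y[None, :]).sum(axis=1) / K.sum(axis=1)

est = r_hat(np.array([0.25, 0.75]))  # near the peak and the trough
```

The estimate tracks sin(2\pi x) closely in the interior; its linearity in the y_i makes error propagation straightforward.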
Non-parametric Regression: Local Linear Estimator

We may achieve better properties by considering local polynomial estimators, corresponding to local polynomial fits to the data. This may be achieved with a least-squares minimization (the local mean is the result for a fit to a zero-order polynomial). Thus, the local linear regression estimate is given by:

    \hat r(x) = \frac{ \sum_{i=1}^{n} \left[ S_2(x) - S_1(x)(x_i - x) \right] K\!\left( \frac{x_i - x}{w_x} \right) y_i }{ S_2(x) S_0(x) - S_1(x)^2 },

where

    S_l(x) \equiv \sum_{i=1}^{n} (x_i - x)^l \, K\!\left( \frac{x_i - x}{w_x} \right).
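The same toy setup as for the local mean, now with a sketch of the local linear estimator built directly from the S_l(x) moment sums above (data and kernel width again hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

# Same hypothetical data: y = sin(2 pi x) + Gaussian noise.
n = 500
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
wx = 0.05

def r_hat_ll(t):
    """Local linear estimate via the moment sums S_l(x)."""
    t = np.atleast_1d(t)[:, None]
    d = x[None, :] - t                       # (x_i - x)
    K = np.exp(-0.5 * (d / wx) ** 2)
    S0 = K.sum(axis=1)
    S1 = (d * K).sum(axis=1)
    S2 = (d ** 2 * K).sum(axis=1)
    num = ((S2[:, None] - S1[:, None] * d) * K * y[None, :]).sum(axis=1)
    return num / (S2 * S0 - S1 ** 2)

est = r_hat_ll(np.array([0.0, 0.5]))  # at the boundary and mid-range
```

The "better properties" show up at the edges of the data: the local mean is biased at the boundary (it only averages points to one side), while the local linear fit adapts to the slope there.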
Examples: Local Linear Regression

[Figures: local linear regression examples (lowess), including the number of BaBar documents per year and tree height versus diameter.]
sPlots

The use of the density estimation technique known as sPlots has gained popularity in BaBar (and perhaps elsewhere?).

- Multivariate: uses the distribution on a subset of variables to predict the distribution in another subset.
- Based on a (parametric) model in the predictor variables, with different categories (e.g., "signal" and "background").
- Provides a means to visualize agreement with the model for each category.
- Provides an easy way to do background subtraction.
sPlots Formalism (I)

Assume a total of r + p parameters in the overall fit to the data:

- the expected number of events, N_j, j = 1, ..., r, in each category;
- distribution parameters, {\theta_i, i = 1, ..., p}.

We use a total of N events to estimate these parameters via a maximum likelihood fit to the sample {x}. We wish to find weights w_j(x_i), depending only on a subset {x'} \subset {x} (and implicitly on the unknown parameters), such that the asymptotic distribution of the weighted events in a variable y not in {x'} is the sampling distribution in y, for any chosen category j. We assume that y and {x'} are statistically independent within each category.
sPlots Formalism (II)

The weights which satisfy our criterion and produce minimum variance summed over the histogram are given by:

    w_j(e) = \frac{ \sum_{k=1}^{r} V_{jk} f_k(x_e) }{ \sum_{k=1}^{r} N_k f_k(x_e) },

where:

- w_j(e) is the weight for event e in category j;
- V is the covariance matrix from a "reduced" fit (i.e., excluding y):

      (V^{-1})_{jk} = \sum_{e=1}^{N} \frac{ f_j(x_e) f_k(x_e) }{ \left[ \sum_{i=1}^{r} N_i f_i(x_e) \right]^2 };

- N_k is the estimate of the number of events in category k, according to the reduced fit;
- f_j(x_e) is the PDF for category j evaluated at x_e.
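A sketch of these formulas in Python with NumPy for r = 2 categories. The toy PDFs, yields, and the simple EM iteration standing in for the reduced maximum likelihood fit are all hypothetical choices, not the sPlot paper's machinery:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical two-category sample: signal and background separated in x,
# with a control variable y independent of x within each category.
n_sig, n_bkg = 1000, 2000
x = np.concatenate([rng.normal(0.0, 0.5, n_sig), rng.uniform(-3.0, 3.0, n_bkg)])
y = np.concatenate([rng.normal(2.0, 1.0, n_sig), rng.normal(0.0, 1.0, n_bkg)])
n = len(x)

# Category PDFs in the discriminating variable x (known shapes here).
f_sig = np.exp(-0.5 * (x / 0.5) ** 2) / (0.5 * np.sqrt(2.0 * np.pi))
f_bkg = np.full_like(x, 1.0 / 6.0)

# "Reduced" ML fit for the yields, via EM iterations on the signal fraction.
fs = 0.5
for _ in range(200):
    fs = np.mean(fs * f_sig / (fs * f_sig + (1.0 - fs) * f_bkg))
N = np.array([fs * n, (1.0 - fs) * n])    # fitted yields (signal, background)

# Covariance of the yields: (V^-1)_jk = sum_e f_j f_k / D_e^2.
F = np.vstack([f_sig, f_bkg])             # shape (2, n_events)
D = N @ F                                 # D_e = sum_i N_i f_i(x_e)
V = np.linalg.inv((F / D) @ (F / D).T)

# sWeights: w_j(e) = sum_k V_jk f_k(x_e) / D_e.
W = (V @ F) / D                           # shape (2, n_events)

sum_w_sig = W[0].sum()                    # reproduces the fitted signal yield
y_sig_mean = np.sum(W[0] * y) / W[0].sum()  # background-subtracted mean of y
```

Filling a y histogram with weights W[0] gives the signal sPlot; the per-bin statistical error is then estimated from the sum of squared weights in each bin, as discussed on the errors slide.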
sPlots Formalism (III)

Finally, the sPlot is constructed by adding each event e with y = y_e to the y histogram (or scatter plot, etc., if y is multivariate), with weight w_j(e). The resulting histogram is then an estimator for the true distribution in y for category j.
sPlot Example (I)

Example from the BaBar B^0 \to \pi^+\pi^-, K^+\pi^- analysis, showing a comparison of the sPlot method (right figure) with a plot enhanced in signal fraction by a cut on the likelihood (left figure).

[Figure 1: Signal distribution of the \Delta E variable, events / 10 MeV versus \Delta E (GeV). The left figure is obtained by applying a cut on the likelihood ratio to enrich the data sample in signal events (about 60% of the signal is kept). The right figure shows the sPlot for signal (all events are kept). From: M. Pivk, sPlot: A Quick Introduction, arXiv:physics/ ]
sPlot Example (II)

[Figures: sPlots of m_ES (GeV/c^2) and of branching fractions versus m_{K\pi\pi} (GeV/c^2) for the modes B^+ \to K^+\pi^-\pi^+\gamma, B^0 \to K^+\pi^-\pi^0\gamma, B^0 \to K^0\pi^-\pi^+\gamma (K_S\pi^-\pi^+\gamma), and B^+ \to K^0\pi^+\pi^0\gamma (K_S\pi^+\pi^0\gamma). From hep-ex/ ]
sPlot Errors

Typically the sPlot variance in a bin is estimated simply as the sum of the squares of the weights. This sometimes leads to visually misleading impressions, due to fluctuations on small statistics.

- If the plot is being made for a distribution for which there is a prediction, then that distribution can be used to estimate the expected uncertainties, and these can be plotted.
- If there is no prediction, it is more difficult, but a (smoothed) estimate from the empirical distribution may be used to estimate the expected errors.
Summary (I)

We have looked at several topics concerning density estimation:

- EPDF, histograms, ideograms
- Kernel and series estimators
- Optimization considerations
- Error analysis
- Multivariate issues
- Unfolding
- Non-parametric regression
- sPlots
Summary (II)

We also left many things out; a few examples:

- Time series analyses
- Boundary complications
- Dimension reduction

Next: see Ilya Narsky's lectures on machine learning and the classification/discrimination problems. Monday, August 28 - Friday, September 1; Panofsky Auditorium, 10:00.
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationLecture 6. Regression
Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationAlternatives to Basis Expansions. Kernels in Density Estimation. Kernels and Bandwidth. Idea Behind Kernel Methods
Alternatives to Basis Expansions Basis expansions require either choice of a discrete set of basis or choice of smoothing penalty and smoothing parameter Both of which impose prior beliefs on data. Alternatives
More informationStatistical Data Analysis Stat 3: p-values, parameter estimation
Statistical Data Analysis Stat 3: p-values, parameter estimation London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway,
More informationNonparametric Methods Lecture 5
Nonparametric Methods Lecture 5 Jason Corso SUNY at Buffalo 17 Feb. 29 J. Corso (SUNY at Buffalo) Nonparametric Methods Lecture 5 17 Feb. 29 1 / 49 Nonparametric Methods Lecture 5 Overview Previously,
More informationStudies of charmonium production in e + e - annihilation and B decays at BaBar
Studies of charmonium production in e + e - annihilation and B decays at BaBar I. Garzia, INFN Sezione di Ferrara On behalf of the BaBar Collaboration XVI International Conference on Hadron Spectroscopy
More informationLecture 2: Linear regression
Lecture 2: Linear regression Roger Grosse 1 Introduction Let s ump right in and look at our first machine learning algorithm, linear regression. In regression, we are interested in predicting a scalar-valued
More informationCOMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017
COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University OVERVIEW This class will cover model-based
More informationPhysics 509: Error Propagation, and the Meaning of Error Bars. Scott Oser Lecture #10
Physics 509: Error Propagation, and the Meaning of Error Bars Scott Oser Lecture #10 1 What is an error bar? Someone hands you a plot like this. What do the error bars indicate? Answer: you can never be
More informationFrank Porter February 26, 2013
116 Frank Porter February 26, 2013 Chapter 6 Hypothesis Tests Often, we want to address questions such as whether the possible observation of a new effect is really significant, or merely a chance fluctuation.
More informationMultivariate Distribution Models
Multivariate Distribution Models Model Description While the probability distribution for an individual random variable is called marginal, the probability distribution for multiple random variables is
More informationRandom Eigenvalue Problems Revisited
Random Eigenvalue Problems Revisited S Adhikari Department of Aerospace Engineering, University of Bristol, Bristol, U.K. Email: S.Adhikari@bristol.ac.uk URL: http://www.aer.bris.ac.uk/contact/academic/adhikari/home.html
More informationWinter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo
Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte
More informationUnfolding Methods in Particle Physics
Unfolding Methods in Particle Physics Volker Blobel University of Hamburg, Hamburg, Germany 1 Inverse problems Abstract Measured distributions in particle physics are distorted by the finite resolution
More informationBaBar s Contributions. Veronique Ziegler Jefferson Lab
BaBar s Contributions Veronique Ziegler Jefferson Lab BaBar last day of running charm light quark heavy quark 2 3 _ Before 2003, 4 cs mesons known 2 S-wave mesons, D s (J P = 0 ) and D s (1 ) 2 P-wave
More informationNew Measurements of ψ(3770) Resonance Parameters & DD-bar Cross Section at BES-II & CLEO-c
New Measurements of ψ(3770) Resonance Parameters & DD-bar Cross Section at BES-II & CLEO-c The review talk is based on the talks given at ICHEP 04 by Anders Ryd, Ian Shipsey, Gang Rong 1 Outline Introduction
More information1 Data Arrays and Decompositions
1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is
More informationconditional cdf, conditional pdf, total probability theorem?
6 Multiple Random Variables 6.0 INTRODUCTION scalar vs. random variable cdf, pdf transformation of a random variable conditional cdf, conditional pdf, total probability theorem expectation of a random
More informationMeasurements of the Proton and Kaon Form Factors via ISR at BABAR
Measurements of the Proton and Kaon Form Factors via ISR at BABAR Fabio Anulli INFN Sezione di Roma on behalf of the BABAR Collaboration HADRON 015 XVI International Conference on Hadron Spectroscopy 13
More informationLecture 3: Central Limit Theorem
Lecture 3: Central Limit Theorem Scribe: Jacy Bird (Division of Engineering and Applied Sciences, Harvard) February 8, 003 The goal of today s lecture is to investigate the asymptotic behavior of P N (
More informationD D Shape. Speaker: Yi FANG for BESIII Collaboration. The 7th International Workshop on Charm Physics May 18-22, 2015 Detroit, Michigan
D D Shape Speaker: Yi FANG for BESIII Collaboration The 7th International Workshop on Charm Physics May 18-22, 2015 Detroit, Michigan Y. Fang (IHEP) D D Shape Charm 2015 1 / 16 Outline 1 Introduction 2
More informationMA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems
MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationSome Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model
Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;
More informationManifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA
Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inria.fr http://perception.inrialpes.fr/
More informationHypothesis testing:power, test statistic CMS:
Hypothesis testing:power, test statistic The more sensitive the test, the better it can discriminate between the null and the alternative hypothesis, quantitatively, maximal power In order to achieve this
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More information