Density Estimation (II)


Yesterday:
- Overview and issues
- Histograms
- Kernel estimators
- Ideogram

Today:
- Further development of optimization
- Estimating variance and bias
- Adaptive kernels
- Multivariate kernel estimation

Frank Porter, SLUO Lectures on Statistics, 15-17 August 2006

Some References (I)

- Richard A. Tapia and James R. Thompson, Nonparametric Density Estimation, Johns Hopkins University Press, Baltimore (1978).
- David W. Scott, Multivariate Density Estimation, John Wiley & Sons, Inc., New York (1992).
- Adrian W. Bowman and Adelchi Azzalini, Applied Smoothing Techniques for Data Analysis, Clarendon Press, Oxford (1997).
- B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman and Hall (1986); http://nedwww.ipac.caltech.edu/level5/march02/silverman/silver_contents.html
- K. S. Cranmer, Kernel Estimation in High Energy Physics, Comp. Phys. Comm. 136, 198 (2001) [hep-ex/0011057v1]; http://arxiv.org/ps_cache/hep-ex/pdf/0011/0011057.pdf

Some References (II)

- M. Pivk and F. R. Le Diberder, sPlot: a statistical tool to unfold data distributions, Nucl. Instr. Meth. A 555, 356 (2005).
- R. Cahn, How sPlots are Best (2005), http://babar-hn.slac.stanford.edu:5090/hn/aux/auxvol01/rncahn/rev_splots_best.pdf
- BaBar Statistics Working Group, Recommendations for Display of Projections in Multi-Dimensional Analyses, http://www.slac.stanford.edu/bfroot/www/physics/analysis/Statistics/Documents/MDgraphRec.pdf

Additional specific references will be noted in the course of the lectures.

Optimization (continued)

The MSE (discussed yesterday) for a density is a measure of uncertainty at a single point. It is useful to summarize the uncertainty over all points in a single quantity. We wish to establish a notion of the distance from the estimate $\hat p(x)$ to the true density $p(x)$. This is a familiar subject for physicists; we are just dealing with normed vector spaces. The choice of norm is a bit arbitrary; the obvious extremes are:

$\|\hat p - p\|_{L_\infty} \equiv \sup_x |\hat p(x) - p(x)|$

$\|\hat p - p\|_{L_1} \equiv \int |\hat p(x) - p(x)| \, dx$

We will use (following the crowd; for convenience, not necessarily because it is "best") the $L_2$ norm, or more precisely the Integrated Squared Error (ISE):

$\mathrm{ISE} \equiv \int [\hat p(x) - p(x)]^2 \, dx$

MISE

In fact, the ISE is still a difficult beast, as it depends on the true density, the estimator, and the sampled data. We may remove this last dependence by evaluating the Mean Integrated Squared Error (MISE):

$\mathrm{MISE} \equiv E[\mathrm{ISE}] = E\left[ \int [\hat p(x) - p(x)]^2 \, dx \right] = \int E\left[ (\hat p(x) - p(x))^2 \right] dx = \int \mathrm{MSE}[\hat p(x)] \, dx \equiv \mathrm{IMSE}$

(Exercise: prove MISE = IMSE.)
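As a concrete illustration (not part of the original lecture), the sketch below estimates the ISE of a Gaussian kernel density estimate against an assumed known sampling density on a grid, and averages it over repeated samples to approximate the MISE; the density, bandwidth, grid, and sample sizes are all illustrative choices.

```python
# Sketch (not from the lecture): numerical ISE and MISE for a Gaussian kernel
# density estimate against an assumed known sampling density.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_pdf = stats.norm(loc=8.0, scale=2.0)      # assumed "true" density p(x)
grid = np.linspace(0.0, 16.0, 801)
dx = grid[1] - grid[0]

def ise(sample, bandwidth):
    """Integrated squared error of a Gaussian KDE with the given bandwidth."""
    kde = stats.gaussian_kde(sample, bw_method=bandwidth / sample.std(ddof=1))
    diff = kde(grid) - true_pdf.pdf(grid)
    return np.sum(diff**2) * dx                # simple Riemann-sum integral

# MISE = E[ISE], estimated by averaging the ISE over many independent samples.
n, n_trials, w = 200, 100, 0.8
mise = np.mean([ise(true_pdf.rvs(n, random_state=rng), w) for _ in range(n_trials)])
print(f"estimated MISE at w = {w}: {mise:.4f}")
```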

Consistency

A desirable property of an estimator is that the error decreases as the number of samples increases (a familiar notion from parametric statistics).

Def: A density estimator $\hat p(x)$ is consistent if

$\mathrm{MSE}[\hat p(x)] \equiv E\left[ (\hat p(x) - p(x))^2 \right] \to 0$ as $n \to \infty$.

Optimal Histograms? (I)

Noting that the bin contents of a histogram are binomially distributed, one can show (exercise!) that, for the histogram density estimator $\hat p(x) = h(x)/nw$:

$\mathrm{Var}[\hat p(x)] \approx \frac{p(x_j)}{nw}, \qquad |\mathrm{Bias}[\hat p(x)]| \le \gamma_j w$,

where:
- $x$ lies in bin $j$, and $x_j$ is defined (and exists by the mean value theorem) by $\int_{\mathrm{bin}\,j} p(x) \, dx = w \, p(x_j)$;
- $\gamma_j$ is a positive constant (existing by assumption) such that $|p(x) - p(x_j)| < \gamma_j |x - x_j|$ for all $x$ in bin $j$;
- equality is approached as the probability to be in bin $j$ decreases (e.g., by decreasing the bin size).

Optimal Histograms? (II)

Thus,

$\mathrm{MSE}[\hat p(x)] = E\left[ (\hat p(x) - p(x))^2 \right] \lesssim \frac{p(x_j)}{nw} + \gamma_j^2 w^2$.

Thm: The histogram estimator $\hat p(x) = h(x)/nw$ is consistent (in MSE) if the bin width $w \to 0$ as $n \to \infty$ such that $nw \to \infty$.

Note that the $w \to 0$ requirement ensures that the bias approaches zero, according to our earlier discussion. The $nw \to \infty$ requirement ensures that the variance asymptotically vanishes.

Thm: The MSE(x) bound above is minimized when

$w = w^*(x) = \left[ \frac{p(x_j)}{2 \gamma_j^2 n} \right]^{1/3}$.
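These scalings are easy to check numerically. The sketch below (my own illustration, with an assumed Gaussian sampling density and arbitrary numbers) simulates the histogram estimator $\hat p(x) = h(x)/nw$ at a fixed point, comparing its variance to $p(x_j)/nw$ and watching the bias grow with the bin width $w$.

```python
# Sketch: Monte Carlo check of Var[p_hat(x)] ~ p(x_j)/(n*w) and of the bias growing
# with w, for the histogram estimator p_hat(x) = h(x)/(n*w). Illustrative choices only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_pdf = stats.norm(8.0, 2.0)
x0, n, n_trials = 9.1, 500, 2000            # evaluation point, sample size, pseudo-experiments

for w in (0.2, 0.5, 1.0, 2.0):
    edges = np.arange(0.0, 16.0 + w, w)
    estimates = []
    for _ in range(n_trials):
        sample = true_pdf.rvs(n, random_state=rng)
        h, _ = np.histogram(sample, bins=edges)
        j = np.searchsorted(edges, x0) - 1   # index of the bin containing x0
        estimates.append(h[j] / (n * w))
    estimates = np.array(estimates)
    print(f"w={w:3.1f}  var={estimates.var():.2e}  "
          f"p(x0)/(n*w)={true_pdf.pdf(x0) / (n * w):.2e}  "
          f"bias={estimates.mean() - true_pdf.pdf(x0):+.3e}")
```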

Optimal Histograms? (III)

Notes:
- The optimal bin size decreases as $1/n^{1/3}$.
- The $1/n$ dependence of the variance is our familiar result for Poisson statistics.
- The optimal bin size depends on the value of the density in the bin. This suggests adaptive binning... [However, Scott: "...in practice there are no reliable algorithms for constructing adaptive histogram meshes."]
- Alternatively, the MISE is minimized (Gaussian kernel, asymptotically, for normally distributed data...) when

$w = \left( \frac{4}{3} \right)^{1/5} \sigma \, n^{-1/5}$.
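This normal-reference rule is trivial to evaluate on a sample; the short sketch below (my own illustration, not the lecture's code) computes it for a few sample sizes.

```python
# Sketch: the normal-reference rule w = (4/3)^(1/5) * sigma * n^(-1/5) quoted above.
import numpy as np

def normal_reference_bandwidth(sample):
    return (4.0 / 3.0) ** 0.2 * np.std(sample, ddof=1) * len(sample) ** -0.2

rng = np.random.default_rng(2)
for n in (100, 1000, 10000):
    sample = rng.normal(8.0, 2.0, size=n)
    print(f"n = {n:6d}: w ~ {normal_reference_bandwidth(sample):.3f}")
```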

Optimal Histograms? (IV)

[Figure: histograms of the same sample at several constant bin widths, with the sampling PDF overlaid in red.]

The standard rules (Sturges, Scott, Freedman-Diaconis) correspond roughly to the coarser binning above. Impression: the standard rules for selecting constant bin widths tend to be lower limits on the optimum.
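These standard rules are available off the shelf, e.g. through numpy's histogram_bin_edges. The sketch below (illustrative sample, my own example) compares the bin counts and widths they produce.

```python
# Sketch: comparing the Sturges, Scott and Freedman-Diaconis bin-width rules
# using numpy's built-in estimators (illustrative Gaussian sample).
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(8.0, 2.0, size=2000)

for rule in ("sturges", "scott", "fd"):
    edges = np.histogram_bin_edges(sample, bins=rule)
    print(f"{rule:8s}: {len(edges) - 1:3d} bins, width ~ {edges[1] - edges[0]:.3f}")
```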

Choice of Kernel

Often, the form of the kernel used doesn't matter very much:

[Figure: kernel estimates of the same sample with a Gaussian kernel (black), rectangular kernel (blue), cosine kernel (turquoise), and triangular kernel (green), with the sampling PDF in red.]
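One way to see this is to fit the same sample with several kernel shapes. The sketch below uses scikit-learn's KernelDensity, whose "tophat", "cosine", and "linear" kernels correspond roughly to the rectangular, cosine, and triangular kernels in the figure; the sample and bandwidth are illustrative choices, not the lecture's.

```python
# Sketch: the same sample fit with several kernel shapes via scikit-learn.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(4)
sample = rng.normal(8.0, 2.0, size=1000).reshape(-1, 1)
grid = np.linspace(0.0, 16.0, 200).reshape(-1, 1)

for kernel in ("gaussian", "tophat", "cosine", "linear"):
    kde = KernelDensity(kernel=kernel, bandwidth=0.6).fit(sample)
    density = np.exp(kde.score_samples(grid))   # score_samples returns the log density
    print(f"{kernel:8s}: peak of estimate ~ {density.max():.3f}")
```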

Application of the Bootstrap to Density Estimation

A means of evaluating how much to trust our density estimate.

Bootstrap algorithm:
1. Form the density estimate $\hat p$ from the data $\{x_i;\ i = 1, \ldots, n\}$.
2. Resample (uniformly) $n$ values from $\{x_i;\ i = 1, \ldots, n\}$, with replacement, obtaining the bootstrap data $\{x_i^*;\ i = 1, \ldots, n\}$.
3. Form the density estimate $\hat p^*$ from the bootstrap data.
4. Repeat steps 2 and 3 many ($N$) times to obtain a family of bootstrap density estimates $\{\hat p_i^*;\ i = 1, \ldots, N\}$.
5. The distribution of the $\hat p_i^*$ about $\hat p$ mimics the distribution of $\hat p$ about $p$.
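A minimal sketch of this algorithm for a Gaussian kernel estimator (my own illustration; the sampling density, sample size, and number of bootstrap replicas are arbitrary choices):

```python
# Sketch of the bootstrap algorithm above for a Gaussian kernel density estimate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = stats.norm(8.0, 2.0).rvs(500, random_state=rng)
grid = np.linspace(0.0, 16.0, 200)

p_hat = stats.gaussian_kde(data)(grid)                   # step 1: estimate from the data

boot_curves = []
for _ in range(200):                                     # steps 2-4: bootstrap replicas
    resample = rng.choice(data, size=data.size, replace=True)
    boot_curves.append(stats.gaussian_kde(resample)(grid))
boot_curves = np.array(boot_curves)

# Step 5: the spread of the replicas about p_hat mimics the spread of p_hat about p.
pointwise_std = boot_curves.std(axis=0)
print(f"typical pointwise std of the estimate: {pointwise_std.mean():.4f}")
```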

Does the Bootstrap Distribution Really Work?

Not quite. For the kernel density estimator one finds (exercise) that the resampling expectation is

$E^*[\hat p^*(x)] = E^*[k(x - x_i^*; w)] = \hat p(x)$.

Thus, the bootstrap distribution about $\hat p$ does not reproduce the bias which may be present in $\hat p$ about $p$. However, it does properly reproduce the variance of $\hat p$.

[Figure: sampling PDF (red), Gaussian kernel estimator (black), and resampled kernel estimators (other colors).]

Estimating Bias: The Jackknife (I)

The idea: the bias depends on the sample size, and we assume it vanishes asymptotically. We can therefore use the data to estimate the dependence of the bias on sample size.

Jackknife algorithm:
1. Divide the data into k random disjoint subsamples.
2. Evaluate the estimator for each subsample.
3. Compare the average of the estimates on the subsamples with the estimator based on the full dataset.
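A minimal sketch of this recipe, with a Gaussian kernel estimator and illustrative choices (k = 4 subsamples of 500 events, echoing the example on the next slide):

```python
# Sketch of the jackknife bias check above, using a Gaussian kernel density estimate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
data = stats.norm(8.0, 2.0).rvs(2000, random_state=rng)
grid = np.linspace(0.0, 16.0, 200)
k = 4

full_estimate = stats.gaussian_kde(data)(grid)           # estimator on the full dataset

perm = rng.permutation(data)                             # (1) random disjoint subsamples
subsamples = np.array_split(perm, k)
sub_estimates = [stats.gaussian_kde(s)(grid) for s in subsamples]   # (2)

# (3) the average subsample estimate minus the full-sample estimate reflects
# how the bias changes with sample size.
bias_indicator = np.mean(sub_estimates, axis=0) - full_estimate
print(f"max |subsample average - full estimate|: {np.abs(bias_indicator).max():.4f}")
```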

Estimating Bias: The Jackknife (II)

[Figure: histogram of 2000 samples with the sampling PDF (red), the Gaussian kernel estimator on the full dataset (black), and the kernel jackknife estimators (other colors; each is on 500 samples, to be averaged together for the jackknife).]

Jackknife techniques may also be used to reduce bias (see Scott).

Cross-validation (I)

Similar to the jackknife, but different in intent, is the cross-validation procedure (see also Ilya Narsky's lectures). In density estimation, cross-validation is used to optimize the smoothing (bandwidth) selection. It improves on purely theoretical optimizations by making use of the actual sampling distribution, via the available samples.

Cross-validation (II)

Basic method ("leave-one-out cross-validation"): form $n$ subsets of the dataset, each one leaving out a different datum. Use the subscript $-i$ to denote the subset omitting datum $x_i$. For the density estimator $\hat p(x; w)$, evaluate the following average over these subsets:

$\frac{1}{n} \sum_{i=1}^{n} \int \hat p_{-i}^2(x) \, dx \;-\; \frac{2}{n} \sum_{i=1}^{n} \hat p_{-i}(x_i)$.

[Exercise: show that the expectation of this quantity is the MISE of $\hat p$ for $n-1$ samples, except for a term which is independent of $w$.]

Evaluate the dependence of this quantity on the smoothing parameter $w$, and select the value $w^*$ for which it is minimized.
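A direct, unoptimized sketch of this criterion for a Gaussian kernel estimator (illustrative sample and bandwidth grid; the integral is done with a simple Riemann sum):

```python
# Sketch: leave-one-out cross-validation score, scanned over a grid of bandwidths.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = stats.norm(8.0, 2.0).rvs(200, random_state=rng)
grid = np.linspace(-2.0, 18.0, 1001)
dx = grid[1] - grid[0]

def cv_score(w):
    n = len(data)
    term1 = term2 = 0.0
    for i in range(n):
        subset = np.delete(data, i)
        kde = stats.gaussian_kde(subset, bw_method=w / subset.std(ddof=1))
        term1 += np.sum(kde(grid) ** 2) * dx   # integral of p_{-i}(x)^2
        term2 += kde(data[i:i + 1])[0]         # p_{-i}(x_i)
    return term1 / n - 2.0 * term2 / n

bandwidths = np.linspace(0.3, 1.5, 7)
scores = [cv_score(w) for w in bandwidths]
print(f"CV-optimal bandwidth ~ {bandwidths[int(np.argmin(scores))]:.2f}")
```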

Cross-validation (III)

[Figure, left: the cross-validation estimate of the MISE versus the smoothing parameter w, with the minimizing value w* marked; the smoothing parameter is chosen to minimize the estimated MISE. Right: the actual PDF (red), the kernel estimate using the smoothing from the cross-validation optimization (blue), and the kernel estimate with a default value for the smoothing (dots).]

Adaptive Kernel Estimation (I)

We saw in our discussion of histograms that variable bin widths are probably closer to optimal. This applies to other kernels as well. Indeed, the use of a fixed smoothing parameter, deduced from all of the data, introduces a non-local, hence parametric, aspect into the estimation. It is more consistent to look for smoothing which depends on the data locally. This is adaptive kernel estimation.

Adaptive Kernel Estimation (II)

We argue that the more data there is in a region, the better that region can be estimated. Thus, in regions of high density, we should use narrower smoothing. In Poisson statistics (e.g., histogram binning), the relative uncertainty scales as

$\frac{\sqrt{N}}{N} = \frac{1}{\sqrt{N}} \propto \frac{1}{\sqrt{p(x)}}$.

Thus, in the region containing $x_i$, the smoothing parameter should be

$w(x_i) = \frac{w^*}{\sqrt{p(x_i)}}$.

Two issues: (1) What is $w^*$? (2) We don't know $p(x)$!

Adaptive Kernel Estimation (III)

For $p(x)$, we may try substituting our fixed-kernel estimator, call it $\hat p_0(x)$. For $w^*$, we use dimensional analysis:

$D[w(x_i)] = D[x], \quad D[p(x)] = D[1/x] \;\Rightarrow\; D[w^*] = D[\sqrt{x}] = D[\sqrt{\sigma}]$.

Then, e.g., using the MISE-optimized choice earlier, we iterate on our fixed-kernel estimator to obtain

$\hat p_1(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{w_i} K\!\left( \frac{x - x_i}{w_i} \right)$,

where

$w_i = w(x_i) = \left( \frac{4}{3} \right)^{1/5} \sqrt{\frac{\rho \, \sigma}{\hat p_0(x_i)}} \; n^{-1/5}$.

Here $\rho$ is a factor which may be further optimized, or typically set to one. The iteration on the fixed-kernel estimator nearly removes the dependence on our initial choice of $w$. [See the references for a discussion of boundaries.]
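A sketch of this two-stage recipe with a Gaussian kernel and $\rho = 1$ (my own illustration; the pilot estimate $\hat p_0$ here is scipy's default fixed-bandwidth KDE rather than the exact MISE-optimized choice):

```python
# Sketch: adaptive (two-stage) kernel estimate with per-point bandwidths
# w_i = (4/3)^(1/5) * sqrt(rho*sigma/p0(x_i)) * n^(-1/5), Gaussian kernel, rho = 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
data = stats.norm(8.0, 2.0).rvs(1000, random_state=rng)
n, sigma, rho = len(data), data.std(ddof=1), 1.0

# Stage 1: fixed-bandwidth pilot estimate p0, evaluated at the data points.
p0 = stats.gaussian_kde(data)(data)

# Stage 2: per-point bandwidths and the adaptive estimate p1.
w_i = (4.0 / 3.0) ** 0.2 * np.sqrt(rho * sigma / p0) * n ** -0.2

def p1(x):
    x = np.atleast_1d(x)[:, None]                       # shape (n_eval, 1)
    kernels = stats.norm.pdf((x - data) / w_i) / w_i    # broadcast over the data points
    return kernels.mean(axis=1)

print(np.round(p1(np.linspace(0.0, 16.0, 5)), 4))
```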

KEYS Adaptive Kernel Estimation

The KEYS ("Kernel Estimating Your Shapes") PDF is a package for adaptive kernel estimation; see the K. S. Cranmer reference.
http://java.freehep.org/jcvslet/jcvslet/diff/freehep/freehep/hep/aida/ref/pdf/nonparametricpdf.java/1.1/1.2

Implementation in RooFit:

[Figure 9: Non-parametric p.d.f.s. Left: histogram of unbinned input data; middle: histogram-based p.d.f. (2nd-order interpolation); right: KEYS p.d.f. from the original unbinned input data.]

From: W. Verkerke and D. Kirkby, RooFit Users Manual V2.07, http://roofit.sourceforge.net/docs/roofit_Users_Manual_2.07-29.pdf

KEYS (II): Example

Figure comparing the KEYS adaptive kernel estimator (dashed curve) with the true PDF (solid curve):

[Figure 12: A comparison of the histogram of 1000 events (data points) with the underlying PDF used to generate these events (solid curve) and a non-parametric KEYS model of these events (dashed curve). Axes: events / 0.02 GeV versus reconstructed candidate mass (GeV). From BaBar BAD 18v14 (internal document, with permission...).]

Multivariate Kernel Estimation (I)

Besides the curse of dimensionality, the multi-dimensional case introduces the complication of covariance. When using a product kernel, the local estimator has a diagonal covariance matrix. In principle, we could apply a local linear transformation of the data to a coordinate system with diagonal covariance matrices. This amounts to a non-linear transformation of the data in a global sense, and may not be straightforward. We can at least work in the system for which the overall covariance matrix of the data is diagonal.

Multivariate Kernel Estimation (II)

If $\{y_i\}$ is the suitably diagonalized data, the product fixed-kernel estimator in $d$ dimensions is

$\hat p_0(y) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{w_j} K\!\left( \frac{y^{(j)} - y_i^{(j)}}{w_j} \right)$,

where $y^{(j)}$ denotes the $j$-th component of the vector $y$. The asymptotic, normal MISE-optimized smoothing parameters are now

$w_j = \sigma_j \left( \frac{4}{d+2} \right)^{1/(d+4)} n^{-1/(d+4)}$.

The corresponding adaptive kernel estimator follows the discussion as for the univariate case. An issue in the scaling for the adaptive bandwidth arises when the multivariate data are approximately sampled from a lower dimensionality than the dimension $d$. See the references.
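A sketch of this product fixed-kernel estimator with the per-dimension bandwidths above, using a Gaussian kernel on already-decorrelated (diagonal-covariance) toy data; all numerical choices are illustrative.

```python
# Sketch: d-dimensional product Gaussian kernel estimate with per-dimension
# normal-reference bandwidths w_j = sigma_j * (4/(d+2))^(1/(d+4)) * n^(-1/(d+4)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, d = 2000, 2
y = rng.multivariate_normal([8.0, 6.0], [[4.0, 0.0], [0.0, 1.0]], size=n)  # (n, d)

sigma = y.std(axis=0, ddof=1)
w = sigma * (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))

def p0(points):
    points = np.atleast_2d(points)                         # (m, d) evaluation points
    u = (points[:, None, :] - y[None, :, :]) / w           # (m, n, d) scaled distances
    kernels = stats.norm.pdf(u).prod(axis=2) / w.prod()    # product kernel per data point
    return kernels.mean(axis=1)

print(np.round(p0([[8.0, 6.0], [10.0, 7.0]]), 4))
```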

Multivariate Kernel Estimation: Example (I)

An example in which the sampling distribution has a diagonal covariance matrix (locally and globally).

[Figure: scatter plot of the two-dimensional sample, and perspective plots of the estimated density function with the default smoothing (w) and with w/2 smoothing.]

Multivariate Kernel Estimation: Example (II)

An example in which the sampling distribution has a non-diagonal covariance matrix.

[Figure: scatter plot of the correlated two-dimensional sample, and perspective plots of the estimated density function with the default smoothing (w), with w/2 smoothing, and with an intermediate smoothing.]

Summary

We have looked at:
- Optimization
- Cross-validation
- Adaptive kernels
- Error estimation: variance (bootstrap) and bias (jackknife)
- Multivariate kernel estimation

A complete analysis puts these ingredients together.

Next: series estimation; Monte Carlo weighting; unfolding; non-parametric regression; sPlots.