
Nonparametric Density Estimation
Econ 690, Purdue University
Justin L. Tobias

Density Estimation

Suppose that you had some data, say on wages, and you wanted to estimate the density of these data. How might you do this?

1. Histogram approach.
2. Assume a class of densities, say the normal family, and estimate the parameters via MLE.

Both of these are reasonable, yet each can be argued to be deficient in some respects.

Density Estimation

Let's take a closer look at the histogram method. We first divide the space into, say, m bins of length 2h. The histogram estimate is then defined as:

\hat{f}(x_0) = \frac{1}{2nh} (\text{number of } x_i \text{ in the same bin as } x_0).

Clearly, this is a proper density estimate, since:

\int \hat{f}(x)\,dx = \frac{1}{2nh}\left[2h(\text{number of } x_i \text{ in bin } 1) + \cdots + 2h(\text{number of } x_i \text{ in bin } m)\right] = \frac{1}{2nh} \cdot 2hn = 1.
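As a concrete illustration, here is a minimal Python sketch of this binned estimator (the function name hist_density and the choice to anchor the bins at the sample minimum are my own, not part of the lecture):

```python
import numpy as np

def hist_density(x0, data, h):
    """Histogram density estimate at x0, using bins of width 2h anchored at min(data)."""
    left = data.min() + 2 * h * ((x0 - data.min()) // (2 * h))  # left edge of x0's bin
    count = np.sum((data >= left) & (data < left + 2 * h))      # number of x_i in the same bin as x0
    return count / (2 * len(data) * h)
```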

Density Estimation

Alternatively, we can relax the condition that the m bins form a partition and instead count all of the x_i within an interval of width 2h centered at x_0:

\hat{f}(x_0) = \frac{1}{2nh} \sum_{i=1}^{n} I\left( \left| \frac{x_i - x_0}{h} \right| < 1 \right).
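The rolling-window version differs from the histogram only in centering the interval at x_0 itself. A sketch, reusing numpy from the block above (the name naive_density is my label):

```python
def naive_density(x0, data, h):
    """Count points within h of x0 and scale so the estimate integrates to one."""
    count = np.sum(np.abs((data - x0) / h) < 1)
    return count / (2 * len(data) * h)
```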

Density Estimation

We might think that the weight function above is not ideal, in the following senses:
- Within the interval, the data points should not all receive equal weight: those closer to x_0 should get more weight, and conversely for those farther away.
- In a similar spirit, perhaps all of the data points could potentially get some weight. (This, however, might not be warranted.)

To this end, we replace the indicator function above with a continuous weighting function (or kernel):

K\left( \frac{x_i - x_0}{h} \right).

Density Estimation

Though this is not always the case, we often think of K as a mean-zero symmetric density function. Thus we assign the highest weight to points near zero, i.e., to x_i values closest to x_0; points far away receive little or no weight. Symmetry of the kernel implies equal treatment of points the same distance below and above x_0. The above yields the kernel density estimator:

\hat{f}(x_0) = \frac{1}{nh_n} \sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h_n} \right).

What if K represents the uniform density on (−1, 1)?
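A minimal sketch of the kernel estimator (the function names are mine). The comment on the uniform kernel answers the question on the slide: with K the uniform density on (−1, 1), so that K(s) = (1/2) I(|s| < 1), the estimator collapses to the rolling-window estimator above.

```python
import numpy as np

def kde(x0, data, h, K):
    """Kernel density estimate: (1/(n*h)) * sum_i K((x_i - x0)/h)."""
    return np.mean(K((data - x0) / h)) / h

gauss = lambda s: np.exp(-0.5 * s**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
unif = lambda s: 0.5 * (np.abs(s) < 1)                      # uniform on (-1, 1): recovers naive_density
```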

Density Estimation

Some points to note:
1. x_0 is a fixed point. Thus, to recover an entire density estimate, we need to choose a grid of points and compute the above at every point in the grid (using a 3σ rule, perhaps).
2. h_n is called a bandwidth or smoothing parameter. If h_n is large, then K(·) is small and nearly equal across most points, resulting in an oversmoothed estimate. If h_n is small, then K(·) assigns nearly zero weight to all points but those very close to x_0, which yields jumpy estimates.
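A hedged usage sketch of the grid idea, building on kde and gauss above (the data, grid width, and bandwidth are illustrative choices of mine):

```python
data = np.random.default_rng(0).standard_normal(500)  # illustrative sample
grid = np.linspace(data.mean() - 3 * data.std(), data.mean() + 3 * data.std(), 200)  # 3-sigma grid
fhat = np.array([kde(x, data, h=0.3, K=gauss) for x in grid])
```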

Pointwise Consistency of the Kernel Estimator

We will show convergence in mean square, whence consistency. Assume:

1. \{x_i\} is iid.
2. \int K(s)\,ds = 1.
3. \int sK(s)\,ds = 0.
4. \int |s|^3 K(s)\,ds < \infty.
5. \int s^2 K^2(s)\,ds < \infty.
6. f(x) is three times differentiable, with bounded third derivative over the support of X.
7. h_n \to 0, \; nh_n \to \infty.

With these assumptions, we can show \hat{f}(x_0) \to_p f(x_0) for x_0 an interior point of the support of X.

Pointwise Consistency of the Kernel Estimator

To establish pointwise consistency, we separately consider the bias and the variance of the kernel estimator. As for the bias, note:

E[\hat{f}(x_0)] = \frac{1}{nh_n} \sum_{i=1}^{n} E\left[ K\left( \frac{x_i - x_0}{h_n} \right) \right] = \frac{1}{h_n} E\left[ K\left( \frac{x - x_0}{h_n} \right) \right] = \frac{1}{h_n} \int K\left( \frac{x - x_0}{h_n} \right) f(x)\,dx.

Pointwise Consistency of the Kernel Estimator

Now let us make a change of variables. Let s = (x - x_0)/h_n, which implies dx = h_n\,ds. Noting that the limits of integration stay the same, we can write the above as:

E[\hat{f}(x_0)] = \frac{1}{h_n} \int K(s)\, f(sh_n + x_0)\, h_n\,ds = \int K(s)\, f(sh_n + x_0)\,ds.

Pointwise Consistency of the Kernel Estimator

Now, let us take a Taylor series expansion of f(sh_n + x_0) around x_0:

f(sh_n + x_0) = f(x_0) + sh_n f'(x_0) + \frac{s^2 h_n^2}{2} f''(x_0) + \frac{h_n^3 s^3}{6} f'''(\bar{x}),

for some \bar{x} between sh_n + x_0 and x_0. So we can write the expectation of the kernel estimator as:

E[\hat{f}(x_0)] = \int K(s) \left[ f(x_0) + sh_n f'(x_0) + \frac{s^2 h_n^2}{2} f''(x_0) + \frac{h_n^3 s^3}{6} f'''(\bar{x}) \right] ds = f(x_0) + \frac{h_n^2}{2} f''(x_0) \int s^2 K(s)\,ds + O(h_n^3),

where the first-order term vanishes because \int sK(s)\,ds = 0.

Pointwise Consistency of the Kernel Estimator

To explain the very last term, let a_n be a nonstochastic sequence. We write a_n = O(n^{\alpha}) if |a_n| \le C n^{\alpha} for all n sufficiently large. Now, consider the remainder term of our expansion:

\left| \frac{h_n^3}{6} \int f'''(\bar{x})\, s^3 K(s)\,ds \right| \le C h_n^3 \int |s|^3 K(s)\,ds = O(h_n^3).

Pointwise Consistency of the Kernel Estimator

Therefore, the bias of the nonparametric density estimator is given as:

\text{Bias}[\hat{f}(x_0)] = \frac{h_n^2}{2} f''(x_0) \int s^2 K(s)\,ds + O(h_n^3).

Thus, provided h_n \to 0 and the above regularity conditions hold, the bias goes to zero, which completes the first part of our proof.

Pointwise Consistency of the Kernel Estimator

Thus, it remains to show that the variance of our estimator goes to zero:

Var(\hat{f}(x_0)) = Var\left( \frac{1}{nh_n} \sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h_n} \right) \right) = \frac{1}{n^2 h_n^2} Var\left( \sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h_n} \right) \right) = \frac{1}{n^2 h_n^2} \, n \, Var\left( K\left( \frac{x - x_0}{h_n} \right) \right) = \frac{1}{nh_n^2} \Bigg\{ \underbrace{E\left[ K^2\left( \frac{x - x_0}{h_n} \right) \right]}_{A} - \underbrace{E^2\left[ K\left( \frac{x - x_0}{h_n} \right) \right]}_{B} \Bigg\}.

Pointwise Consistency of the Kernel Estimator

Using our previous methods for the bias of the kernel estimator, one can show that (nh_n^2)^{-1} B \to 0. Now, consider A:

\frac{1}{nh_n^2} A = \frac{1}{nh_n^2} \int K^2(s)\, f(sh_n + x_0)\, h_n\,ds = \frac{1}{nh_n} \int K^2(s) \left[ f(x_0) + sh_n f'(x_0) + \frac{s^2 h_n^2}{2} f''(x_0) + R \right] ds = \frac{1}{nh_n} f(x_0) \int K^2(s)\,ds + \frac{1}{n} f'(x_0) \int s K^2(s)\,ds + \frac{h_n}{2n} f''(x_0) \int s^2 K^2(s)\,ds + \cdots

Pointwise Consistency of the Kernel Estimator

Clearly, each of these terms converges to zero, provided nh_n \to \infty. Thus, under the stated conditions, the kernel estimator is consistent. Also, keeping only the leading terms, the variance is:

Var(\hat{f}(x_0)) \approx \frac{1}{nh_n} f(x_0) \int K^2(s)\,ds,

and the bias is:

\text{Bias} \approx \frac{h_n^2}{2} f''(x_0) \int s^2 K(s)\,ds.
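These two leading terms can be checked by simulation. The sketch below is my own construction, reusing kde and gauss from earlier: it draws standard normal samples and compares the Monte Carlo bias and variance at x_0 = 0 with the asymptotic formulas, using the facts that for the Gaussian kernel \int s^2 K(s)\,ds = 1 and \int K^2(s)\,ds = 1/(2\sqrt{\pi}), and that for standard normal data f''(0) = -f(0).

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, x0, reps = 2000, 0.3, 0.0, 500
est = np.array([kde(x0, rng.standard_normal(n), h, gauss) for _ in range(reps)])
f0 = gauss(0.0)  # true density at 0; also equals -f''(0) for N(0,1)
print("bias:", est.mean() - f0, "vs theory:", 0.5 * h**2 * (-f0))
print("var :", est.var(), "vs theory:", f0 / (2 * np.sqrt(np.pi) * n * h))
```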

A few observations

1. We see an obvious tradeoff: smaller bandwidths decrease the bias of our estimator but also increase its variance. Thus, there should be some type of optimal bandwidth selector.
2. The bandwidth conditions are like counterfactuals: they describe how the bandwidth would have to behave if the sample size increased indefinitely.
3. The conditions h_n \to 0 and nh_n \to \infty are offsetting. The intuition is that for larger sample sizes we look at tighter local neighborhoods, but we do not let the size of this neighborhood shrink too fast!

Optimal Bandwidth Selection

Given the bias-variance tradeoff, a natural criterion to use for bandwidth selection is mean squared error:

MSE = \text{Bias}^2 + \text{Variance} = \left[ \frac{1}{2} h_n^2 f''(x) \int s^2 K(s)\,ds \right]^2 + \frac{1}{nh_n} f(x) \int K^2(s)\,ds.

Here we have ignored the higher-order terms in the bias and variance expressions. The above applies at a particular point x. To get a global criterion, we look at mean integrated squared error.

Optimal Bandwidth Selection

Define K_2 \equiv \int s^2 K(s)\,ds. Then:

MISE = \frac{1}{4} h_n^4 K_2^2 \int (f''(x))^2\,dx + \frac{1}{nh_n} \int K^2(s)\,ds.

To find a bandwidth which minimizes this criterion, we take the first-order condition with respect to h_n:

(h_n^*)^3 K_2^2 \int (f''(x))^2\,dx - \frac{1}{n (h_n^*)^2} \int K^2(s)\,ds = 0.

Optimal Bandwidth Selection

This implies:

(h_n^*)^5 = \frac{1}{n} \left[ \int K^2(s)\,ds \right] \left[ \int (f''(x))^2\,dx \right]^{-1} K_2^{-2}.

So, we have the optimal bandwidth:

h_n^* = n^{-1/5} \left[ \int K^2(s)\,ds \right]^{1/5} \left[ \int (f''(x))^2\,dx \right]^{-1/5} K_2^{-2/5}.

Some points which emerge from the above are worth discussing:
- The optimal bandwidth depends on the unknown density f(x)! This is problematic, since this is the very function we are trying to estimate.
- As n \to \infty, h_n^* \to 0 and nh_n^* \to \infty, so the consistency conditions are satisfied.
- If |f''| is large, then the density fluctuates rapidly, and the above tells us that h_n^* will be smaller. This is sensible: when the true density fluctuates rapidly, choose a small bandwidth to capture the local behavior.
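To see where the constant on the next slide comes from, plug a Gaussian kernel and f = N(0, \sigma^2) into h_n^*: then \int K^2(s)\,ds = 1/(2\sqrt{\pi}), K_2 = 1, and \int (f''(x))^2\,dx = 3/(8\sqrt{\pi}\,\sigma^5), so h_n^* = (4/3)^{1/5}\,\sigma\, n^{-1/5} \approx 1.06\,\sigma\, n^{-1/5}. A one-line check (my own, not part of the lecture):

```python
print((4 / 3) ** 0.2)  # 1.0591..., the 1.06 in the rule of thumb below
```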

Rule-of-Thumb Bandwidth Selector

A simple rule of thumb is to assume (as in Silverman (1986)) that f and K are normal. This yields the optimal bandwidth:

h_n^* = 1.06\, \hat{\sigma}\, n^{-1/5}.

A similar rule, argued by Silverman to be more robust, is:

h_n^* = 0.9\, A\, n^{-1/5}, \quad A = \min\{\text{std. dev.}(x),\ \text{interquartile range}(x)/1.34\}.
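A sketch of the robust rule (the function name is mine; the min of the standard deviation and IQR/1.34 follows the slide):

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's robust rule of thumb: h = 0.9 * A * n^(-1/5)."""
    q75, q25 = np.percentile(data, [75, 25])
    A = min(data.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * A * len(data) ** (-0.2)
```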

Popular Kernels

Name           Density function              Support
Epanechnikov   (3/(4√5))(1 − x²/5)           |x| ≤ √5
Biweight       (15/16)(1 − x²)²              |x| ≤ 1
Triangular     1 − |x|                       |x| ≤ 1
Gaussian       (1/√(2π)) exp(−x²/2)          −∞ < x < ∞
Rectangular    1/2                           |x| ≤ 1

In terms of computation, it is clear that kernels with compact support are most attractive. In this way, we can save time by only determining the weights for those points to which the kernel gives nonzero weight.
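The compact-support point is easy to exploit in code: writing each kernel with np.maximum(..., 0) makes it vanish outside its support, so only nearby points contribute. A sketch of the table's kernels (my transcription):

```python
import numpy as np

epanechnikov = lambda s: 3 / (4 * np.sqrt(5)) * np.maximum(1 - s**2 / 5, 0)
biweight     = lambda s: 15 / 16 * np.maximum(1 - s**2, 0) ** 2
triangular   = lambda s: np.maximum(1 - np.abs(s), 0)
rectangular  = lambda s: 0.5 * (np.abs(s) <= 1)
```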

Multivariate Density Estimation

Let x_0 \in \mathbb{R}^d. The multivariate kernel density estimator is defined as:

\hat{f}(x_0) = \frac{1}{n|H|} \sum_{i=1}^{n} K\left[ H^{-1}(x_i - x_0) \right],

where K: \mathbb{R}^d \to \mathbb{R} is a d-dimensional kernel, H is a d × d bandwidth (smoothing) matrix, and |H| is its determinant. In practice, the kernel K is often selected as a product kernel:

K(u) = \prod_{j=1}^{d} k(u_j),

where k is a univariate kernel.
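A minimal multivariate sketch with a product Gaussian kernel (the function name is mine; data is assumed to be an (n, d) array and x0 a length-d vector):

```python
import numpy as np

def mv_kde(x0, data, H):
    """Multivariate KDE: (1/(n*|det H|)) * sum_i K(H^{-1}(x_i - x0)), product Gaussian K."""
    u = (data - x0) @ np.linalg.inv(H).T                            # rows are H^{-1}(x_i - x0)
    K = np.prod(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi), axis=1)   # product kernel
    return K.mean() / abs(np.linalg.det(H))
```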

Choosing the Bandwidth Matrix

There are several popular alternatives here, and all of them are rather closely related:

1. H = hI_d. This choice is sensible only if all of the individual x variables are on the same scale. Otherwise, you will be choosing a constant degree of smoothing for variables with both high and low variances.
2. H = diag\{h_1, h_2, \ldots, h_d\}. This is similar to the above. Let s_j denote the scaling factor for the j-th variable, and let H = h \cdot diag\{s_1, s_2, \ldots, s_d\}. Then this can be regarded as an application of (1) after first transforming all variables to have unit variance.
3. H = hS^{1/2}, where S is an estimate of the covariance matrix of x, perhaps from the sample: \hat{S} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'.
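A hedged sketch of the three choices, assuming data is an (n, d) array; I take the Cholesky factor as one valid square root of S, consistent with the factorization S = MM' used on the next slide:

```python
import numpy as np

d, h = data.shape[1], 0.5                       # illustrative bandwidth
H1 = h * np.eye(d)                              # (1) common bandwidth
H2 = h * np.diag(data.std(axis=0, ddof=1))      # (2) scale each variable
S = np.cov(data, rowvar=False)
H3 = h * np.linalg.cholesky(S)                  # (3) h * S^{1/2}
```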

Choosing the Bandwidth Matrix

For simplicity in notation, let S^{1/2} = M, so that S = MM'. (Such a factorization is possible since S is positive definite symmetric.) In addition, note that:

Var(M^{-1}x) = E\left( [M^{-1}x - E(M^{-1}x)][M^{-1}x - E(M^{-1}x)]' \right) = M^{-1} E[(x - E(x))(x - E(x))'] (M')^{-1} = M^{-1} S (M')^{-1} = M^{-1} M M' (M')^{-1} = I_d.

Thus the transformed data M^{-1}x have unit covariance matrix. This process is often called sphering the data.
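Sphering in code, continuing with the (n, d) array data from the previous sketch (Cholesky again plays the role of M):

```python
M = np.linalg.cholesky(np.cov(data, rowvar=False))  # S = M M'
y = data @ np.linalg.inv(M).T                       # rows are M^{-1} x_i
print(np.cov(y, rowvar=False))                      # approximately the identity matrix
```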

Choosing the Bandwidth Matrix

Now, consider applying our rule (3) above:

\hat{f}(x_0) = \frac{1}{n h^d |M|} \sum_{i=1}^{n} K\left( \frac{1}{h} M^{-1}(x_i - x_0) \right),

since |hM| = h^d |M|. Let y = M^{-1}x, so that x = My. By a change of variables, the above becomes:

\hat{f}(y_0) = \frac{1}{n h^d |M|} \, |M| \sum_{i=1}^{n} K\left( \frac{1}{h}(y_i - y_0) \right) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left( \frac{1}{h}(y_i - y_0) \right).

Choosing the Bandwidth Matrix

The second expression is simply a multivariate density estimate for the transformed data y, obtained by applying H = hI_d as in (1). Thus, the approach in (3) is equivalent to one that linearly transforms the data to have unit covariance matrix, applies the bandwidth selection rule in (1), and then re-transforms back to the x scale.

There remains the issue of choosing h. One popular rule of thumb is to again assume a Gaussian kernel and a Gaussian true density, and to minimize the asymptotic mean integrated squared error.

Choosing the Bandwidth Matrix

This yields the rule:

H = \left( \frac{4}{d+2} \right)^{1/(d+4)} \Sigma^{1/2}\, n^{-1/(d+4)}.

Note that the form of H for this particular case is exactly of the form H = h_n S^{1/2}.
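A sketch of this normal-reference rule (the function name is mine; I again take the Cholesky factor as the square root and plug in the sample covariance for Σ):

```python
import numpy as np

def normal_reference_H(data):
    """H = (4/(d+2))^(1/(d+4)) * S^(1/2) * n^(-1/(d+4)), S the sample covariance."""
    n, d = data.shape
    M = np.linalg.cholesky(np.cov(data, rowvar=False))  # one square root of S
    return (4 / (d + 2)) ** (1 / (d + 4)) * M * n ** (-1 / (d + 4))
```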