Combining cross-validation and plug-in methods for kernel density bandwidth selection


Combining cross-validation and plug-in methods for kernel density bandwidth selection
Carlos Tenreiro
CMUC and DMUC, University of Coimbra
PhD Program UC|UP, February 18, 2011

Overview

- The nonparametric density estimation problem
- The Parzen-Rosenblatt kernel density estimator
- Cross-validation and plug-in methods for bandwidth selection
- Combining cross-validation and plug-in methods (based on a recent joint work with J.E. Chacón, Universidad de Extremadura, Spain), in order to obtain a data-based selector that presents an overall good performance for a large set of underlying densities.

The nonparametric density estimation problem

We have observations $X_1, X_2, \ldots, X_n$: independent and identically distributed real-valued random variables with unknown probability density function $f$:
$$P(a < X < b) = \int_a^b f(x)\,dx, \quad \text{for all } -\infty < a < b < +\infty.$$
We want to estimate $f$ based on the previous observations. The goal in nonparametric density estimation is to estimate $f$ making only minimal assumptions about $f$.

Exploring data is one of the goals of nonparametric density estimation.

Hidalgo Stamp Data: thickness of 485 postage stamps that were printed over a long time in Mexico during the 19th century.

[Histogram of stamp thickness (in mm.) against frequency; bin width = 0.003]

The idea is to gain insights into the number of different types of paper that were used to print the postage stamps.

Kernel density estimation

In this talk we restrict our attention to another well-known density estimator, introduced by Rosenblatt (1956) and Parzen (1962): the kernel density estimator. The motivation given by Rosenblatt (1956) for this estimator is based on the fact that $f(x) = F'(x)$, where the cumulative distribution function $F$ can be estimated by the empirical distribution function given by
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{]-\infty,x]}(X_i), \qquad \text{with } I_A(y) = \begin{cases} 1 & \text{if } y \in A \\ 0 & \text{if } y \notin A. \end{cases}$$

Kernel density estimation

If $h$ is a small positive number we could expect that
$$f(x) \approx \frac{F(x+h) - F(x-h)}{2h} \approx \frac{F_n(x+h) - F_n(x-h)}{2h} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2h}\, I_{]x-h,x+h]}(X_i) = \frac{1}{nh} \sum_{i=1}^{n} K_0\!\left(\frac{x - X_i}{h}\right),$$
where $K_0(\cdot) = \frac{1}{2}\, I_{]-1,1]}(\cdot)$.

Kernel density estimation

The Parzen-Rosenblatt kernel estimator is obtained by replacing $K_0$ by a general symmetric density function $K$:
$$f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i), \qquad K_h(u) = h^{-1} K(u/h),$$
where:
- $h = h_n$, the bandwidth or smoothing parameter, is a sequence of strictly positive real numbers converging to zero as $n$ tends to infinity;
- $K$, the kernel, is a bounded and symmetric density function.

Contrary to the histogram, the Parzen-Rosenblatt estimator gives regular estimates for $f$ if we take for $K$ a regular density function.
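As a concrete illustration (not from the talk), here is a minimal Python sketch of the Parzen-Rosenblatt estimator; the Gaussian kernel, the simulated sample and the evaluation grid are my choices:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: K(u) = (2*pi)^(-1/2) exp(-u^2 / 2)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Parzen-Rosenblatt estimate f_h(x) = (nh)^(-1) sum_i K((x - X_i) / h)."""
    u = (np.asarray(x)[:, None] - np.asarray(data)[None, :]) / h
    return kernel(u).mean(axis=1) / h

# Usage: evaluate the estimate of a simulated sample on a grid.
rng = np.random.default_rng(0)
sample = rng.normal(size=485)
grid = np.linspace(-3.0, 3.0, 200)
f_hat = kde(grid, sample, h=0.3)
```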

Kernel density estimation

The choice of the bandwidth is a crucial issue for kernel density estimation. Kernel estimates for the Hidalgo Stamp Data (n = 485):

[Two kernel estimates of thickness (in mm.): left panel with h = 0.005, right panel with h = 0.0005; kernel $K(x) = (2\pi)^{-1/2} e^{-x^2/2}$]

The role played by h

For the mean integrated squared error
$$\mathrm{MISE}(f; n, h) = E \int \{f_h(x) - f(x)\}^2\, dx,$$
we have
$$\mathrm{MISE}(f; n, h) = \int \mathrm{Var}\, f_h(x)\, dx + \int \{E f_h(x) - f(x)\}^2\, dx \approx \frac{1}{nh} \int K^2(u)\, du + \frac{h^4}{4} \left(\int u^2 K(u)\, du\right)^2 \int f''(x)^2\, dx.$$

If $h$ is too small we obtain an estimate with a small bias but with a large variability. If $h$ is too large we obtain an estimate with a large bias but with a small variability.
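Minimising this asymptotic approximation over $h$ makes the trade-off explicit. The following short derivation is a standard computation, not on the original slide; it produces the approximation $h_0$ used by the plug-in method later in the talk:
$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4}{4}\, \mu_2(K)^2\, R(f''), \qquad R(g) = \int g^2(u)\,du, \quad \mu_2(K) = \int u^2 K(u)\,du.$$
Setting $\frac{d}{dh}\mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + h^3 \mu_2(K)^2 R(f'') = 0$ gives
$$h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2(K)^2\, R(f'')\, n}\right)^{1/5} = R(K)^{1/5}\, \mu_2(K)^{-2/5}\, \psi_4^{-1/5}\, n^{-1/5},$$
since $\psi_4 = \int f^{(4)}(x) f(x)\,dx = R(f'')$ by two integrations by parts (for smooth $f$ with vanishing tails).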

The role played by h

Choosing the bandwidth corresponds to balancing bias and variance.

[Figure: kernel estimate with n = 1000 and h = 0.1]

Undersmoothing: small h gives a small bias but a large variability.

The role played by h

Choosing the bandwidth corresponds to balancing bias and variance.

[Figure: kernel estimate with n = 1000 and h = 0.5]

Oversmoothing: large h gives a small variability but a large bias.

The role played by h

The main challenge in smoothing is to determine how much smoothing to do.

[Figure: kernel estimate with n = 1000 and h = 0.36, almost right]

The choice of the kernel is not so relevant for the behaviour of the estimator.

Bandwidth selectors

Under general conditions on the kernel and on the underlying density function, for each $n \in \mathbb{N}$ there exists an optimal bandwidth $h_{\mathrm{MISE}} = h_{\mathrm{MISE}}(n; f)$ in the sense that
$$\mathrm{MISE}(f; n, h_{\mathrm{MISE}}) \le \mathrm{MISE}(f; n, h), \quad \text{for all } h > 0.$$
But $h_{\mathrm{MISE}}$ depends on the unknown density $f$...

We are interested in methods for choosing $h$ that are based on the observations $X_1, \ldots, X_n$,
$$\hat{h} = \hat{h}(X_1, \ldots, X_n),$$
and satisfy $\hat{h}(X_1, \ldots, X_n)/h_{\mathrm{MISE}} \to 1$, for a large class of densities.

Cross-validation bandwidth selection

For each $h > 0$, we start by considering an unbiased estimator of $\mathrm{MISE}(f; n, h) - R(f)$, given by
$$\mathrm{CV}(h) = \frac{R(K)}{nh} + \frac{1}{n(n-1)} \sum_{i \ne j} \left(\frac{n-1}{n}\, K_h * K_h - 2 K_h\right)(X_i - X_j),$$
and we take for $\hat{h}$ the value $\hat{h}_{\mathrm{CV}}$ that minimises $\mathrm{CV}(h)$. (Rudemo, 1982; Bowman, 1984)

Under some regularity conditions on $f$ and $K$ we have
$$\frac{\hat{h}_{\mathrm{CV}}}{h_{\mathrm{MISE}}} - 1 = O_p\big(n^{-1/10}\big).$$
(Hall, 1983; Hall & Marron, 1987)
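For the Gaussian kernel the convolution $K_h * K_h$ is again a normal density, with scale $h\sqrt{2}$, so $\mathrm{CV}(h)$ has a closed form. A sketch, assuming $K = \phi$ (hence $R(K) = 1/(2\sqrt{\pi})$); the grid minimiser is my crude stand-in for a proper one-dimensional optimiser:

```python
import numpy as np

def normal_density(u, s):
    """N(0, s^2) density evaluated at u."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

def cv(h, data):
    """Least-squares cross-validation criterion CV(h), Gaussian kernel."""
    n = len(data)
    diff = data[:, None] - data[None, :]
    off_diag = ~np.eye(n, dtype=bool)  # keep only the i != j terms
    integrand = ((n - 1) / n) * normal_density(diff, h * np.sqrt(2.0)) \
        - 2.0 * normal_density(diff, h)
    return 1.0 / (2.0 * np.sqrt(np.pi) * n * h) \
        + integrand[off_diag].sum() / (n * (n - 1))

def h_cv(data, grid=None):
    """CV bandwidth by grid search over a range proportional to the scale."""
    if grid is None:
        grid = np.geomspace(0.01, 2.0, 100) * np.std(data, ddof=1)
    return min(grid, key=lambda h: cv(h, data))
```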

Plug-in bandwidth selection

The plug-in method is based on a simple idea that goes back to Woodroofe (1970). We start with an asymptotic approximation $h_0$ of the optimal bandwidth $h_{\mathrm{MISE}}$:
$$h_0 = c_K\, \psi_4^{-1/5}\, n^{-1/5},$$
where
$$c_K = R(K)^{1/5} \left(\int u^2 K(u)\, du\right)^{-2/5} \quad \text{and} \quad \psi_r = \int f^{(r)}(x) f(x)\, dx, \quad r = 0, 2, 4, \ldots$$
The plug-in selector is obtained by replacing the unknown quantities in $h_0$ by consistent estimators:
$$\hat{h}_{\mathrm{PI}} = c_K\, \hat{\psi}_4^{-1/5}\, n^{-1/5}.$$

Estimating the functional ψ_r

A class of kernel estimators of $\psi_r$ was introduced by Hall and Marron (1987a, 1991) and Jones and Sheather (1991):
$$\hat{\psi}_r(g) = \frac{1}{n^2} \sum_{i,j=1}^{n} U_g^{(r)}(X_i - X_j),$$
where $g$ is a new bandwidth and $U$ is a bounded, symmetric and $r$-times differentiable kernel. For $U = \phi$, the bandwidth that minimises the asymptotic mean squared error of $\hat{\psi}_r(g)$ is given by
$$g_{0,r} = \left(-2\, \phi^{(r)}(0)\, \psi_{r+2}^{-1}\right)^{1/(r+3)} n^{-1/(r+3)}.$$
In the case of the estimation of $\psi_4$, this bandwidth depends (again) on the unknown quantity $\psi_6$!
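With $U = \phi$ this estimator can be coded directly, using the identity $\phi^{(r)}(u) = (-1)^r \mathrm{He}_r(u)\, \phi(u)$, where $\mathrm{He}_r$ is the probabilists' Hermite polynomial. A sketch; evaluating $\mathrm{He}_r$ through numpy's hermite_e module is my choice:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def phi_deriv(u, r):
    """r-th derivative of the standard normal density:
    phi^(r)(u) = (-1)^r He_r(u) phi(u)."""
    he_r = hermeval(u, [0.0] * r + [1.0])  # coefficient vector selecting He_r
    return (-1) ** r * he_r * np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def psi_hat(data, r, g):
    """Kernel estimator of psi_r with U = phi:
    psi_hat_r(g) = n^(-2) g^(-(r+1)) sum_{i,j} phi^(r)((X_i - X_j) / g),
    using U_g^(r)(x) = g^(-(r+1)) U^(r)(x / g)."""
    n = len(data)
    diff = (data[:, None] - data[None, :]) / g
    return phi_deriv(diff, r).sum() / (n**2 * g ** (r + 1))
```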

Multistage plug-in estimation of ψ_r

In order to estimate $\psi_4$ we then have the following schema:

to estimate          | we consider                                          | we need
$\psi_4$             | $\hat{\psi}_4(\hat{g}_{0,4})$                        | $\psi_6$
$\psi_6$             | $\hat{\psi}_6(\hat{g}_{0,6})$                        | $\psi_8$
$\psi_8$             | $\hat{\psi}_8(\hat{g}_{0,8})$                        | $\psi_{10}$
...                  | ...                                                  | ...
$\psi_{4+2(l-1)}$    | $\hat{\psi}_{4+2(l-1)}(\hat{g}_{0,4+2(l-1)})$        | $\psi_{4+2l}$

where $\hat{g}_{0,r} = \left(-2\, \phi^{(r)}(0)\, \hat{\psi}_{r+2}^{-1}\right)^{1/(r+3)} n^{-1/(r+3)}$.

Multistage plug-in estimation of ψ_4

The usual strategy to stop this cyclic process is to use a parametric estimator of $\psi_{4+2l}$ based on some parametric reference distribution family. The standard choice for the reference distribution family is the normal or Gaussian family:
$$f(x) = (2\pi\sigma^2)^{-1/2} e^{-x^2/(2\sigma^2)}.$$
In this case, $\psi_{4+2l}$ is estimated by
$$\hat{\psi}^{\mathrm{NR}}_{4+2l} = \phi^{(4+2l)}(0)\, (2\hat{\sigma}^2)^{-(5+2l)/2},$$
where $\hat{\sigma}$ denotes any scale estimate.
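The normal-reference estimate only needs $\phi^{(r)}(0)$, which for even $r$ equals $(-1)^{r/2}(r-1)!!/\sqrt{2\pi}$ (a standard identity, not on the slide). A sketch:

```python
import numpy as np

def phi_deriv_at_zero(r):
    """phi^(r)(0) = (-1)^(r/2) (r-1)!! / sqrt(2*pi), valid for even r."""
    assert r % 2 == 0, "odd-order derivatives of phi vanish at zero"
    double_factorial = 1
    for k in range(r - 1, 0, -2):  # (r-1)!! = (r-1)(r-3)...3*1
        double_factorial *= k
    return (-1) ** (r // 2) * double_factorial / np.sqrt(2 * np.pi)

def psi_nr(r, sigma):
    """Normal-reference estimate psi_r^NR = phi^(r)(0) * (2*sigma^2)^(-(r+1)/2)."""
    return phi_deriv_at_zero(r) * (2 * sigma**2) ** (-(r + 1) / 2)
```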

Multistage plug-in estimation of ψ_4

For a fixed $l \in \{1, 2, \ldots\}$ the $l$-stage estimator of $\psi_4$ is obtained through the chain
$$\hat{\psi}^{\mathrm{NR}}_{4+2l} \;\to\; \hat{g}_{0,4+2(l-1)} \;\to\; \hat{\psi}_{4+2(l-1)} \;\to\; \hat{g}_{0,4+2(l-2)} \;\to\; \hat{\psi}_{4+2(l-2)} \;\to\; \cdots \;\to\; \hat{g}_{0,4} \;\to\; \hat{\psi}_4 =: \hat{\psi}_{4,l}.$$

Multistage plug-in selector

Depending on the number $l \in \{1, 2, \ldots\}$ of considered pilot stages of estimation we get different estimators $\hat{\psi}_{4,l}$ of $\psi_4$. The associated $l$-stage plug-in selector for the kernel density estimator is given by
$$\hat{h}_{\mathrm{PI},l} = c_K\, \hat{\psi}_{4,l}^{-1/5}\, n^{-1/5}.$$
If $f$ has bounded derivatives up to order $4 + 2l$ then
$$\frac{\hat{h}_{\mathrm{PI},l}}{h_{\mathrm{MISE}}} - 1 = O_p(n^{-\alpha}),$$
with $\alpha = 2/7$ for $l = 1$ and $\alpha = 5/14$ for all $l \ge 2$. (CT, 2003)
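Putting the pieces together, a sketch of the $l$-stage plug-in selector for the Gaussian kernel, reusing psi_nr, psi_hat and phi_deriv_at_zero from the sketches above; the sample standard deviation as scale estimate is my choice:

```python
import numpy as np

def psi4_multistage(data, l):
    """l-stage plug-in estimate of psi_4: start from the normal-reference
    estimate of psi_{4+2l}, then walk down through r = 4+2(l-1), ..., 6, 4."""
    n = len(data)
    sigma = np.std(data, ddof=1)      # scale estimate (my choice)
    psi = psi_nr(4 + 2 * l, sigma)    # parametric start: psi_{4+2l}
    for r in range(4 + 2 * (l - 1), 2, -2):
        # pilot bandwidth g_{0,r}, built from the current estimate of psi_{r+2}
        g = (-2.0 * phi_deriv_at_zero(r) / (psi * n)) ** (1.0 / (r + 3))
        psi = psi_hat(data, r, g)     # psi now estimates psi_r
    return psi

def h_plugin(data, l=2):
    """l-stage plug-in bandwidth h = c_K psi_4^(-1/5) n^(-1/5); for K = phi,
    c_K = R(K)^(1/5) = (2*sqrt(pi))^(-1/5), since mu_2(phi) = 1."""
    n = len(data)
    c_k = (2.0 * np.sqrt(np.pi)) ** (-0.2)
    return c_k * psi4_multistage(data, l) ** (-0.2) * n ** (-0.2)
```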

Two-stage plug-in selector

For the standard choice $l = 2$ we have the chain
$$\hat{\psi}^{\mathrm{NR}}_8 \;\to\; \hat{g}_{0,6} \;\to\; \hat{\psi}_6 \;\to\; \hat{g}_{0,4} \;\to\; \hat{\psi}_4 =: \hat{\psi}_{4,2}.$$
The associated two-stage plug-in selector is given by
$$\hat{h}_{\mathrm{PI},2} = c_K\, \hat{\psi}_{4,2}^{-1/5}\, n^{-1/5}.$$

Finite sample behaviour

[Boxplots of the ISE for the two-stage plug-in (PI.2) and cross-validation (CV) selectors, n = 400: standard normal density (left panel), skewed unimodal density (right panel)]

Finite sample behaviour

[Boxplots of the ISE for the two-stage plug-in (PI.2) and cross-validation (CV) selectors, n = 400: strongly skewed density (left panel), asymmetric multimodal density (right panel)]

Finite sample behaviour

[Boxplots of the ISE for the l-stage plug-in selector as a function of the number of stages l = 1, 3, 5, ..., 19, n = 400: standard normal density (left panel), skewed unimodal density (right panel)]

Finite sample behaviour

[Boxplots of the ISE for the l-stage plug-in selector as a function of the number of stages l = 1, 3, 5, ..., 19, n = 400: strongly skewed density (left panel), asymmetric multimodal density (right panel)]

Multistage plug-in selector

From a finite-sample point of view the performance of $\hat{h}_{\mathrm{PI},l}$ strongly depends on the considered number of stages. For strongly skewed or asymmetric multimodal densities the standard choice $l = 2$ gives poor results. The natural question that arises from the previous considerations is: how can we choose the number of pilot stages $l$? This is an old question, posed by Park and Marron (1992). In order to answer it, the idea developed by Chacón and CT (2008) was to combine the plug-in and cross-validation methods.

Combining plug-in and cross-validation procedures

We started by fixing a minimum and a maximum number of pilot stages, $\underline{L}$ and $\bar{L}$, and by choosing a stage $l$ among the set of possible pilot stages
$$\mathcal{L} = \{\underline{L}, \underline{L}+1, \ldots, \bar{L}\}.$$
This is equivalent to selecting one of the selectors
$$\hat{h}_{\mathrm{PI},l} = c_K\, \hat{\psi}_{4,l}^{-1/5}\, n^{-1/5}, \quad l \in \mathcal{L}.$$
Recall that each one of these selectors has good asymptotic properties.

Combining plug-in and cross-validation procedures

In order to select one of the previous multistage plug-in selectors we consider a weighted version of the cross-validation criterion function given by
$$\mathrm{CV}_\gamma(h) = \frac{R(K)}{nh} + \frac{\gamma}{n(n-1)} \sum_{i \ne j} \left(\frac{n-1}{n}\, K_h * K_h - 2 K_h\right)(X_i - X_j),$$
for some $0 < \gamma \le 1$ that needs to be fixed by the user. Finally, we take the bandwidth
$$\hat{h}_{\mathrm{PI},\hat{l}} = c_K\, \hat{\psi}_{4,\hat{l}}^{-1/5}\, n^{-1/5}, \quad \text{where } \hat{l} = \operatorname*{argmin}_{l \in \mathcal{L}} \mathrm{CV}_\gamma(\hat{h}_{\mathrm{PI},l}).$$
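A sketch of the combined selector, reusing normal_density and h_plugin from the sketches above; the defaults $\underline{L} = 2$, $\bar{L} = 30$ and $\gamma = 0.6$ follow the recommendations made later in the talk:

```python
import numpy as np

def cv_gamma(h, data, gamma):
    """Weighted cross-validation criterion CV_gamma(h), Gaussian kernel."""
    n = len(data)
    diff = data[:, None] - data[None, :]
    off_diag = ~np.eye(n, dtype=bool)
    integrand = ((n - 1) / n) * normal_density(diff, h * np.sqrt(2.0)) \
        - 2.0 * normal_density(diff, h)
    return 1.0 / (2.0 * np.sqrt(np.pi) * n * h) \
        + gamma * integrand[off_diag].sum() / (n * (n - 1))

def h_combined(data, l_min=2, l_max=30, gamma=0.6):
    """Among the multistage plug-in bandwidths h_PI,l, l_min <= l <= l_max,
    return the one minimising CV_gamma."""
    candidates = [h_plugin(data, l) for l in range(l_min, l_max + 1)]
    return min(candidates, key=lambda h: cv_gamma(h, data, gamma))
```

Note that each candidate $l$ requires its own chain of pilot estimates (the chains start from different normal-reference stages), so the cost of this sketch grows linearly in $\bar{L}$.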

Asymptotic behaviour

If $f$ has bounded derivatives up to order $4 + 2\bar{L}$ and
$$\psi_{4+2l} \ne \psi^{\mathrm{NR}}_{4+2l}(\sigma_f), \quad \text{for all } l = 1, 2, \ldots, \bar{L}, \tag{1}$$
then
$$\frac{\hat{h}_{\mathrm{PI},\hat{l}}}{h_{\mathrm{MISE}}} - 1 = O_p(n^{-\alpha}),$$
with $\alpha = 2/7$ for $\underline{L} = 1$ and $\alpha = 5/14$ for $\underline{L} \ge 2$. (Chacón & CT, 2008)

Condition (1) is not very restrictive due to the smoothness of the normal distribution. This result justifies the recommendation of using $\underline{L} = 2$.

Asymptotic behaviour

Proof: From the positive-definiteness property of the class of Gaussian-based kernels used in the multistage estimation process one can prove that $P(\Omega_{\underline{L},\bar{L}}) \to 1$, where
$$\Omega_{\underline{L},\bar{L}} = \left\{\hat{h}_{\mathrm{PI},\bar{L}} \le \hat{h}_{\mathrm{PI},\bar{L}-1} \le \ldots \le \hat{h}_{\mathrm{PI},\underline{L}+1} \le \hat{h}_{\mathrm{PI},\underline{L}}\right\}.$$
The conclusion follows easily from the asymptotic behaviour of $\hat{h}_{\mathrm{PI},\underline{L}}$ and $\hat{h}_{\mathrm{PI},\bar{L}}$, since for a sample in $\Omega_{\underline{L},\bar{L}}$ we have
$$\frac{\hat{h}_{\mathrm{PI},\bar{L}}}{h_{\mathrm{MISE}}} - 1 \;\le\; \frac{\hat{h}_{\mathrm{PI},\hat{l}}}{h_{\mathrm{MISE}}} - 1 \;\le\; \frac{\hat{h}_{\mathrm{PI},\underline{L}}}{h_{\mathrm{MISE}}} - 1.$$

On the role played by L̄ and γ

[Boxplots of the distribution of $\mathrm{ISE}(\hat{h}_{\mathrm{PI},\hat{l}})$ as a function of $\bar{L} \in \{10, 20, 30\}$ and $\gamma \in \{0.3, 0.5, 0.7\}$, with PI.2 and CV shown for comparison (n = 200): standard normal density (left panel), claw density (right panel)]

Choosing L̄ and γ in practice

The boxplots show that a larger value for $\bar{L}$ is recommended, especially for hard-to-estimate densities. The new selector $\hat{h}_{\mathrm{PI},\hat{l}}$ is quite robust against the choice of $\bar{L}$ whenever a sufficiently large value is taken for $\bar{L}$. We decided to take $\bar{L} = 30$.

Regarding the choice of $\gamma$, small values of $\gamma$ are more appropriate for easy-to-estimate densities, whereas large values of $\gamma$ are more appropriate for hard-to-estimate densities. In order to find a compromise between these two situations we decided to take $\gamma = 0.6$.

We expect to obtain a new data-based selector that presents a good overall performance for a wide range of density features.

Finite sample behaviour

[Boxplots of the ISE for the two-stage plug-in (PI.2), the new combined (New) and the cross-validation (CV) selectors, n = 400: standard normal density (left panel), skewed unimodal density (right panel)]

Finite sample behaviour

[Boxplots of the ISE for the two-stage plug-in (PI.2), the new combined (New) and the cross-validation (CV) selectors, n = 400: strongly skewed density (left panel), asymmetric multimodal density (right panel)]

Kernel estimate for the Hidalgo Stamp Data

[Kernel density estimate of stamp thickness (in mm.) with the new selector: h = 0.0013]

From this plot we can identify seven modes... so seven different types of paper were (probably) used.

References

Izenman, A.J., Sommer, C.J. (1988). Philatelic mixtures and multimodal densities. J. Amer. Statist. Assoc. 83, 941–953.

Chacón, J.E., Tenreiro, C. (2008). A data-based method for choosing the number of pilot stages for plug-in bandwidth selection. Pré-publicação 08-59, DMUC (revised, January 2011).

Wand, M.P., Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall.