Combining cross-validation and plug-in methods for kernel density bandwidth selection

Carlos Tenreiro
CMUC and DMUC, University of Coimbra

PhD Program UC UP, February 18, 2011
Overview

- The nonparametric problem
- The Parzen-Rosenblatt kernel density estimator
- Cross-validation and plug-in methods for bandwidth selection
- Combining cross-validation and plug-in methods (based on a recent joint work with J.E. Chacón, Universidad de Extremadura, Spain), in order to obtain a data-based selector that presents an overall good performance for a large set of underlying densities.
The nonparametric problem

We have observations X_1, X_2, ..., X_n: independent and identically distributed real-valued random variables with unknown probability density function f:

P(a < X < b) = ∫_a^b f(x) dx, for all −∞ < a < b < +∞.

We want to estimate f based on the previous observations. The goal in nonparametric estimation is to estimate f making only minimal assumptions about f.
Exploring data is one of the goals of nonparametric density estimation.

Hidalgo Stamp Data: thickness of 485 postage stamps that were printed over a long time in Mexico during the 19th century.

[Histogram of stamp thickness (in mm), bin width = 0.003]

The idea is to gain insight into the number of different types of paper that were used to print the postage stamps.
Kernel density estimation

In this talk we restrict our attention to another well-known density estimator, introduced by Rosenblatt (1956) and Parzen (1962): the kernel density estimator.

The motivation given by Rosenblatt (1956) for this estimator is based on the fact that f(x) = F'(x), where the cumulative distribution function F can be estimated by the empirical distribution function

F_n(x) = (1/n) Σ_{i=1}^n I_{]−∞, x]}(X_i), with I_A(y) = 1 if y ∈ A, and 0 if y ∉ A.
Kernel density estimation

If h is a small positive number we could expect that

f(x) ≈ (F(x + h) − F(x − h)) / (2h)
     ≈ (F_n(x + h) − F_n(x − h)) / (2h)
     = (1/n) Σ_{i=1}^n (1/(2h)) I_{]x−h, x+h]}(X_i)
     = (1/(nh)) Σ_{i=1}^n K_0((x − X_i)/h),

where K_0(u) = (1/2) I_{]−1, 1]}(u).
Kernel density estimation

The Parzen-Rosenblatt kernel estimator is obtained by replacing K_0 by a general symmetric density function K:

f_h(x) = (1/(nh)) Σ_{i=1}^n K((x − X_i)/h) = (1/n) Σ_{i=1}^n K_h(x − X_i),

where:
- h = h_n, the bandwidth or smoothing parameter, is a sequence of strictly positive real numbers converging to zero as n tends to infinity;
- K, the kernel, is a bounded and symmetric density function.

Contrary to the histogram, the Parzen-Rosenblatt estimator gives regular estimates for f if we take for K a regular density function.
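The estimator f_h above is straightforward to compute directly. A minimal sketch in NumPy, with the Gaussian kernel as default (the function name `kde` and its interface are illustrative, not from the talk):

```python
import numpy as np

def kde(x, data, h, kernel=None):
    """Parzen-Rosenblatt kernel density estimate f_h evaluated at x.

    x      : points where the estimate is evaluated (array-like)
    data   : observed sample X_1, ..., X_n
    h      : bandwidth (smoothing parameter), h > 0
    kernel : symmetric density K; defaults to the standard normal density
    """
    if kernel is None:
        kernel = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    # f_h(x) = (1/(n h)) * sum_i K((x - X_i) / h)
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).mean(axis=1) / h
```

Since K is a density, each estimate f_h integrates to one, whatever the bandwidth; the bandwidth only redistributes the mass more or less smoothly.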
Kernel density estimation

The choice of the bandwidth is a crucial issue for kernel density estimation.

Kernel estimates for the Hidalgo Stamp Data (n = 485), with K(x) = (2π)^{−1/2} e^{−x²/2}:

[Two kernel estimates of stamp thickness (in mm): h = 0.005 and h = 0.0005]
The role played by h

For the mean integrated squared error

MISE(f; n, h) = E ∫ {f_h(x) − f(x)}² dx,

we have

MISE(f; n, h) = ∫ Var f_h(x) dx + ∫ {E f_h(x) − f(x)}² dx
              ≈ (1/(nh)) ∫ K²(u) du + (h⁴/4) (∫ u² K(u) du)² ∫ f''(x)² dx.

If h is too small we obtain an estimator with a small bias but a large variability. If h is too large we obtain an estimator with a large bias but a small variability.
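The asymptotic approximation above can be checked numerically. For K Gaussian we have ∫K² = 1/(2√π) and ∫u²K(u)du = 1, and for f = N(0, 1) one can compute ∫f''² = 3/(8√π); minimising the approximation over h should then recover the closed-form minimiser c_K ψ_4^{−1/5} n^{−1/5} = (4/3)^{1/5} n^{−1/5} introduced later in the talk. A sketch (the grid bounds are an arbitrary choice):

```python
import numpy as np

n = 1000
RK = 1.0 / (2.0 * np.sqrt(np.pi))     # R(K) = ∫ K² for the Gaussian kernel
mu2 = 1.0                             # ∫ u² K(u) du for the Gaussian kernel
Rf2 = 3.0 / (8.0 * np.sqrt(np.pi))    # ∫ f''(x)² dx for f = N(0, 1)

def amise(h):
    # variance term + squared-bias term of the asymptotic MISE approximation
    return RK / (n * h) + (h**4 / 4.0) * mu2**2 * Rf2

hs = np.linspace(0.05, 1.0, 20000)
h_min = hs[np.argmin(amise(hs))]
h_closed = (RK / (mu2**2 * Rf2 * n)) ** 0.2   # = (4/3)^{1/5} n^{-1/5}
```

The grid minimiser agrees with the closed-form expression, which for n = 1000 gives h ≈ 0.27.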
The role played by h

Choosing the bandwidth corresponds to balancing bias and variance.

[Kernel estimate with n = 1000 and h = 0.1: undersmoothing, i.e. small h gives a small bias but a large variability]
The role played by h

Choosing the bandwidth corresponds to balancing bias and variance.

[Kernel estimate with n = 1000 and h = 0.5: oversmoothing, i.e. large h gives a small variability but a large bias]
The role played by h

The main challenge in smoothing is to determine how much smoothing to do.

[Kernel estimate with n = 1000 and h = 0.36: almost right]

The choice of the kernel is not so relevant for the behaviour of the estimator.
Bandwidth selectors

Under general conditions on the kernel and on the underlying density function, for each n ∈ N there exists an optimal bandwidth h_MISE = h_MISE(n; f) in the sense that

MISE(f; n, h_MISE) ≤ MISE(f; n, h), for all h > 0.

But h_MISE depends on the unknown density f...

We are interested in methods for choosing h that are based on the observations X_1, ..., X_n,

ĥ = ĥ(X_1, ..., X_n),

and satisfy ĥ(X_1, ..., X_n)/h_MISE → 1, for a large class of densities.
Cross-validation bandwidth selection

For each h > 0, we start by considering an unbiased estimator of MISE(f; n, h) − R(f), given by

CV(h) = R(K)/(nh) + (1/(n(n−1))) Σ_{i≠j} ((1 − 1/n) K_h ∗ K_h − 2K_h)(X_i − X_j),

and we take for h the value ĥ_CV that minimises CV(h). (Rudemo, 1982; Bowman, 1984)

Under some regularity conditions on f and K we have

ĥ_CV/h_MISE − 1 = O_p(n^{−1/10}).

(Hall, 1983; Hall & Marron, 1987)
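For the Gaussian kernel the criterion has a convenient closed form, since K_h ∗ K_h is the N(0, 2h²) density. A sketch of CV(h) and its minimisation over a grid (the function names, sample, and grid bounds are illustrative choices, not from the talk):

```python
import numpy as np

def gauss(x, s):
    """Centred normal density with standard deviation s."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def cv(h, data):
    """Least-squares cross-validation criterion CV(h), Gaussian kernel K.

    Uses the Gaussian identity K_h * K_h = N(0, 2h^2) density.
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    d = data[:, None] - data[None, :]
    d = d[~np.eye(n, dtype=bool)]            # keep only pairs with i != j
    rk = 1.0 / (2.0 * np.sqrt(np.pi))        # R(K) for the Gaussian kernel
    s = (((n - 1.0) / n) * gauss(d, h * np.sqrt(2.0)) - 2.0 * gauss(d, h)).sum()
    return rk / (n * h) + s / (n * (n - 1.0))

# h_cv minimises CV over a grid of candidate bandwidths
rng = np.random.default_rng(0)
data = rng.standard_normal(200)
hs = np.linspace(0.05, 1.5, 300)
h_cv = hs[np.argmin([cv(h, data) for h in hs])]
```

Note that CV(h) estimates MISE − R(f), so its minimum value is typically negative; only its minimiser matters.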
Plug-in bandwidth selection

The plug-in method is based on a simple idea that goes back to Woodroofe (1970). We start with an asymptotic approximation h_0 for the optimal bandwidth h_MISE:

h_0 = c_K ψ_4^{−1/5} n^{−1/5},

where

c_K = R(K)^{1/5} (∫ u² K(u) du)^{−2/5}

and

ψ_r = ∫ f^{(r)}(x) f(x) dx, r = 0, 2, 4, ...

The plug-in selector is obtained by replacing the unknown quantities in h_0 by consistent estimators:

ĥ_PI = c_K ψ̂_4^{−1/5} n^{−1/5}.
Estimating the functional ψ_r

A class of kernel estimators of ψ_r was introduced by Hall and Marron (1987a, 1991) and Jones and Sheather (1991):

ψ̂_r(g) = (1/n²) Σ_{i,j=1}^n U_g^{(r)}(X_i − X_j),

where g is a new bandwidth and U is a bounded, symmetric and r-times differentiable kernel.

For U = φ, the bandwidth that minimises the asymptotic mean squared error of ψ̂_r(g) is given by

g_{0,r} = (−2 φ^{(r)}(0) ψ_{r+2}^{−1})^{1/(r+3)} n^{−1/(r+3)}.

In the case of the estimation of ψ_4, this bandwidth depends (again) on the unknown quantity ψ_6!
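The estimator ψ̂_r(g) with U = φ can be implemented compactly using the identity φ^{(r)}(x) = (−1)^r He_r(x) φ(x), where He_r is the probabilists' Hermite polynomial. A sketch (function names are illustrative; the scaled kernel is φ_g^{(r)}(u) = g^{−(r+1)} φ^{(r)}(u/g)):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def phi_deriv(x, r):
    """r-th derivative of the standard normal density:
    phi^{(r)}(x) = (-1)^r He_r(x) phi(x)."""
    coef = np.zeros(r + 1)
    coef[r] = 1.0                             # selects He_r in hermeval
    phi = np.exp(-0.5 * np.asarray(x, dtype=float) ** 2) / np.sqrt(2.0 * np.pi)
    return (-1.0) ** r * hermeval(x, coef) * phi

def psi_hat(r, g, data):
    """Kernel estimator psi_hat_r(g) = n^-2 sum_{i,j} phi_g^{(r)}(X_i - X_j)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = (data[:, None] - data[None, :]) / g
    return phi_deriv(d, r).sum() / (n**2 * g ** (r + 1))
```

As a sanity check, φ^{(4)}(0) = 3/√(2π) and φ^{(6)}(0) = −15/√(2π), and for a standard normal sample ψ̂_4 should be close to ψ_4 = 3/(8√π) ≈ 0.21 for a reasonable pilot bandwidth g.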
Multistage plug-in estimation of ψ_r

In order to estimate ψ_4 we then have the following scheme:

to estimate      | we consider                          | we need
ψ_4              | ψ̂_4(ĝ_{0,4})                         | ψ_6
ψ_6              | ψ̂_6(ĝ_{0,6})                         | ψ_8
ψ_8              | ψ̂_8(ĝ_{0,8})                         | ψ_10
...              | ...                                  | ...
ψ_{4+2(l−1)}     | ψ̂_{4+2(l−1)}(ĝ_{0,4+2(l−1)})         | ψ_{4+2l}

where ĝ_{0,r} = (−2 φ^{(r)}(0) ψ̂_{r+2}^{−1})^{1/(r+3)} n^{−1/(r+3)}.
Multistage plug-in estimation of ψ_4

The usual strategy to stop this cyclic process is to use a parametric estimator of ψ_{4+2l} based on some parametric reference distribution family.

The standard choice for the reference distribution family is the normal or Gaussian family: f(x) = (2πσ²)^{−1/2} e^{−x²/(2σ²)}. In this case, ψ_{4+2l} is estimated by

ψ̂^NR_{4+2l} = φ^{(4+2l)}(0) (2σ̂²)^{−(5+2l)/2},

where σ̂ denotes any scale estimate.
Multistage plug-in estimation of ψ_4

For a fixed l ∈ {1, 2, ...}, the l-stage estimation of ψ_4 proceeds as:

ψ̂^NR_{4+2l} → ĝ_{0,4+2(l−1)} → ψ̂_{4+2(l−1)} → ĝ_{0,4+2(l−2)} → ψ̂_{4+2(l−2)} → ... → ĝ_{0,4} → ψ̂_4 =: ψ̂_{4,l}
Multistage plug-in selector

Depending on the number l ∈ {1, 2, ...} of considered pilot stages of estimation we get different estimators ψ̂_{4,l} of ψ_4. The associated l-stage plug-in selector for the kernel density estimator is given by

ĥ_PI,l = c_K ψ̂_{4,l}^{−1/5} n^{−1/5}.

If f has bounded derivatives up to order 4 + 2l then

ĥ_PI,l/h_MISE − 1 = O_p(n^{−α}),

with α = 2/7 for l = 1 and α = 5/14 for all l ≥ 2. (CT, 2003)
Two-stage plug-in selector

For the standard choice l = 2 we have:

ψ̂^NR_8 → ĝ_{0,6} → ψ̂_6 → ĝ_{0,4} → ψ̂_4 =: ψ̂_{4,2}

The associated two-stage plug-in selector is given by

ĥ_PI,2 = c_K ψ̂_{4,2}^{−1/5} n^{−1/5}.
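The two-stage chain above can be sketched end to end for the Gaussian kernel. This is a minimal illustration, not the authors' implementation: the function name `h_pi2` is invented, the sample standard deviation is used as the scale estimate σ̂ (robust scale estimates are common in practice), and the relevant constants are φ^{(4)}(0) = 3/√(2π), φ^{(6)}(0) = −15/√(2π), φ^{(8)}(0) = 105/√(2π), and c_K = (1/(2√π))^{1/5} since ∫u²K = 1 for Gaussian K.

```python
import numpy as np

SQRT2PI = np.sqrt(2.0 * np.pi)

def phi4(x):  # phi^(4)(x) = (x^4 - 6x^2 + 3) phi(x)
    return (x**4 - 6.0 * x**2 + 3.0) * np.exp(-0.5 * x**2) / SQRT2PI

def phi6(x):  # phi^(6)(x) = (x^6 - 15x^4 + 45x^2 - 15) phi(x)
    return (x**6 - 15.0 * x**4 + 45.0 * x**2 - 15.0) * np.exp(-0.5 * x**2) / SQRT2PI

def h_pi2(data):
    """Two-stage plug-in bandwidth h_PI,2 for the Gaussian kernel (sketch)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = data[:, None] - data[None, :]
    sigma = data.std(ddof=1)                      # scale estimate (simplified)
    # Stage 0: normal reference psi_8 = phi^(8)(0) (2 sigma^2)^(-9/2)
    psi8 = (105.0 / SQRT2PI) * (2.0 * sigma**2) ** (-4.5)
    # Stage 1: psi_6 with pilot g_{0,6} = (-2 phi^(6)(0) / (psi_8 n))^(1/9)
    g6 = (2.0 * (15.0 / SQRT2PI) / (psi8 * n)) ** (1.0 / 9.0)
    psi6 = phi6(d / g6).sum() / (n**2 * g6**7)    # negative for smooth f
    # Stage 2: psi_4 with pilot g_{0,4} = (-2 phi^(4)(0) / (psi_6 n))^(1/7)
    g4 = (-2.0 * (3.0 / SQRT2PI) / (psi6 * n)) ** (1.0 / 7.0)
    psi4 = phi4(d / g4).sum() / (n**2 * g4**5)
    # h = c_K psi_4^(-1/5) n^(-1/5), with c_K = (1/(2 sqrt(pi)))^(1/5)
    ck = (1.0 / (2.0 * np.sqrt(np.pi))) ** 0.2
    return ck * psi4 ** (-0.2) * n ** (-0.2)
```

For a standard normal sample of size n = 400 this gives a bandwidth near 0.3, close to the asymptotically optimal value.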
Finite sample behaviour

[Boxplots of ISE for the PI.2 and CV selectors, n = 400: standard normal density and skewed unimodal density]
Finite sample behaviour

[Boxplots of ISE for the PI.2 and CV selectors, n = 400: strongly skewed density and asymmetric multimodal density]
Finite sample behaviour

[Boxplots of ISE for the l-stage plug-in selector with l = 1, 3, 5, ..., 19 pilot stages, n = 400: standard normal density and skewed unimodal density]
Finite sample behaviour

[Boxplots of ISE for the l-stage plug-in selector with l = 1, 3, 5, ..., 19 pilot stages, n = 400: strongly skewed density and asymmetric multimodal density]
Multistage plug-in selector

From a finite-sample point of view, the performance of ĥ_PI,l strongly depends on the considered number of stages. For strongly skewed or asymmetric multimodal densities the standard choice l = 2 gives poor results.

The natural question that arises from the previous considerations is: how can we choose the number of pilot stages l? This is an old question, posed by Park and Marron (1992).

In order to answer this question, the idea developed by Chacón and CT (2008) was to combine plug-in and cross-validation methods.
Combining plug-in and cross-validation procedures

We start by fixing a minimum and a maximum number of pilot stages, L̲ and L̄, and by choosing a stage l among the set of possible pilot stages

L = {L̲, L̲ + 1, ..., L̄}.

This is equivalent to selecting one of the bandwidths

ĥ_PI,l = c_K ψ̂_{4,l}^{−1/5} n^{−1/5}, l ∈ L.

Recall that each one of these bandwidths has good asymptotic properties.
Combining plug-in and cross-validation procedures

In order to select one of the previous multistage plug-in bandwidths, we consider a weighted version of the cross-validation criterion function given by

CV_γ(h) = R(K)/(nh) + (γ/(n(n−1))) Σ_{i≠j} ((1 − 1/n) K_h ∗ K_h − 2K_h)(X_i − X_j),

for some 0 < γ ≤ 1 that needs to be fixed by the user.

Finally, we take the bandwidth

ĥ_PI,l̂ = c_K ψ̂_{4,l̂}^{−1/5} n^{−1/5}, where l̂ = argmin_{l ∈ L} CV_γ(ĥ_PI,l).
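The whole combined procedure can be sketched as follows for the Gaussian kernel: compute the l-stage plug-in bandwidth for each l in the candidate set, then keep the one minimising CV_γ. This is an illustrative sketch only: the function names are invented, the sample standard deviation stands in for σ̂, and the code assumes the estimated functionals ψ̂_r keep their theoretical alternating signs (production code would guard against sign failures on unusual data).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

SQRT2PI = np.sqrt(2.0 * np.pi)

def phi_deriv(x, r):
    # phi^{(r)}(x) = (-1)^r He_r(x) phi(x), He_r probabilists' Hermite poly.
    c = np.zeros(r + 1)
    c[r] = 1.0
    phi = np.exp(-0.5 * np.asarray(x, dtype=float) ** 2) / SQRT2PI
    return (-1.0) ** r * hermeval(x, c) * phi

def h_pi(data, l):
    """l-stage plug-in bandwidth h_PI,l (Gaussian kernel and pilot kernels)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = data[:, None] - data[None, :]
    sigma = data.std(ddof=1)
    # stage 0: normal-reference estimate of psi_{4+2l}
    r0 = 4 + 2 * l
    psi = phi_deriv(0.0, r0) * (2.0 * sigma**2) ** (-(r0 + 1) / 2.0)
    # stages 1..l: kernel estimates psi_r(g_{0,r}) for r = 4+2(l-1), ..., 4
    for r in range(4 + 2 * (l - 1), 2, -2):
        g = (-2.0 * phi_deriv(0.0, r) / (psi * n)) ** (1.0 / (r + 3))
        psi = phi_deriv(d / g, r).sum() / (n**2 * g ** (r + 1))
    ck = (1.0 / (2.0 * np.sqrt(np.pi))) ** 0.2   # c_K for the Gaussian kernel
    return ck * psi ** (-0.2) * n ** (-0.2)

def gauss(x, s):
    return np.exp(-0.5 * (x / s) ** 2) / (s * SQRT2PI)

def cv_gamma(h, data, gamma):
    """Weighted cross-validation criterion CV_gamma(h), Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = (data[:, None] - data[None, :])[~np.eye(n, dtype=bool)]
    rk = 1.0 / (2.0 * np.sqrt(np.pi))
    s = (((n - 1.0) / n) * gauss(d, h * np.sqrt(2.0)) - 2.0 * gauss(d, h)).sum()
    return rk / (n * h) + gamma * s / (n * (n - 1.0))

def h_combined(data, stages=range(1, 31), gamma=0.6):
    """Pick l_hat = argmin_l CV_gamma(h_PI,l) and return h_PI,l_hat."""
    hs = [h_pi(data, l) for l in stages]
    return min(hs, key=lambda h: cv_gamma(h, data, gamma))
```

The defaults L̄ = 30 and γ = 0.6 follow the recommendations discussed later in the talk; for smooth densities all the candidate bandwidths are similar and the choice of l̂ matters little.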
Asymptotic behaviour

If f has bounded derivatives up to order 4 + 2L̄ and

ψ_{4+2l} ≠ ψ^NR_{4+2l}(σ_f), for all l = 1, 2, ..., L̄,   (1)

then

ĥ_PI,l̂/h_MISE − 1 = O_p(n^{−α}),

with α = 2/7 for L̲ = 1 and α = 5/14 for L̲ ≥ 2. (Chacón & CT, 2008)

Condition (1) is not very restrictive due to the smoothness of the normal distribution. This result justifies the recommendation of using L̲ = 2.
Asymptotic behaviour

Proof: From the positive-definiteness property of the class of Gaussian-based kernels used in the multistage estimation process, one can prove that P(Ω_{L̲,L̄}) → 1, where

Ω_{L̲,L̄} = {ĥ_PI,L̄ ≤ ĥ_PI,L̄−1 ≤ ... ≤ ĥ_PI,L̲+1 ≤ ĥ_PI,L̲}.

The conclusion follows easily from the asymptotic behaviour of ĥ_PI,L̲ and ĥ_PI,L̄, since for a sample in Ω_{L̲,L̄} we have

ĥ_PI,L̄/h_MISE − 1 ≤ ĥ_PI,l̂/h_MISE − 1 ≤ ĥ_PI,L̲/h_MISE − 1.
On the role played by L̄ and γ

[Boxplots of the distribution of ISE(ĥ_PI,l̂) as a function of L̄ ∈ {10, 20, 30} and γ ∈ {0.3, 0.5, 0.7}, compared with PI.2 and CV, for the standard normal density and the claw density (n = 200)]
Choosing L̄ and γ in practice

The boxplots show that a larger value for L̄ is recommended, especially for hard-to-estimate densities. The new bandwidth ĥ_PI,l̂ is quite robust against the choice of L̄ whenever a sufficiently large value is taken for L̄. We decided to take L̄ = 30.

Regarding the choice of γ, small values of γ are more appropriate for easy-to-estimate densities, whereas large values of γ are more appropriate for hard-to-estimate densities. In order to find a compromise between these two situations we decided to take γ = 0.6.

We expect to obtain a new data-based selector that presents a good overall performance for a wide range of density features.
Finite sample behaviour

[Boxplots of ISE for the PI.2, New and CV selectors, n = 400: standard normal density and skewed unimodal density]
Finite sample behaviour

[Boxplots of ISE for the PI.2, New and CV selectors, n = 400: strongly skewed density and asymmetric multimodal density]
Kernel estimate for the Hidalgo Stamp Data

[Kernel estimate of stamp thickness (in mm) with h = 0.0013]

From this plot we can identify seven modes... seven different types of paper were (probably) used.
References

Izenman, A.J., Sommer, C.J. (1988). Philatelic mixtures and multimodal densities. J. Amer. Statist. Assoc. 83, 941-953.

Chacón, J.E., Tenreiro, C. (2008). A data-based method for choosing the number of pilot stages for plug-in bandwidth selection. Pré-publicação 08-59, DMUC (revised, January 2011).

Wand, M.P., Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall.