Nonparametric Density Estimation (Multidimensional)
Härdle, Müller, Sperlich, Werwatz (1995): Nonparametric and Semiparametric Models, An Introduction
Tine Buch-Kromann, February 19, 2007
Setup
Consider a d-dimensional data set with sample size n:
X_i = (X_{i1}, \ldots, X_{id})^T, \quad i = 1, \ldots, n
Goal: Estimate the density f of X = (X_1, \ldots, X_d)^T:
f(x) = f(x_1, \ldots, x_d)
Multivariate kernel density estimator
Kernel density estimator in d dimensions:
\hat f_h(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h^d} K\left(\frac{x - X_i}{h}\right) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h^d} K\left(\frac{x_1 - X_{i1}}{h}, \ldots, \frac{x_d - X_{id}}{h}\right)
where K is a multivariate kernel function with d arguments. Note: h is the same for each component.
Multivariate kernel density estimator
Extension to a bandwidth vector h = (h_1, \ldots, h_d)^T:
\hat f_h(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1 \cdots h_d} K\left(\frac{x_1 - X_{i1}}{h_1}, \ldots, \frac{x_d - X_{id}}{h_d}\right)
Kernel function
What form should the multidimensional kernel K(u) = K(u_1, \ldots, u_d) take?
Multiplicative kernel: K(u) = K(u_1) \cdots K(u_d), where K is a univariate kernel function. Then
\hat f_h(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1 \cdots h_d} K\left(\frac{x_1 - X_{i1}}{h_1}, \ldots, \frac{x_d - X_{id}}{h_d}\right) = \frac{1}{n} \sum_{i=1}^n \prod_{j=1}^d \frac{1}{h_j} K\left(\frac{x_j - X_{ij}}{h_j}\right)
Note: Observations contribute to the sum only inside the cube X_{i1} \in [x_1 - h_1, x_1 + h_1), \ldots, X_{id} \in [x_d - h_d, x_d + h_d).
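As a minimal sketch of the multiplicative-kernel estimator above (assuming NumPy; the function name and the choice of the Epanechnikov kernel as the univariate factor are mine, not from the text):

```python
import numpy as np

def product_kde(x, X, h):
    """Multiplicative-kernel density estimate at a single point x.

    x : (d,) evaluation point
    X : (n, d) data matrix
    h : (d,) bandwidth for each coordinate
    Uses the univariate Epanechnikov kernel K(u) = 0.75*(1 - u^2) on [-1, 1],
    so only observations inside the cube around x contribute.
    """
    u = (x - X) / h                                    # (n, d) scaled differences
    K = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    return np.mean(np.prod(K / h, axis=1))             # (1/n) sum of products

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))                      # n = 500 points, d = 2
fhat = product_kde(np.zeros(2), X, h=np.array([0.8, 0.8]))
# for comparison, the true standard bivariate normal density at 0 is 1/(2*pi)
```

Observations outside the cube get kernel value 0, as in the note above, so the estimate at a point far from all data is exactly 0.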
Kernel function
Spherical (radially symmetric) kernel:
K(u) \propto K(\|u\|) \quad \text{or} \quad K(u) = \frac{K(\|u\|)}{\int_{\mathbb{R}^d} K(\|u\|)\, du}
where \|u\| = \sqrt{u^T u}. (Exercise 3.13)
The multivariate Epanechnikov kernel (spherical):
K(u) \propto (1 - u^T u)\, 1(u^T u \le 1)
The multivariate Epanechnikov kernel (multiplicative):
K(u) = \left(\tfrac{3}{4}\right)^d (1 - u_1^2)\, 1(|u_1| \le 1) \cdots (1 - u_d^2)\, 1(|u_d| \le 1)
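The two Epanechnikov variants can be sketched side by side (assuming NumPy; the normalizing constant 2/π for the spherical kernel in d = 2 is my own computation from the integral of (1 − r²) over the unit disk, not a value given in the text):

```python
import numpy as np

def epanechnikov_spherical(u, c):
    """Spherical Epanechnikov: c * (1 - u'u) on the unit ball, 0 outside.
    c is the normalizing constant making it integrate to 1."""
    s = np.sum(u**2)
    return c * (1 - s) if s <= 1 else 0.0

def epanechnikov_product(u):
    """Multiplicative Epanechnikov: product of (3/4)(1 - u_j^2) on [-1, 1]^d."""
    if np.any(np.abs(u) > 1):
        return 0.0
    return np.prod(0.75 * (1 - u**2))

# In d = 2: integral of (1 - r^2) over the unit disk is pi/2, so c = 2/pi.
c2 = 2 / np.pi
v_sph = epanechnikov_spherical(np.array([0.0, 0.0]), c2)   # 2/pi
v_prd = epanechnikov_product(np.array([0.0, 0.0]))         # (3/4)^2
```

Note the two kernels have different supports (unit ball vs. unit cube) and different peak heights at the origin.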
Kernel function
[Figure] Epanechnikov kernel function, equal bandwidth in each direction: h = (h_1, h_2)^T = (1, 1)^T
[Figure] Epanechnikov kernel function, different bandwidth in each direction: h = (h_1, h_2)^T = (1, 0.5)^T
Multivariate kernel density estimator
The general form of the multivariate density estimator with a nonsingular bandwidth matrix H:
\hat f_H(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{\det(H)} K\left(H^{-1}(x - X_i)\right) = \frac{1}{n} \sum_{i=1}^n K_H(x - X_i)
where K_H(\bullet) = \frac{1}{\det(H)} K(H^{-1} \bullet).
Multivariate kernel density estimator
The bandwidth matrix includes all simpler cases:
Equal bandwidth h: H = h I_d, where I_d is the d \times d identity matrix.
Different bandwidths h_1, \ldots, h_d: H = \mathrm{diag}(h_1, \ldots, h_d).
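A minimal sketch of the general bandwidth-matrix estimator (assuming NumPy; function names and the Gaussian kernel choice are mine):

```python
import numpy as np

def kde_H(x, X, H, K):
    """General estimator f_H(x) = (1/n) * sum_i K_H(x - X_i),
    with K_H(u) = K(H^{-1} u) / det(H) for a nonsingular bandwidth matrix H."""
    Hinv = np.linalg.inv(H)
    U = (x - X) @ Hinv.T                     # rows are H^{-1}(x - X_i)
    Kvals = np.array([K(u) for u in U])
    return Kvals.mean() / np.linalg.det(H)

def gauss_kernel(u):
    """Multivariate standard Gaussian kernel N_d(0, I)."""
    d = u.size
    return np.exp(-0.5 * u @ u) / (2 * np.pi) ** (d / 2)

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 2))
H = np.array([[1.0, 0.5],
              [0.5, 1.0]])                   # off-diagonal entries allowed
fhat = kde_H(np.zeros(2), X, H, gauss_kernel)
```

With H = h * np.eye(2) or np.diag([h1, h2]) the same function reproduces the two simpler cases above.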
Multivariate kernel density estimator
What effect do the off-diagonal elements have?
Rule of thumb: Use a bandwidth matrix proportional to \hat\Sigma^{1/2}, where \hat\Sigma is the covariance matrix of the data. Such a bandwidth corresponds to transforming the data so that they have an identity covariance matrix, i.e. we can use bandwidth matrices to adjust for correlation between the components.
Kernel function
Epanechnikov kernel function. Bandwidth matrix:
H = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}
Properties of the kernel function
K is a density function: \int_{\mathbb{R}^d} K(u)\, du = 1 and K(u) \ge 0
K is symmetric: \int_{\mathbb{R}^d} u\, K(u)\, du = 0_d
K has a second moment (matrix): \int_{\mathbb{R}^d} u u^T K(u)\, du = \mu_2(K) I_d, where I_d denotes the d \times d identity matrix
K has a kernel norm: \|K\|_2^2 = \int K^2(u)\, du
Properties of the kernel function
Since K is a density function, \hat f_H is also a density function:
\int \hat f_H(x)\, dx = 1
The estimator is consistent at every point x:
\hat f_H(x) = \frac{1}{n} \sum_{i=1}^n K_H(X_i - x) \xrightarrow{P} f(x)
Statistical properties
Bias:
E[\hat f_H(x)] - f(x) \approx \frac{1}{2} \mu_2(K)\, \mathrm{tr}\{H^T \mathcal{H}_f(x) H\}
Variance:
V[\hat f_H(x)] \approx \frac{1}{n \det(H)} \|K\|_2^2\, f(x)
AMISE:
\mathrm{AMISE}(H) = \frac{1}{4} \mu_2^2(K) \int \left[\mathrm{tr}\{H^T \mathcal{H}_f(x) H\}\right]^2 dx + \frac{1}{n \det(H)} \|K\|_2^2
where \mathcal{H}_f is the Hessian matrix of f and \|K\|_2^2 is the d-dimensional squared L_2-norm of K.
Special case
Univariate case: For d = 1 we obtain H = h, K = K, \mathcal{H}_f(x) = f''(x).
Bias:
E[\hat f_H(x)] - f(x) \approx \frac{1}{2} \mu_2(K)\, \mathrm{tr}\{H^T \mathcal{H}_f(x) H\} = \frac{1}{2} \mu_2(K)\, h^2 f''(x)
Variance:
V[\hat f_H(x)] \approx \frac{1}{n \det(H)} \|K\|_2^2\, f(x) = \frac{1}{nh} \|K\|_2^2\, f(x)
Bandwidth selection
AMISE-optimal bandwidth: We have a bias-variance trade-off, which is resolved by the AMISE-optimal bandwidth. Let h be a scalar, H = h H_0 with \det(H_0) = 1. Then
\mathrm{AMISE}(H) = \frac{1}{4} h^4 \mu_2^2(K) \int \left[\mathrm{tr}\{H_0^T \mathcal{H}_f(x) H_0\}\right]^2 dx + \frac{1}{n h^d} \|K\|_2^2
The optimal bandwidth and the optimal AMISE are
h_{\mathrm{opt}} \sim n^{-1/(4+d)}, \quad \mathrm{AMISE}(h_{\mathrm{opt}} H_0) \sim n^{-4/(4+d)}
Note: The multivariate density estimator has a slower rate of convergence than the univariate one. For H = h I_d and fixed sample size n, the AMISE-optimal bandwidth is larger in higher dimensions.
Bandwidth selection
Bandwidth selection methods:
Plug-in method (rule of thumb, generalized Silverman rule of thumb)
Cross-validation method
Bandwidth selection
Plug-in method. Idea: Optimize AMISE under the assumption that f is a multivariate normal density N_d(\mu, \Sigma) and K is a multivariate Gaussian kernel, i.e. N_d(0, I). Then
\mu_2(K) = 1, \quad \|K\|_2^2 = 2^{-d} \pi^{-d/2}
and
\int \mathrm{tr}\{H^T \mathcal{H}_f(x) H\}^2 dx = \frac{1}{2^{d+2} \pi^{d/2} \det(\Sigma)^{1/2}} \left[2\, \mathrm{tr}(H^T \Sigma^{-1} H)^2 + \{\mathrm{tr}(H^T \Sigma^{-1} H)\}^2\right]
Bandwidth selection
Simple case: H = \mathrm{diag}(h_1, \ldots, h_d) and \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2); then
h_j = \underbrace{\left(\frac{4}{d+2}\right)^{1/(d+4)}}_{C} n^{-1/(d+4)} \sigma_j
Silverman's rule of thumb (d = 1):
\hat h_{\mathrm{rot}} = \left(\frac{4 \hat\sigma^5}{3n}\right)^{1/5}
Bandwidth selection
Replace \sigma_j with \hat\sigma_j and note that C always lies between 0.924 (d = 11) and 1.059 (d = 1). Scott's rule:
\hat h_j = n^{-1/(d+4)} \hat\sigma_j
It is not possible to derive the rule of thumb in the general case, but it might be a good idea to choose the bandwidth matrix proportional to the covariance matrix. Generalization of Scott's rule:
\hat H = n^{-1/(d+4)} \hat\Sigma^{1/2}
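The rule-of-thumb bandwidths above can be sketched directly (assuming NumPy; function names are mine, and the matrix square root via eigendecomposition is one standard way to compute \hat\Sigma^{1/2}):

```python
import numpy as np

def silverman_rot(x):
    """Silverman's rule of thumb for d = 1: h = (4*sigma^5 / (3n))^(1/5)."""
    n = len(x)
    return (4 * np.std(x, ddof=1) ** 5 / (3 * n)) ** 0.2

def scott_diagonal(X):
    """Scott's rule h_j = n^(-1/(d+4)) * sigma_j for a diagonal bandwidth."""
    n, d = X.shape
    return n ** (-1 / (d + 4)) * np.std(X, axis=0, ddof=1)

def scott_matrix(X):
    """Generalized Scott rule H = n^(-1/(d+4)) * Sigma^(1/2)."""
    n, d = X.shape
    S = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)           # Sigma^(1/2) via eigendecomposition
    root = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    return n ** (-1 / (d + 4)) * root

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 2))
h1 = silverman_rot(X[:, 0])                  # univariate rule on one coordinate
h_diag = scott_diagonal(X)                   # roughly 1000^(-1/6) per coordinate
H = scott_matrix(X)                          # full bandwidth matrix
```

For uncorrelated data scott_matrix and scott_diagonal nearly coincide; the matrix version additionally adjusts for correlation between components.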
Bandwidth selection
Cross-validation:
\mathrm{ISE}(H) = \int \left(\hat f_H(x) - f(x)\right)^2 dx = \underbrace{\int \hat f_H^2(x)\, dx}_{\text{calculate from data}} - 2 \underbrace{\int \hat f_H(x) f(x)\, dx}_{= E[\hat f_H(X)]} + \underbrace{\int f^2(x)\, dx}_{\text{ignore}}
Estimate of the expectation:
E[\hat f_H(X)] \approx \frac{1}{n} \sum_{i=1}^n \hat f_{H,-i}(X_i)
where the multivariate version of the leave-one-out estimator is
\hat f_{H,-i}(x) = \frac{1}{n-1} \sum_{j=1, j \ne i}^n K_H(X_j - x)
Bandwidth selection
Multivariate cross-validation criterion:
\mathrm{CV}(H) = \frac{1}{n^2 \det(H)} \sum_{i=1}^n \sum_{j=1}^n (K \star K)\left\{H^{-1}(X_j - X_i)\right\} - \frac{2}{n(n-1)} \sum_{i=1}^n \sum_{j=1, j \ne i}^n K_H(X_j - X_i)
where K \star K denotes the convolution of the kernel with itself.
Note: The bandwidth is a d \times d matrix H, which means we have to minimize over d(d+1)/2 parameters. Even if H is a diagonal matrix, we have a d-dimensional optimization problem.
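For the Gaussian kernel the convolution K ⋆ K is available in closed form (it is the N(0, 2I) density), so the CV criterion can be sketched as follows (assuming NumPy; the function name and the grid of trial bandwidths are mine):

```python
import numpy as np

def cv_criterion(X, H):
    """Least-squares cross-validation score CV(H) for a Gaussian kernel.
    Uses K*K = N(0, 2I), whose density is (4*pi)^(-d/2) * exp(-u'u/4)."""
    n, d = X.shape
    Hinv = np.linalg.inv(H)
    detH = np.linalg.det(H)
    D = X[:, None, :] - X[None, :, :]            # (n, n, d) pairwise differences
    U = D @ Hinv.T                               # H^{-1}(X_i - X_j)
    sq = np.sum(U**2, axis=-1)                   # squared norms
    KK = np.exp(-sq / 4) / (4 * np.pi) ** (d / 2)   # (K*K)(u), all pairs i, j
    K = np.exp(-sq / 2) / (2 * np.pi) ** (d / 2)    # K(u)
    term1 = KK.sum() / (n**2 * detH)
    offdiag = K.sum() - np.trace(K)              # exclude the i = j terms
    term2 = 2 * offdiag / (n * (n - 1) * detH)
    return term1 - term2

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))
scores = {h: cv_criterion(X, h * np.eye(2)) for h in (0.2, 0.5, 1.0)}
```

Minimizing this score over H (here restricted to H = h I_2 for illustration) selects the bandwidth; the score can be negative because the \int f^2 term is ignored.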
Canonical bandwidths
The canonical bandwidth of kernel K^j:
\delta^j = \left\{\frac{\|K^j\|_2^2}{\mu_2(K^j)^2}\right\}^{1/(d+4)}
Therefore
\mathrm{AMISE}(H_j, K^j) = \mathrm{AMISE}(H_i, K^i), \quad \text{where} \quad H_i = \frac{\delta^i}{\delta^j} H_j
Canonical bandwidths
Example: Adjusting from the Gaussian to the quartic product kernel

d    δ_G      δ_Q      δ_Q / δ_G
1    0.7764   2.0362   2.6226
2    0.6558   1.7100   2.6073
3    0.5814   1.5095   2.5964
4    0.5311   1.3747   2.5883
5    0.4951   1.2783   2.5820
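The table entries can be reproduced from the canonical-bandwidth formula (assuming NumPy; the one-dimensional kernel constants ||K||² = 1/(2√π), μ₂ = 1 for the Gaussian and ||K||² = 5/7, μ₂ = 1/7 for the quartic are standard values, not given on this slide):

```python
import numpy as np

def canonical_delta(norm2_1d, mu2, d):
    """delta = (||K||_2^2 / mu_2(K)^2)^(1/(d+4)) for a d-dimensional product
    kernel whose univariate factor has squared L2-norm norm2_1d and second
    moment mu2; for a product kernel ||K||_2^2 = norm2_1d^d."""
    return (norm2_1d ** d / mu2 ** 2) ** (1 / (d + 4))

GAUSS = (1 / (2 * np.sqrt(np.pi)), 1.0)      # (||K||_2^2, mu_2) for Gaussian
QUARTIC = (5 / 7, 1 / 7)                     # (||K||_2^2, mu_2) for quartic

for d in range(1, 6):
    dG = canonical_delta(*GAUSS, d)
    dQ = canonical_delta(*QUARTIC, d)
    print(d, round(dG, 4), round(dQ, 4), round(dQ / dG, 4))
```

Running the loop reproduces the tabulated δ_G, δ_Q, and ratio columns to four decimals.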
Graphical representation
Example: Two dimensions. East-West German migration intention in spring 1991.
Explanatory variables: age and household income.
Two-dimensional nonparametric density estimate \hat f_h(x) = \hat f_h(x_1, x_2), where the bandwidth matrix is H = \mathrm{diag}(h).
Graphical representation Contour plot
Graphical representation
Example: Three dimensions. How can we display three- or even higher-dimensional density estimates? Hold one variable fixed and plot the density as a function of the other variables. For three dimensions we have
x_1, x_2 vs. \hat f_h(x_1, x_2, x_3)
x_1, x_3 vs. \hat f_h(x_1, x_2, x_3)
x_2, x_3 vs. \hat f_h(x_1, x_2, x_3)
Graphical representation Example: Three-dimensions Credit scoring sample. Explanatory variables: Duration of the credit, household income and age.
Graphical representation Contour plot