Fast optimal bandwidth selection for kernel density estimation


Fast optimal bandwidth selection for kernel density estimation

Vikas Chandrakant Raykar and Ramani Duraiswami
Dept. of Computer Science and UMIACS, University of Maryland, College Park
{vikas,ramani}@cs.umd.edu

Abstract

We propose a computationally efficient $\epsilon$-exact approximation algorithm for univariate Gaussian kernel based density derivative estimation that reduces the computational complexity from $O(MN)$ to linear $O(N+M)$. We apply the procedure to estimate the optimal bandwidth for kernel density estimation. We demonstrate the speedup achieved on this problem using the solve-the-equation plug-in method, and on exploratory projection pursuit techniques.

1 Introduction

Kernel density estimation techniques [10] are widely used in various inference procedures in machine learning, data mining, pattern recognition, and computer vision. Efficient use of these methods requires the optimal selection of the bandwidth of the kernel. A series of techniques have been proposed for data-driven bandwidth selection [4]. The most successful state-of-the-art methods rely on the estimation of general integrated squared density derivative functionals. This is the most computationally intensive task, with $O(N^2)$ cost, in addition to the $O(N^2)$ cost of computing the kernel density estimate. The core task is to efficiently compute an estimate of the density derivative. The currently most practically successful approach, the solve-the-equation plug-in method [9], involves the numerical solution of a nonlinear equation. Iterative methods to solve this equation involve repeated use of the density functional estimator for different bandwidths, which adds much to the computational burden. Estimation of density derivatives is also needed in various other applications, like the estimation of modes and inflexion points of densities [2] and the estimation of the derivatives of the projection index in projection pursuit algorithms [5].

2 Optimal bandwidth selection

A univariate random variable $X$ on $\mathbb{R}$ has a density $p$ if, for all Borel sets $A$ of $\mathbb{R}$, $\int_A p(x)\,dx = \Pr[x \in A]$. The task of density estimation is to estimate $p$ from an i.i.d. sample $x_1, \ldots, x_N$ drawn from $p$. The estimate $\hat{p} : \mathbb{R} \times (\mathbb{R})^N \to \mathbb{R}$ is called the density estimate. The most popular non-parametric method for density estimation is the kernel density estimator (KDE) [10]:

(2.1) $\hat{p}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right),$

where $K(u)$ is the kernel function and $h$ is the bandwidth. The kernel $K(u)$ is required to satisfy the following two conditions:

(2.2) $K(u) \ge 0 \quad \text{and} \quad \int_{\mathbb{R}} K(u)\,du = 1.$

The most widely used kernel is the Gaussian of zero mean and unit variance, in which case the KDE can be written as

(2.3) $\hat{p}(x) = \frac{1}{N\sqrt{2\pi h^2}} \sum_{i=1}^{N} e^{-(x - x_i)^2/2h^2}.$

The computational cost of evaluating Eq. 2.3 at $N$ points is $O(N^2)$, making it prohibitively expensive. The Fast Gauss Transform (FGT) [3] is an approximation algorithm that reduces the computational complexity to $O(N)$, at the expense of reduced precision. Yang et al. [11] presented an extension, the improved fast Gauss transform (IFGT), that scales well with dimension. The main contribution of the current paper is the extension of the IFGT to accelerate the kernel density derivative estimate, and to solve the optimal bandwidth problem.

The integrated squared error (ISE) between the estimate $\hat{p}(x)$ and the actual density $p(x)$ is given by $\mathrm{ISE}(\hat{p}, p) = \int_{\mathbb{R}} [\hat{p}(x) - p(x)]^2\,dx$. The ISE depends on a particular realization of the $N$ points; it can be averaged over these realizations to get the mean integrated squared error (MISE). An asymptotic large-sample approximation for the MISE, the AMISE, is usually derived via Taylor series (the "A" here is for asymptotic). Based on certain assumptions, the AMISE between the actual density and the estimate can be shown to be

(2.4) $\mathrm{AMISE}(\hat{p}, p) = \frac{R(K)}{Nh} + \frac{h^4}{4}\, \mu_2(K)^2 R(p''),$

where $R(g) = \int_{\mathbb{R}} g(x)^2\,dx$, $\mu_2(g) = \int_{\mathbb{R}} x^2 g(x)\,dx$, and $p''$ is the second derivative of the density $p$. The first term in Eq. 2.4 is the integrated variance and the second term is the integrated squared bias. The bias is proportional to $h^4$ whereas the variance is proportional to $1/Nh$, which leads to the well-known bias-variance tradeoff. Based on the AMISE expression, the optimal bandwidth $h_{\mathrm{AMISE}}$ can be obtained by differentiating Eq. 2.4 w.r.t. the bandwidth $h$ and setting the derivative to zero [10]:

(2.5) $h_{\mathrm{AMISE}} = \left[\frac{R(K)}{\mu_2(K)^2 R(p'') N}\right]^{1/5}.$
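As a concrete reference point, the direct evaluation of the Gaussian KDE in Eq. 2.3 and the Gaussian-kernel specialization of Eq. 2.5 fit in a few lines. This is a minimal sketch in Python with illustrative names, not the paper's implementation; it is the quadratic-cost baseline that the fast method replaces.

```python
import numpy as np

def kde_gauss_direct(x_eval, x_data, h):
    """Direct O(N*M) Gaussian KDE of Eq. 2.3 at M evaluation points."""
    d2 = (x_eval[:, None] - x_data[None, :]) ** 2      # M x N squared distances
    return np.exp(-d2 / (2.0 * h**2)).sum(axis=1) / (len(x_data) * np.sqrt(2.0 * np.pi) * h)

def h_amise_gauss(R_pdd, N):
    """Eq. 2.5 for the Gaussian kernel, where R(K) = 1/(2 sqrt(pi)) and
    mu_2(K) = 1; R_pdd is the (unknown) functional R(p'')."""
    return (1.0 / (2.0 * np.sqrt(np.pi) * R_pdd * N)) ** 0.2
```

Plugging in $R(p'') = 3/(8\sqrt{\pi}\sigma^5)$, its value for a normal density with scale $\sigma$, recovers the familiar normal-reference rule $h \approx 1.06\,\sigma N^{-1/5}$; the whole difficulty of bandwidth selection is that $R(p'')$ is unknown for real data.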

However, the expression in Eq. 2.5 cannot be used directly, since $R(p'')$ depends on the second derivative of the density $p$. In order to estimate $R(p'')$ we will need an estimate of the density derivative. A simple estimator for the density derivative can be obtained by taking the derivative of the KDE $\hat{p}(x)$ defined earlier [1]. The $r$-th density derivative estimate $\hat{p}^{(r)}(x)$ can be written as

(2.6) $\hat{p}^{(r)}(x) = \frac{1}{Nh^{r+1}} \sum_{i=1}^{N} K^{(r)}\left(\frac{x - x_i}{h}\right),$

where $K^{(r)}$ is the $r$-th derivative of the kernel $K$. The $r$-th derivative of the Gaussian kernel $K(u)$ is given by $K^{(r)}(u) = (-1)^r H_r(u) K(u)$, where $H_r(u)$ is the $r$-th Hermite polynomial. Hence the density derivative estimate can be written as

(2.7) $\hat{p}^{(r)}(x) = \frac{(-1)^r}{\sqrt{2\pi}\, N h^{r+1}} \sum_{i=1}^{N} H_r\left(\frac{x - x_i}{h}\right) e^{-(x - x_i)^2/2h^2}.$

The computational complexity of evaluating the $r$-th derivative of the density estimate due to $N$ points at $M$ target locations is $O(rNM)$. Based on a similar analysis, the optimal bandwidth $h^r_{\mathrm{AMISE}}$ for estimating the $r$-th density derivative can be shown to be [10]

(2.8) $h^r_{\mathrm{AMISE}} = \left[\frac{R(K^{(r)})(2r+1)}{\mu_2(K)^2 R(p^{(r+2)}) N}\right]^{1/(2r+5)}.$

3 Estimation of density functionals

Rather than requiring the actual density derivative, methods for automatic bandwidth selection require the estimation of what are known as density functionals. The general integrated squared density derivative functional is defined as $R(p^{(s)}) = \int_{\mathbb{R}} [p^{(s)}(x)]^2\,dx$. Using integration by parts, this can be written in the following form: $R(p^{(s)}) = (-1)^s \int_{\mathbb{R}} p^{(2s)}(x)\, p(x)\,dx$. More specifically, for even $s$ we are interested in estimating density functionals of the form

(3.9) $\Phi_r = \int_{\mathbb{R}} p^{(r)}(x)\, p(x)\,dx = E\left[p^{(r)}(X)\right].$

An estimator for $\Phi_r$ is

(3.10) $\hat{\Phi}_r = \frac{1}{N} \sum_{i=1}^{N} \hat{p}^{(r)}(x_i),$

where $\hat{p}^{(r)}(x_i)$ is the estimate of the $r$-th derivative of the density $p(x)$ at $x = x_i$. Using a kernel density derivative estimate for $\hat{p}^{(r)}(x_i)$ (Eq. 2.6) we have

(3.11) $\hat{\Phi}_r = \frac{1}{N^2 h^{r+1}} \sum_{i=1}^{N} \sum_{j=1}^{N} K^{(r)}\left(\frac{x_i - x_j}{h}\right).$

It should be noted that the computation of $\hat{\Phi}_r$ is $O(rN^2)$ and hence can be very expensive if a direct algorithm is used. The optimal bandwidth for estimating the density functional is given by [10]

(3.12) $g^r_{\mathrm{AMSE}} = \left[\frac{-2K^{(r)}(0)}{\mu_2(K)\, \Phi_{r+2}\, N}\right]^{1/(r+3)}.$

4 Solve-the-equation plug-in method

The most successful among all current bandwidth selection methods [4], both empirically and theoretically, is the solve-the-equation plug-in method. We use the following version, as described in [9]. The AMISE optimal bandwidth is the solution to the equation

(4.13) $h = \left[\frac{R(K)}{\mu_2(K)^2\, \hat{\Phi}_4[\gamma(h)]\, N}\right]^{1/5},$

where $\hat{\Phi}_4[\gamma(h)]$ is an estimate of $\Phi_4 = R(p'')$ using the pilot bandwidth $\gamma(h)$, which depends on $h$. The bandwidth is chosen such that it minimizes the asymptotic MSE for the estimation of $\Phi_4$, and is

(4.14) $g_{\mathrm{MSE}} = \left[\frac{-2K^{(4)}(0)}{\mu_2(K)\, \Phi_6\, N}\right]^{1/7}.$

Substituting for $N$ from Eq. 2.5, $g_{\mathrm{MSE}}$ can be written as a function of $h$ as follows:

(4.15) $g_{\mathrm{MSE}} = \left[\frac{-2K^{(4)}(0)\, \mu_2(K)\, \Phi_4}{R(K)\, \Phi_6}\right]^{1/7} h_{\mathrm{AMISE}}^{5/7}.$

This suggests that we set

(4.16) $\gamma(h) = \left[\frac{-2K^{(4)}(0)\, \mu_2(K)\, \hat{\Phi}_4(g_1)}{R(K)\, \hat{\Phi}_6(g_2)}\right]^{1/7} h^{5/7},$

where $\hat{\Phi}_4(g_1)$ and $\hat{\Phi}_6(g_2)$ are estimates of $\Phi_4$ and $\Phi_6$ using the bandwidths $g_1$ and $g_2$ respectively. The bandwidths $g_1$ and $g_2$ are chosen such that they minimize the asymptotic MSE:

(4.17) $g_1 = \left[\frac{-6}{\sqrt{2\pi}\, \hat{\Phi}_6\, N}\right]^{1/7}, \quad g_2 = \left[\frac{30}{\sqrt{2\pi}\, \hat{\Phi}_8\, N}\right]^{1/9},$

where $\hat{\Phi}_6$ and $\hat{\Phi}_8$ are estimators for $\Phi_6$ and $\Phi_8$ respectively. We could use a similar strategy for the estimation of $\Phi_6$ and $\Phi_8$, but the problem would continue indefinitely, since the optimal bandwidth for estimating $\Phi_r$ depends on $\Phi_{r+2}$. The usual strategy is to estimate a $\Phi_r$ at some stage using a quick and simple estimate of the bandwidth chosen with reference to a parametric family, usually a normal density. It has been observed that as the number of stages increases, the variance of the bandwidth increases. The most common choice is to use only two stages.
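Each stage of this recursion evaluates the kernel functional estimator $\hat{\Phi}_r$ of Eq. 3.11 at a fresh bandwidth, which is exactly the $O(rN^2)$ bottleneck the paper attacks. A minimal direct implementation for the Gaussian kernel is sketched below (illustrative naming, not the paper's code); it uses the probabilists' Hermite polynomials $He_r$, which play the role of $H_r$ in $K^{(r)}(u) = (-1)^r H_r(u) K(u)$.

```python
import numpy as np
from numpy.polynomial import hermite_e as He   # probabilists' Hermite polynomials

def phi_r_direct(x, r, h):
    """Direct O(r*N^2) estimate of the density functional Phi_r (Eq. 3.11)
    for the Gaussian kernel, using K^(r)(u) = (-1)^r He_r(u) K(u)."""
    N = len(x)
    u = (x[:, None] - x[None, :]) / h            # N x N scaled pairwise differences
    coef = np.zeros(r + 1)
    coef[r] = 1.0                                # coefficient vector selecting He_r
    Kr = (-1) ** r * He.hermeval(u, coef) * np.exp(-u**2 / 2.0) / np.sqrt(2.0 * np.pi)
    return Kr.sum() / (N**2 * h ** (r + 1))
```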

If $p$ is a normal density with variance $\sigma^2$, then for even $r$ we can compute $\Phi_r$ exactly [10]. An estimator of $\Phi_r$ will use an estimate $\hat{\sigma}^2$ of the variance. Based on this we can estimate $\Phi_6$ and $\Phi_8$ as

(4.18) $\hat{\Phi}_6 = \frac{-15}{16\sqrt{\pi}}\, \hat{\sigma}^{-7}, \quad \hat{\Phi}_8 = \frac{105}{32\sqrt{\pi}}\, \hat{\sigma}^{-9}.$

The two-stage solve-the-equation method using the Gaussian kernel can be summarized as follows (a sketch of the whole procedure appears after Section 5.1):

(1) Compute an estimate $\hat{\sigma}$ of the standard deviation.
(2) Estimate the density functionals $\Phi_6$ and $\Phi_8$ using the normal scale rule (Eq. 4.18).
(3) Estimate the density functionals $\Phi_4$ and $\Phi_6$ using Eq. 3.11 with the optimal bandwidths $g_1$ and $g_2$ (Eq. 4.17).
(4) The bandwidth is the solution to the nonlinear Eq. 4.13, which can be solved using any numerical routine like the Newton-Raphson method.

The main computational bottleneck is the estimation of the density functionals, which is $O(N^2)$.

5 Fast density derivative estimation

To estimate the density derivative at $M$ target points $\{y_j \in \mathbb{R}\}_{j=1}^{M}$, we need to evaluate sums of the form

(5.19) $G_r(y_j) = \sum_{i=1}^{N} q_i\, H_r\left(\frac{y_j - x_i}{h_2}\right) e^{-(y_j - x_i)^2/h_1^2}, \quad j = 1, \ldots, M,$

where $\{q_i \in \mathbb{R}\}_{i=1}^{N}$ will be referred to as the source weights, $h_1 \in \mathbb{R}^+$ is the bandwidth of the Gaussian, and $h_2 \in \mathbb{R}^+$ is the bandwidth of the Hermite polynomial. The computational complexity of evaluating Eq. 5.19 is $O(rNM)$. For any given $\epsilon > 0$, the algorithm computes an approximation $\hat{G}_r(y_j)$ such that

(5.20) $\left|\hat{G}_r(y_j) - G_r(y_j)\right| \le Q\epsilon,$

where $Q = \sum_{i=1}^{N} |q_i|$. We call $\hat{G}_r(y_j)$ an $\epsilon$-exact approximation to $G_r(y_j)$. We describe the algorithm briefly; more details can be found in [8]. For any point $x_* \in \mathbb{R}$, the Gaussian can be written as

(5.21) $e^{-(y_j - x_i)^2/h_1^2} = e^{-(x_i - x_*)^2/h_1^2}\, e^{-(y_j - x_*)^2/h_1^2}\, e^{2(x_i - x_*)(y_j - x_*)/h_1^2}.$

In Eq. 5.21, the source and target are entangled in the third exponential, $e^{2(x_i - x_*)(y_j - x_*)/h_1^2}$. This entanglement is separated using the Taylor series expansion

(5.22) $e^{2(x - x_*)(y - x_*)/h_1^2} = \sum_{k=0}^{p-1} \frac{2^k}{k!} \left(\frac{x - x_*}{h_1}\right)^k \left(\frac{y - x_*}{h_1}\right)^k + \text{error}.$

Using this, the Gaussian can now be factorized as

(5.23) $e^{-(y_j - x_i)^2/h_1^2} = \sum_{k=0}^{p-1} \frac{2^k}{k!} \left[e^{-(x_i - x_*)^2/h_1^2} \left(\frac{x_i - x_*}{h_1}\right)^k\right] \left[e^{-(y_j - x_*)^2/h_1^2} \left(\frac{y_j - x_*}{h_1}\right)^k\right] + \text{error}.$

The $r$-th Hermite polynomial can be factorized as

(5.24) $H_r\left(\frac{y_j - x_i}{h_2}\right) = \sum_{l=0}^{\lfloor r/2 \rfloor} \sum_{m=0}^{r-2l} a_{lm} \left(\frac{x_i - x_*}{h_2}\right)^m \left(\frac{y_j - x_*}{h_2}\right)^{r-2l-m},$

where

(5.25) $a_{lm} = \frac{(-1)^{l+m}\, r!}{2^l\, l!\, m!\, (r - 2l - m)!}.$

Using Eqs. 5.23 and 5.24, and ignoring the error terms, $G_r(y_j)$ can be approximated as

$\hat{G}_r(y_j) = \sum_{k=0}^{p-1} \sum_{l=0}^{\lfloor r/2 \rfloor} \sum_{m=0}^{r-2l} a_{lm} B_{km}\, e^{-(y_j - x_*)^2/h_1^2} \left(\frac{y_j - x_*}{h_1}\right)^k \left(\frac{y_j - x_*}{h_2}\right)^{r-2l-m},$

where

$B_{km} = \frac{2^k}{k!} \sum_{i=1}^{N} q_i\, e^{-(x_i - x_*)^2/h_1^2} \left(\frac{x_i - x_*}{h_1}\right)^k \left(\frac{x_i - x_*}{h_2}\right)^m.$

Thus far we have used the Taylor series expansion about a single point $x_*$. However, if we used the same $x_*$ for all the points we would typically require a very high truncation number $p$, since the Taylor series gives a good approximation only in a small open interval around $x_*$. We therefore uniformly sub-divide the space into $K$ intervals of length $2r_x$. The $N$ source points are assigned to $K$ clusters $S_n$, $n = 1, \ldots, K$, with $c_n$ being the center of each cluster. The aggregated coefficients are computed for each cluster and the total contribution from all the clusters is summed up. Since the Gaussian decays very rapidly, a further speedup is achieved by ignoring all the sources belonging to a cluster if the cluster is farther than a certain distance from the target point, i.e., if $|y_j - c_n| > r_y$. Substituting $x_* = c_n$, the final algorithm can be written as

(5.26) $\hat{G}_r(y_j) = \sum_{|y_j - c_n| \le r_y} \sum_{k=0}^{p-1} \sum_{l=0}^{\lfloor r/2 \rfloor} \sum_{m=0}^{r-2l} a_{lm} B^n_{km}\, e^{-(y_j - c_n)^2/h_1^2} \left(\frac{y_j - c_n}{h_1}\right)^k \left(\frac{y_j - c_n}{h_2}\right)^{r-2l-m},$

$B^n_{km} = \frac{2^k}{k!} \sum_{x_i \in S_n} q_i\, e^{-(x_i - c_n)^2/h_1^2} \left(\frac{x_i - c_n}{h_1}\right)^k \left(\frac{x_i - c_n}{h_2}\right)^m.$

5.1 Computational and space complexity

Computing the coefficients $B^n_{km}$ for all the clusters is $O(prN)$. Evaluating $\hat{G}_r(y_j)$ at $M$ points is $O(npr^2 M)$, where $n$ is the maximum number of neighboring clusters which influence $y_j$. Hence the total computational complexity is $O(prN + npr^2 M)$. For each cluster we need to store all the $pr$ coefficients, so the storage needed is $O(prK + N + M)$.
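For testing, the direct $O(rNM)$ evaluation of Eq. 5.19 that the fast algorithm approximates is itself only a few lines (again a sketch with our naming), and it doubles as an oracle for checking the $\epsilon$-exact guarantee of Eq. 5.20 on small inputs.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def G_r_direct(y, x, q, r, h1, h2):
    """Direct O(r*N*M) evaluation of the Hermite-weighted Gaussian sum, Eq. 5.19."""
    u = y[:, None] - x[None, :]                  # M x N target-source differences
    coef = np.zeros(r + 1)
    coef[r] = 1.0                                # selects He_r
    return (He.hermeval(u / h2, coef) * np.exp(-(u / h1) ** 2)) @ q

# Eq. 5.20 check for a candidate fast output G_hat:
#   assert np.abs(G_hat - G_r_direct(y, x, q, r, h1, h2)).max() <= np.abs(q).sum() * eps
```

Putting Sections 3 to 5 together, the following sketch runs the two-stage method, steps (1)-(4) of Section 4, for the Gaussian kernel. It calls the direct phi_r_direct above, so every functional evaluation costs $O(N^2)$; the paper's fast algorithm substitutes the $\epsilon$-exact sum for exactly these calls. The bracketing root finder and its search interval are our choices in place of Newton-Raphson.

```python
import numpy as np
from scipy.optimize import brentq

def solve_the_equation_bandwidth(x):
    """Two-stage solve-the-equation plug-in bandwidth, Gaussian kernel.
    Gaussian-kernel constants: R(K) = 1/(2 sqrt(pi)), mu_2(K) = 1,
    -2 K^(4)(0) = -6/sqrt(2 pi), -2 K^(6)(0) = 30/sqrt(2 pi)."""
    N = len(x)
    RK = 1.0 / (2.0 * np.sqrt(np.pi))
    sigma = np.std(x, ddof=1)                                        # step (1)
    phi6 = -15.0 / (16.0 * np.sqrt(np.pi) * sigma**7)                # step (2), Eq. 4.18
    phi8 = 105.0 / (32.0 * np.sqrt(np.pi) * sigma**9)
    g1 = (-6.0 / (np.sqrt(2.0 * np.pi) * phi6 * N)) ** (1.0 / 7.0)   # Eq. 4.17
    g2 = (30.0 / (np.sqrt(2.0 * np.pi) * phi8 * N)) ** (1.0 / 9.0)
    phi4_g1 = phi_r_direct(x, 4, g1)                                 # step (3), Eq. 3.11
    phi6_g2 = phi_r_direct(x, 6, g2)
    # Pilot bandwidth of Eq. 4.16: gamma(h) = c * h^(5/7).
    c = ((-6.0 / np.sqrt(2.0 * np.pi)) * phi4_g1 / (RK * phi6_g2)) ** (1.0 / 7.0)
    # Step (4): solve Eq. 4.13 as a root-finding problem in h.
    f = lambda h: h - (RK / (phi_r_direct(x, 4, c * h ** (5.0 / 7.0)) * N)) ** 0.2
    return brentq(f, 1e-3 * sigma, 2.0 * sigma)   # bracket may need widening
```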

[Figure 1: Running time in seconds and maximum absolute error relative to $Q$ for the direct and the fast methods as a function of $N$. $N = M$ source and target points were uniformly distributed in $[0, 1]$; $r = 4$ and $\epsilon = 10^{-6}$.]

5.2 Choosing the parameters

Given any $\epsilon > 0$, we want to choose the following parameters: $K$ (the number of intervals), $r_y$ (the cut-off radius for each cluster), and $p$ (the truncation number), such that for any target point $y_j$ we can guarantee that $|\hat{G}_r(y_j) - G_r(y_j)| \le Q\epsilon$. We give the final results for the choice of the parameters; the detailed derivations can be seen in the technical report [8]. The number of clusters $K$ is chosen such that $r_x = h_1/2$. The cut-off radius is given by $r_y = r_x + 2h_1\sqrt{\ln(\sqrt{r!}/\epsilon)}$. The truncation number $p$ is chosen such that

$\left[\frac{\sqrt{r!}}{p!}\left(\frac{2ab}{h_1^2}\right)^p e^{-(a-b)^2/4h_1^2}\right]_{b = \min(b_*,\, r_y),\; a = r_x} \le \epsilon, \quad \text{where } b_* = \frac{a + \sqrt{a^2 + 8ph_1^2}}{2}.$

5.3 Numerical experiments

The algorithm was programmed in C++ and was run on a 1.6 GHz Pentium M processor with 512MB of RAM. Figure 1 shows the running time and the maximum absolute error relative to $Q$ for both the direct and the fast methods as a function of $N = M$. We see that the running time of the fast method grows linearly as the number of sources and targets increases, while that of the direct evaluation grows quadratically.

6 Speedup achieved for bandwidth estimation

We demonstrate the speedup achieved on the mixture of normal densities used by Marron and Wand [6]. The family of normal mixture densities is extremely rich; in fact, any density can be approximated arbitrarily well by a member of this family. Figure 2 shows a sample of four of the fifteen densities which were used by the authors in [6] as typical representatives of the densities likely to be encountered in real data situations. We sampled N = 50,000 points from each density. The AMISE optimal bandwidth was estimated using both the direct method and the proposed fast method. Table 1 shows the speedup achieved and the absolute relative error. We also used the Adult database from the UCI machine learning repository [7]. The database, extracted from the census bureau database, contains 32,561 training instances with 14 attributes per instance. Table 2 shows the speedup achieved and the absolute relative error for two of the continuous attributes.

[Figure 2: Four normal mixture densities from Marron and Wand [6]. The solid line shows the actual density and the dotted line is the estimated density using the optimal bandwidth.]

7 Projection pursuit

Projection pursuit (PP) is an exploratory technique for visualizing and analyzing large multivariate datasets [5]. The idea of PP is to search for projections from high- to low-dimensional space that are most interesting. The PP algorithm for finding the most interesting one-dimensional subspace is as follows:

(1) Project each data point onto the direction vector $a \in \mathbb{R}^d$, i.e., $z_i = a^T x_i$.
(2) Compute the univariate nonparametric kernel density estimate, $\hat{p}$, of the projected points $z_i$.
(3) Compute the projection index $I$ based on the density estimate.
(4) Locally optimize over the choice of $a$ to get the most interesting projection of the data.
(5) Repeat from a new initial projection to get a different view.

The projection index is designed to reveal specific structure in the data, like clusters, outliers, or smooth manifolds. The entropy index based on Rényi's order-1 entropy is given by $I = \int p(z) \log p(z)\,dz$. The density of zero mean and unit variance which uniquely minimizes this is the standard normal density; thus the projection index finds the direction which is most non-normal. In practice we need to use an estimate $\hat{p}$ of the true density $p$, for example the KDE using the Gaussian kernel. Thus we have an estimate of the entropy index as follows:

(7.27) $\hat{I} = \int_{\mathbb{R}} \log \hat{p}(z)\, \hat{p}(z)\, dz = \frac{1}{N} \sum_{i=1}^{N} \log \hat{p}(a^T x_i).$
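A sketch of Eq. 7.27 (our naming, reusing kde_gauss_direct from Section 2) shows why the fast method matters here: the index, and hence the KDE, is re-evaluated at every step of the optimization over $a$.

```python
import numpy as np

def entropy_index(a, X, h):
    """Estimated projection index of Eq. 7.27 for a unit-norm direction a,
    where X is the N x d data matrix. Direct cost is O(N^2) per direction."""
    z = X @ a                                   # step (1): project the data onto a
    return np.mean(np.log(kde_gauss_direct(z, z, h)))
```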

Table 1: The running time in seconds for the direct and the fast methods for four normal mixture densities of Marron and Wand [6] (see Fig. 2). The absolute relative error is defined as $|h_{direct} - h_{fast}|/h_{direct}$. For the fast method we used $\epsilon = 10^{-3}$.

Density    h_direct    h_fast    T_direct (sec)    T_fast (sec)    Speedup    Abs. relative error
(a)        543         543       8536              6               8387       53e-6
(b)        94          94        5989              886             6679       634e-6
(c)        436         436       7867              67              6769       84e-6
(d)        349         3493      839               9               6983       383e-6

Table 2: Optimal bandwidth estimation for two continuous attributes of the Adult database [7].

Attribute    h_direct     h_fast    T_direct (sec)    T_fast (sec)    Speedup    Abs. relative error
Age          86846        86856     46793             664             745        7e-5
fnlwgt       499564359    499584    46379             6883            6737       49e-6

The entropy index $\hat{I}$ has to be optimized over the $d$-dimensional vector $a$, subject to the constraint that $\|a\| = 1$. The optimization will require the gradient of the objective function. For the index defined above, the gradient can be written as

$\frac{d}{da}\left[\hat{I}\right] = \frac{1}{N} \sum_{i=1}^{N} \frac{\hat{p}'(a^T x_i)}{\hat{p}(a^T x_i)}\, x_i.$

For PP, the computational burden is greatly reduced if we use the proposed fast method. The burden is reduced in the following three instances: (1) computation of the kernel density estimate, (2) estimation of the optimal bandwidth, and (3) computation of the first derivative of the kernel density estimate, which is required in the optimization procedure. Figure 3 shows an example of the PP algorithm used to segment an image based on color.

[Figure 3: (a) The original image. (b) The centered and scaled RGB space; each pixel in the image is a point in the RGB space. (c) KDE of the projection of the pixels on the most interesting direction found by projection pursuit. (d) The assignment of the pixels to the three modes in the KDE.]

8 Conclusions

We proposed a fast $\epsilon$-exact algorithm for kernel density derivative estimation which reduces the computational complexity from $O(N^2)$ to $O(N)$. We demonstrated the speedup achieved for optimal bandwidth estimation on both simulated and real data. An extended version of this paper is available as a technical report [8].

References

[1] P. K. Bhattacharya. Estimation of a probability density function and its derivatives. Sankhya, Series A, 29:373-382, 1967.
[2] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Info. Theory, 21(1):32-40, 1975.
[3] L. Greengard and J. Strain. The fast Gauss transform. SIAM J. Sci. Stat. Comput., 12(1):79-94, 1991.
[4] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. J. Amer. Stat. Assoc., 91(433):401-407, March 1996.
[5] M. C. Jones and R. Sibson. What is projection pursuit? J. R. Statist. Soc. A, 150:1-36, 1987.
[6] J. S. Marron and M. P. Wand. Exact mean integrated squared error. The Annals of Statistics, 20(2):712-736, 1992.
[7] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[8] V. C. Raykar and R. Duraiswami. Very fast optimal bandwidth selection for univariate kernel density estimation. Technical Report CS-TR-4774, University of Maryland, College Park, 2005.
[9] S. J. Sheather and M. C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. J. Royal Statist. Soc. B, 53:683-690, 1991.
[10] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, 1995.
[11] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis. Improved fast Gauss transform and efficient kernel density estimation. In IEEE Int. Conf. on Computer Vision, pages 464-471, 2003.