Dual-Tree Fast Gauss Transforms


Dongryeol Lee (Computer Science, Carnegie Mellon Univ.), Alexander Gray (Computer Science, Carnegie Mellon Univ.), Andrew Moore (Computer Science, Carnegie Mellon Univ.)

Abstract

In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finite-difference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory.

1 Fast Gaussian Summation

Kernel summations are fundamental in both statistics/learning and computational physics. This paper will focus on the common form

    G(x_q) = \sum_{r=1}^{N} e^{-\|x_q - x_r\|^2 / h^2},

i.e. where the kernel is the Gaussian kernel with scaling parameter, or bandwidth, h, there are N reference points x_r, and we desire the sum for N_Q different query points x_q. Such kernel summations appear in a wide array of statistical/learning methods [5], perhaps most obviously in kernel density estimation [11], the most widely used distribution-free method for the fundamental task of density estimation, which will be our main example.

Understanding kernel summation algorithms from a recently developed unified perspective [5] begins with the picture of Figure 1, then separately considers the discrete and continuous aspects.

Discrete/geometric aspect. In terms of discrete algorithmic structure, the dual-tree framework of [5], in the context of kernel summation, generalizes all of the well-known algorithms.(1) It was applied to the problem of kernel density estimation in [7] using a simple finite-difference approximation, which is tantamount to a centroid approximation. Partially by avoiding series expansions, which depend explicitly on the dimension, the result was the fastest such algorithm for general dimension, when operating at the optimal bandwidth. Unfortunately, when performing cross-validation to determine the (initially unknown) optimal bandwidth, both suboptimally small and large bandwidths must be evaluated. The finite-difference-based dual-tree method tends to be efficient at or below the optimal bandwidth, and at very large bandwidths, but for intermediately-large bandwidths it suffers.

Footnote 1: These include the Barnes-Hut algorithm [2], the Fast Multipole Method [8], Appel's algorithm [1], and the WSPD [4]: the dual-tree method is a node-node algorithm (considers query regions rather than points), is fully recursive, can use distribution-sensitive data structures such as kd-trees, and is bichromatic (can specialize for differing query and reference sets).
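For concreteness, the following minimal sketch (ours, in Python with NumPy; the implementation studied in this paper was written in C and C++) evaluates the Gaussian summation G(x_q) directly. Its O(N_Q N) cost is the baseline that the fast methods below aim to beat, and it also serves as the exact reference against which any approximation can be checked.

```python
import numpy as np

def naive_gauss_transform(queries, references, h):
    """Direct O(N_Q * N) evaluation of G(x_q) = sum_r exp(-||x_q - x_r||^2 / h^2)."""
    diff = queries[:, None, :] - references[None, :, :]   # (N_Q, N, D) pairwise differences
    sqdist = np.sum(diff * diff, axis=-1)                 # squared Euclidean distances
    return np.exp(-sqdist / h**2).sum(axis=1)             # (N_Q,) kernel sums

# Tiny example: N = 1000 reference points, 5 queries, D = 2.
rng = np.random.default_rng(0)
X_ref = rng.standard_normal((1000, 2))
X_qry = rng.standard_normal((5, 2))
print(naive_gauss_transform(X_qry, X_ref, h=0.5))
```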

Figure 1: The basic idea is to approximate the kernel sum contribution of some subset of the reference points R, lying in some compact region of space R with centroid x_R, to a query point. In more efficient schemes a query region is considered, i.e. the approximate contribution is made to an entire subset of the query points Q lying in some region of space Q, with centroid x_Q.

Continuous/approximation aspect. This motivates investigating a multipole-like series approximation which is appropriate for the Gaussian kernel, as introduced by [9], which can be shown to generalize the centroid approximation. We define the Hermite functions h_n(t) by h_n(t) = e^{-t^2} H_n(t), where the Hermite polynomials H_n(t) are defined by the Rodrigues formula: H_n(t) = (-1)^n e^{t^2} D^n e^{-t^2}, t \in \mathbb{R}. After scaling and shifting the argument t appropriately, then taking the product of univariate functions for each dimension, we obtain the multivariate Hermite expansion

    G(x_q) = \sum_{r=1}^{N} e^{-\|x_q - x_r\|^2/h^2} = \sum_{r=1}^{N} \sum_{\alpha \ge 0} \frac{1}{\alpha!} \left( \frac{x_r - x_R}{h} \right)^{\alpha} h_{\alpha}\left( \frac{x_q - x_R}{h} \right)    (1)

where we've adopted the usual multi-index notation as in [9]. This can be re-written as

    G(x_q) = \sum_{r=1}^{N} e^{-\|x_q - x_r\|^2/h^2} = \sum_{\alpha \ge 0} \frac{(-1)^{|\alpha|}}{\alpha!} \left[ \sum_{r=1}^{N} h_{\alpha}\left( \frac{x_Q - x_r}{h} \right) \right] \left( \frac{x_q - x_Q}{h} \right)^{\alpha}    (2)

to express the sum as a Taylor (local) expansion about a nearby representative centroid x_Q in the query region. We will be using both types of expansions simultaneously.

Since series approximations only hold locally, Greengard and Rokhlin [8] showed that it is useful to think in terms of a set of three translation operators for converting between expansions centered at different points, in order to create their celebrated hierarchical algorithm. This was done in the context of the Coulombic kernel, but the Gaussian kernel has importantly different mathematical properties. The original Fast Gauss Transform (FGT) [9] was based on a flat grid, and thus provided only one operator (the HL operator of the next section), with an associated error bound (which was unfortunately incorrect). The Improved Fast Gauss Transform (IFGT) [14] was based on a flat set of clusters and provided no operators; it used a rearranged series approximation intended to be more favorable in higher dimensions, but its error bound was also incorrect. We will show the derivations of all the translation operators and associated error bounds needed to obtain, for the first time, a hierarchical algorithm for the Gaussian kernel.
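To illustrate how Equation (1) is used in practice, here is a small sketch (ours, not the paper's C++ code) that forms the Hermite moments A_alpha of a reference set about a centroid and evaluates the truncated expansion with alpha_i < p in each dimension (p^D terms). The univariate functions h_n are computed with the standard Hermite recurrence h_{n+1}(t) = 2t h_n(t) - 2n h_{n-1}(t), which follows from the Rodrigues formula above.

```python
import numpy as np
from itertools import product
from math import factorial

def hermite_functions(t, p):
    """h_n(t) = e^{-t^2} H_n(t) for n = 0..p-1, elementwise over the array t."""
    h = np.empty((p,) + t.shape)
    h[0] = np.exp(-t * t)
    if p > 1:
        h[1] = 2 * t * h[0]
    for n in range(1, p - 1):
        h[n + 1] = 2 * t * h[n] - 2 * n * h[n - 1]
    return h

def hermite_moments(refs, center, h, p):
    """Hermite moments A_alpha = (1/alpha!) sum_r ((x_r - c)/h)^alpha, all alpha_i < p."""
    s = (refs - center) / h                                 # (N, D) scaled offsets
    D = refs.shape[1]
    A = {}
    for alpha in product(range(p), repeat=D):
        mono = np.prod(s ** np.array(alpha), axis=1).sum()  # sum over references
        A[alpha] = mono / np.prod([factorial(a) for a in alpha])
    return A

def eval_hermite_expansion(A, query, center, h, p):
    """Truncated Equation (1): sum_alpha A_alpha prod_i h_{alpha_i}((x_qi - c_i)/h)."""
    H = hermite_functions((query - center) / h, p)          # H[n, i] = h_n(t_i)
    return sum(a_val * np.prod([H[n, i] for i, n in enumerate(alpha)])
               for alpha, a_val in A.items())
```

Per Lemma 2.4 below, this approximation is accurate only when the reference node is small relative to h; increasing p tightens it at a cost that grows as p^D, which is the dimension sensitivity discussed throughout.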

2 Translation Operators and Error Bounds

The first operator converts a multipole expansion of a reference node to form a local expansion centered at the centroid of the query node, and is our main approximation workhorse.

Lemma 2.1. Hermite-to-local (HL) translation operator for the Gaussian kernel (as presented in [9, 10]): Given a reference node R, a query node Q, and the Hermite expansion centered at a centroid x_R of R,

    G(x_q) = \sum_{\alpha \ge 0} A_\alpha h_\alpha\left( \frac{x_q - x_R}{h} \right),

the Taylor expansion of the Hermite expansion at the centroid x_Q of the query node Q is given by

    G(x_q) = \sum_{\beta \ge 0} B_\beta \left( \frac{x_q - x_Q}{h} \right)^\beta,  where  B_\beta = \frac{(-1)^{|\beta|}}{\beta!} \sum_{\alpha \ge 0} A_\alpha h_{\alpha+\beta}\left( \frac{x_Q - x_R}{h} \right).

Proof. (sketch) The proof consists of replacing the Hermite function portion of the expansion with its Taylor series.

Note that we can rewrite

    G(x_q) = \sum_{\alpha \ge 0} \left[ \frac{1}{\alpha!} \sum_{r=1}^{N} \left( \frac{x_r - x_R}{h} \right)^\alpha \right] h_\alpha\left( \frac{x_q - x_R}{h} \right)

by interchanging the summation order, such that the term in the brackets depends only on the reference points, and can thus be computed independent of any query location; we will call such terms Hermite moments. The next operator allows the efficient pre-computation of the Hermite moments in the reference tree in a bottom-up fashion from its children.

Lemma 2.2. Hermite-to-Hermite (HH) translation operator for the Gaussian kernel: Given the Hermite expansion centered at a centroid x_{R'} of a reference node R',

    G(x_q) = \sum_{\alpha \ge 0} A'_\alpha h_\alpha\left( \frac{x_q - x_{R'}}{h} \right),

this same Hermite expansion shifted to the centroid x_R of the parent node R of R' is given by

    G(x_q) = \sum_{\gamma \ge 0} A_\gamma h_\gamma\left( \frac{x_q - x_R}{h} \right),  where  A_\gamma = \sum_{0 \le \alpha \le \gamma} \frac{1}{(\gamma - \alpha)!} A'_\alpha \left( \frac{x_{R'} - x_R}{h} \right)^{\gamma - \alpha}.

Proof. We simply replace the Hermite function part of the expansion by a new Taylor series, as follows:

    G(x_q) = \sum_{\alpha \ge 0} A'_\alpha h_\alpha\left( \frac{x_q - x_{R'}}{h} \right)
           = \sum_{\alpha \ge 0} A'_\alpha \sum_{\beta \ge 0} \frac{1}{\beta!} \left( \frac{x_{R'} - x_R}{h} \right)^\beta h_{\alpha+\beta}\left( \frac{x_q - x_R}{h} \right)
           = \sum_{\gamma \ge 0} \left[ \sum_{0 \le \alpha \le \gamma} \frac{1}{(\gamma - \alpha)!} A'_\alpha \left( \frac{x_{R'} - x_R}{h} \right)^{\gamma - \alpha} \right] h_\gamma\left( \frac{x_q - x_R}{h} \right),

where \gamma = \alpha + \beta.
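The HL operator of Lemma 2.1 is simply a linear map from Hermite moments to Taylor coefficients. A one-dimensional sketch (ours; the multivariate case applies this per multi-index) makes the mechanics explicit:

```python
import numpy as np
from math import factorial

def hermite_to_local_1d(A, xR, xQ, h, p):
    """Lemma 2.1 in one dimension: B_b = ((-1)^b / b!) * sum_{a<p} A[a] * h_{a+b}(t0),
    where t0 = (xQ - xR)/h and A holds Hermite moments about xR."""
    t0 = (xQ - xR) / h
    hn = np.empty(2 * p - 1)          # need h_n(t0) up to order 2p - 2
    hn[0] = np.exp(-t0 * t0)
    if len(hn) > 1:
        hn[1] = 2 * t0 * hn[0]
    for n in range(1, len(hn) - 1):
        hn[n + 1] = 2 * t0 * hn[n] - 2 * n * hn[n - 1]
    return [(-1) ** b / factorial(b) * sum(A[a] * hn[a + b] for a in range(p))
            for b in range(p)]        # G(x_q) ~ sum_b B[b] * ((x_q - xQ)/h)**b
```

Note that the translation touches only the p (here) or p^D (in general) coefficients, never the points themselves, which is what makes a hierarchical algorithm possible.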

The next operator acts as a clean-up routine in a hierarchical algorithm. Since we can approximate at different scales in the query tree, we must somehow combine all the approximations at the end of the computation. By performing a breadth-first traversal of the query tree, the LL operator shifts a node's local expansion to the centroid of each child.

Lemma 2.3. Local-to-local (LL) translation operator for the Gaussian kernel: Given a Taylor expansion centered at a centroid x_Q of a query node Q,

    G(x_q) = \sum_{\beta \ge 0} B_\beta \left( \frac{x_q - x_Q}{h} \right)^\beta,

the Taylor expansion obtained by shifting this expansion to the new centroid x_{Q'} of the child node Q' is

    G(x_q) = \sum_{\alpha \ge 0} \left[ \sum_{\beta \ge \alpha} \frac{\beta!}{\alpha!(\beta - \alpha)!} B_\beta \left( \frac{x_{Q'} - x_Q}{h} \right)^{\beta - \alpha} \right] \left( \frac{x_q - x_{Q'}}{h} \right)^\alpha.

Proof. Applying the multinomial theorem to expand about the new center x_{Q'} yields:

    G(x_q) = \sum_{\beta \ge 0} B_\beta \left( \frac{x_q - x_Q}{h} \right)^\beta = \sum_{\beta \ge 0} B_\beta \sum_{\alpha \le \beta} \frac{\beta!}{\alpha!(\beta - \alpha)!} \left( \frac{x_{Q'} - x_Q}{h} \right)^{\beta - \alpha} \left( \frac{x_q - x_{Q'}}{h} \right)^\alpha,

whose summation order can be interchanged to achieve the result.

Because the Hermite and the Taylor expansions are truncated after taking p^D terms, we incur an error in approximation. The original error bounds for the Gaussian kernel in [9, 10] were wrong, and corrections were shown in [3]. Here, we will present all three necessary error bounds incurred in performing translation operators.(2) We note that these error bounds place limits on the size of the query node and the reference node.

Lemma 2.4. Error bound for truncating an Hermite expansion (as presented in [3]): Suppose we are given an Hermite expansion of a reference node R about its centroid x_R:

    G(x_q) = \sum_{\alpha \ge 0} A_\alpha h_\alpha\left( \frac{x_q - x_R}{h} \right),  where  A_\alpha = \frac{1}{\alpha!} \sum_{r=1}^{N} \left( \frac{x_r - x_R}{h} \right)^\alpha.

For any query point x_q, the error due to truncating the series after the first p^D terms is

    |\epsilon_M(p)| \le \frac{N}{(1-r)^D} \sum_{k=0}^{D-1} \binom{D}{k} (1 - r^p)^k \left( \frac{r^p}{\sqrt{p!}} \right)^{D-k},

where each x_r satisfies \|x_r - x_R\|_\infty < rh for r < 1.

Proof. (sketch) We expand the Hermite expansion as a product of one-dimensional Hermite functions, and utilize a bound on one-dimensional Hermite functions due to [13]: \frac{1}{n!} |h_n(x)| \le \frac{2^{n/2}}{\sqrt{n!}} e^{-x^2/2}, n \ge 0, x \in \mathbb{R}.

Footnote 2: Strain [12] proposed the interesting idea of using Stirling's formula (for any non-negative integer n, ((n+1)/e)^{n+1} \le n!) to lift the node size constraint; one might imagine that this could allow approximation of larger regions in a tree-based algorithm. Unfortunately, the error bounds developed in [12] were also incorrect. We have derived the three necessary corrected error bounds based on the techniques in [3]. However, due to space, and because using these bounds actually degraded performance slightly, we do not include those lemmas here.
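A one-dimensional sketch of the LL shift of Lemma 2.3 (ours, for illustration) shows that it is nothing more than re-centering a polynomial with the binomial theorem:

```python
from math import comb

def local_to_local_1d(B, xQ, xQ_child, h):
    """Lemma 2.3 in one dimension: re-center sum_b B[b] ((x - xQ)/h)^b at a child
    centroid xQ_child; coefficient a picks up comb(b, a) * B[b] * d**(b - a) for b >= a."""
    p = len(B)
    d = (xQ_child - xQ) / h
    return [sum(comb(b, a) * B[b] * d ** (b - a) for b in range(a, p))
            for a in range(p)]
```

Because the shift is exact for a polynomial of the existing order, the clean-up pass down the query tree adds no approximation error beyond the truncations already made.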

Lemma 2.5. Error bound for truncating a Taylor expansion converted from an Hermite expansion of infinite order: Suppose we are given the following Taylor expansion about the centroid x_Q of a query node Q:

    G(x_q) = \sum_{\beta \ge 0} B_\beta \left( \frac{x_q - x_Q}{h} \right)^\beta,  where  B_\beta = \frac{(-1)^{|\beta|}}{\beta!} \sum_{\alpha \ge 0} A_\alpha h_{\alpha+\beta}\left( \frac{x_Q - x_R}{h} \right)

and the A_\alpha's are the coefficients of the Hermite expansion centered at the reference node centroid x_R. Then, truncating the series after p^D terms satisfies the error bound

    |\epsilon_L(p)| \le \frac{N}{(1-r)^D} \sum_{k=0}^{D-1} \binom{D}{k} (1 - r^p)^k \left( \frac{r^p}{\sqrt{p!}} \right)^{D-k},

where \|x_q - x_Q\|_\infty < rh for r < 1, for all x_q \in Q.

Proof. Taylor expansion of the Hermite function yields

    e^{-\|x_q - x_r\|^2/h^2} = \sum_{\beta \ge 0} \frac{(-1)^{|\beta|}}{\beta!} \left[ \sum_{\alpha \ge 0} \frac{1}{\alpha!} \left( \frac{x_r - x_R}{h} \right)^\alpha h_{\alpha+\beta}\left( \frac{x_Q - x_R}{h} \right) \right] \left( \frac{x_q - x_Q}{h} \right)^\beta = \sum_{\beta \ge 0} \frac{(-1)^{|\beta|}}{\beta!} h_\beta\left( \frac{x_Q - x_r}{h} \right) \left( \frac{x_q - x_Q}{h} \right)^\beta.

Write e^{-\|x_q - x_r\|^2/h^2} = \prod_{i=1}^{D} \left( u_p(x_{qi}, x_{ri}, x_{Qi}) + v_p(x_{qi}, x_{ri}, x_{Qi}) \right), where

    u_p(x_{qi}, x_{ri}, x_{Qi}) = \sum_{n_i=0}^{p-1} \frac{(-1)^{n_i}}{n_i!} h_{n_i}\left( \frac{x_{Qi} - x_{ri}}{h} \right) \left( \frac{x_{qi} - x_{Qi}}{h} \right)^{n_i},

    v_p(x_{qi}, x_{ri}, x_{Qi}) = \sum_{n_i=p}^{\infty} \frac{(-1)^{n_i}}{n_i!} h_{n_i}\left( \frac{x_{Qi} - x_{ri}}{h} \right) \left( \frac{x_{qi} - x_{Qi}}{h} \right)^{n_i}.

These univariate functions respectively satisfy

    |u_p(x_{qi}, x_{ri}, x_{Qi})| \le \frac{1 - r^p}{1 - r},   |v_p(x_{qi}, x_{ri}, x_{Qi})| \le \frac{r^p}{\sqrt{p!}} \cdot \frac{1}{1 - r},

for 1 \le i \le D, achieving the multivariate bound.

Lemma 2.6. Error bound for truncating a Taylor expansion converted from an already truncated Hermite expansion: A truncated Hermite expansion centered about the centroid x_R of a reference node R,

    G(x_q) = \sum_{\alpha < p} A_\alpha h_\alpha\left( \frac{x_q - x_R}{h} \right),

has the following Taylor expansion about the centroid x_Q of a query node Q:

    G(x_q) = \sum_{\beta \ge 0} C_\beta \left( \frac{x_q - x_Q}{h} \right)^\beta,  where  C_\beta = \frac{(-1)^{|\beta|}}{\beta!} \sum_{\alpha < p} A_\alpha h_{\alpha+\beta}\left( \frac{x_Q - x_R}{h} \right).

Truncating the series after p^D terms satisfies the error bound

    |\epsilon_L(p)| \le \frac{N}{(1-2r)^{2D}} \sum_{k=0}^{D-1} \binom{D}{k} \left( (1 - (2r)^p)^2 \right)^k \left( \frac{(2r)^p (2 - (2r)^p)}{\sqrt{p!}} \right)^{D-k},

for a query node Q for which \|x_q - x_Q\|_\infty < rh, and a reference node R for which \|x_r - x_R\|_\infty < rh, for r < 1/2, for all x_q \in Q and x_r \in R.

Proof. We define u_{pi} = u_p(x_{qi}, x_{ri}, x_{Qi}, x_{Ri}), v_{pi} = v_p(x_{qi}, x_{ri}, x_{Qi}, x_{Ri}), w_{pi} = w_p(x_{qi}, x_{ri}, x_{Qi}, x_{Ri}) for 1 \le i \le D:

    u_{pi} = \sum_{n_i=0}^{p-1} \frac{(-1)^{n_i}}{n_i!} \left[ \sum_{n_j=0}^{p-1} \frac{1}{n_j!} \left( \frac{x_{ri} - x_{Ri}}{h} \right)^{n_j} h_{n_i+n_j}\left( \frac{x_{Qi} - x_{Ri}}{h} \right) \right] \left( \frac{x_{qi} - x_{Qi}}{h} \right)^{n_i},

    v_{pi} = \sum_{n_i=0}^{p-1} \frac{(-1)^{n_i}}{n_i!} \left[ \sum_{n_j=p}^{\infty} \frac{1}{n_j!} \left( \frac{x_{ri} - x_{Ri}}{h} \right)^{n_j} h_{n_i+n_j}\left( \frac{x_{Qi} - x_{Ri}}{h} \right) \right] \left( \frac{x_{qi} - x_{Qi}}{h} \right)^{n_i},

    w_{pi} = \sum_{n_i=p}^{\infty} \frac{(-1)^{n_i}}{n_i!} \left[ \sum_{n_j=0}^{\infty} \frac{1}{n_j!} \left( \frac{x_{ri} - x_{Ri}}{h} \right)^{n_j} h_{n_i+n_j}\left( \frac{x_{Qi} - x_{Ri}}{h} \right) \right] \left( \frac{x_{qi} - x_{Qi}}{h} \right)^{n_i}.

Note that e^{-(x_{qi} - x_{ri})^2/h^2} = u_{pi} + v_{pi} + w_{pi} for 1 \le i \le D. Using the bound for one-dimensional Hermite functions and the property of the geometric series, we obtain the following upper bounds:

    |u_{pi}| \le \sum_{n_i=0}^{p-1} \sum_{n_j=0}^{p-1} (2r)^{n_i} (2r)^{n_j} = \left( \frac{1 - (2r)^p}{1 - 2r} \right)^2,

    |v_{pi}| \le \frac{1}{\sqrt{p!}} \sum_{n_i=0}^{p-1} \sum_{n_j=p}^{\infty} (2r)^{n_i} (2r)^{n_j} = \frac{(1 - (2r)^p)(2r)^p}{(1 - 2r)^2 \sqrt{p!}},

    |w_{pi}| \le \frac{1}{\sqrt{p!}} \sum_{n_i=p}^{\infty} \sum_{n_j=0}^{\infty} (2r)^{n_i} (2r)^{n_j} = \frac{(2r)^p}{(1 - 2r)^2 \sqrt{p!}}.

Therefore,

    \left| G(x_q) - \sum_{\beta < p} C_\beta \left( \frac{x_q - x_Q}{h} \right)^\beta \right| \le N \left| \prod_{i=1}^{D} (u_{pi} + v_{pi} + w_{pi}) - \prod_{i=1}^{D} u_{pi} \right| \le \frac{N}{(1-2r)^{2D}} \sum_{k=0}^{D-1} \binom{D}{k} \left( (1 - (2r)^p)^2 \right)^k \left( \frac{(2r)^p (2 - (2r)^p)}{\sqrt{p!}} \right)^{D-k}.
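Since these bounds are computable from N, D, r, and p alone, the truncation order can be chosen automatically, which is how the algorithm in the next section avoids tweak parameters. A small sketch (ours) of that selection for the Lemma 2.4/2.5 bound follows; the Lemma 2.6 bound would be handled identically with its own formula.

```python
from math import comb, factorial, sqrt

def bound_2_4(N, D, r, p):
    """Lemma 2.4/2.5 bound: N/(1-r)^D * sum_{k<D} C(D,k) (1-r^p)^k (r^p/sqrt(p!))^(D-k)."""
    rp = r ** p
    return N / (1 - r) ** D * sum(
        comb(D, k) * (1 - rp) ** k * (rp / sqrt(factorial(p))) ** (D - k)
        for k in range(D))

def smallest_order(N, D, r, eps, G_min, p_max=30):
    """Smallest p whose bound is below eps * G_min (the relative-error budget), else None."""
    for p in range(1, p_max + 1):
        if bound_2_4(N, D, r, p) < eps * G_min:
            return p
    return None
```

The bound decreases in p only when r is small enough (the node-size constraint), which is exactly why the DFGT routine below first tests the node side lengths against h.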

3 Algorithm and Results

Algorithm. The algorithm mainly consists of making the function call DFGT(Q.root, R.root), i.e. calling the recursive function DFGT() with the root nodes of the query tree and the reference tree. After the DFGT() routine is completed, the pre-order traversal of the query tree implied by the LL operator is performed. Before the DFGT() routine is called, the reference tree could be initialized with Hermite coefficients stored in each node using the HH translation operator, but instead we will compute them as needed on the fly. The routine adaptively chooses among three possible methods for approximating the summation contribution of the points in node R to the queries in node Q, which are self-explanatory, based on crude operation count estimates. G^min_Q, a running lower bound on the kernel sum G(x_q) for any x_q in Q, is used to ensure locally that the global relative error is \epsilon or less. This automatic mechanism allows the user to specify only an error tolerance \epsilon rather than other tweak parameters. Upon approximation, the upper and lower bounds on G for Q and all its children are updated; the latter can be done in an O(1) delayed fashion as in [7]. The remainder of the routine implements the characteristic four-way dual-tree recursion. We also tested a hybrid method (DFGTH) which approximates if either the DFD or the DFGT approximation criteria are met.

Experimental results. We empirically studied the runtime(3) performance of five algorithms on five real-world datasets for kernel density estimation at every query point, with a range of bandwidths from three orders of magnitude smaller than optimal to three orders larger than optimal, according to the standard least-squares cross-validation score [11].

Footnote 3: All times include all preprocessing costs, including any data structure construction. Times are measured in CPU seconds on a dual-processor AMD Opteron machine with 8 Gb of main memory. All the code that we have written and obtained is written in C and C++, and was compiled with the -O6 -funroll-loops flags under Linux.
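Anticipating the operation-count estimates c_DH, c_DL, c_HL, and c_Direct in the DFGT listing below, the adaptive choice is a simple minimum over four crude costs; a sketch (ours, using the paper's formulas):

```python
def choose_method(p_DH, p_DL, p_HL, D, NQ, NR):
    """Pick the cheapest of the four options; p_* is None when the node-size and
    error tests failed, which we model as infinite cost."""
    INF = float("inf")
    c = {
        "direct_hermite":   p_DH ** D * NQ if p_DH else INF,        # evaluate series at each query
        "direct_local":     p_DL ** D * NR if p_DL else INF,        # accumulate each reference point
        "hermite_to_local": D * p_HL ** (D + 1) if p_HL else INF,   # one series-to-series translation
        "direct":           D * NQ * NR,                            # exhaustive pairwise evaluation
    }
    best = min(c, key=c.get)
    return best, c

# e.g. choose_method(p_DH=6, p_DL=None, p_HL=4, D=2, NQ=500, NR=800)
```

Note that c_HL is independent of both N_Q and N_R: once both nodes are small relative to h, translating expansions costs the same no matter how many points they contain, which is the source of the method's large-bandwidth speedups.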

The naive algorithm computes the sum explicitly, and thus exactly. We have limited all datasets to 50K points so that the true relative error, i.e. |\hat{G}(x_q) - G^{true}(x_q)| / G^{true}(x_q), can be evaluated, and set the tolerance at 1% relative error for all query points. When any method fails to achieve the error tolerance in less time than twice that of the naive method, we give up. Codes for the FGT [9] and for the IFGT [14] were obtained from the authors' websites. Note that both of these methods require the user to tweak parameters, while the others are automatic.(4) DFD refers to the depth-first dual-tree finite-difference method [7].

DFGT(Q, R)
  p_DH = p_DL = p_HL = \infty.
  if R.maxside < 2h, p_DH = the smallest p such that
    \frac{N}{(1-r)^D} \sum_{k=0}^{D-1} \binom{D}{k} (1 - r^p)^k \left( \frac{r^p}{\sqrt{p!}} \right)^{D-k} < \epsilon G^min_Q.
  if Q.maxside < 2h, p_DL = the smallest p such that
    \frac{N}{(1-r)^D} \sum_{k=0}^{D-1} \binom{D}{k} (1 - r^p)^k \left( \frac{r^p}{\sqrt{p!}} \right)^{D-k} < \epsilon G^min_Q.
  if max(Q.maxside, R.maxside) < h, p_HL = the smallest p such that
    \frac{N}{(1-2r)^{2D}} \sum_{k=0}^{D-1} \binom{D}{k} \left( (1 - (2r)^p)^2 \right)^k \left( \frac{(2r)^p (2 - (2r)^p)}{\sqrt{p!}} \right)^{D-k} < \epsilon G^min_Q.
  c_DH = p_DH^D N_Q.  c_DL = p_DL^D N_R.  c_HL = D p_HL^{D+1}.  c_Direct = D N_Q N_R.
  if no Hermite coefficients of order p_DH exist for R, compute them; c_DH = c_DH + p_DH^D N_R.
  if no Hermite coefficients of order p_HL exist for R, compute them; c_HL = c_HL + p_HL^D N_R.
  c = min(c_DH, c_DL, c_HL, c_Direct).
  if c = c_DH < \infty: (Direct Hermite) Evaluate each x_q at the Hermite series of order p_DH centered about x_R of R using Equation (1).
  if c = c_DL < \infty: (Direct Local) Accumulate each x_r of R into the Taylor series of order p_DL about the center x_Q of Q using Equation (2).
  if c = c_HL < \infty: (Hermite-to-Local) Convert the Hermite series of order p_HL centered about x_R of R to the Taylor series of the same order centered about x_Q of Q using Lemma 2.1.
  if c \ne c_Direct: Update G^min and G^max in Q and all its children, and return.
  if leaf(Q) and leaf(R), perform the naive algorithm on every pair of points in Q and R.
  else:
    DFGT(Q.left, R.left).  DFGT(Q.left, R.right).
    DFGT(Q.right, R.left).  DFGT(Q.right, R.right).

Footnote 4: For the FGT, note that the algorithm only ensures |\hat{G}(x_q) - G^{true}(x_q)| \le \tau. Therefore, we first set \tau = \epsilon, halving \tau until the error tolerance \epsilon was met. For the IFGT, which has multiple parameters that must be tweaked simultaneously, an automatic scheme was created based on the recommendations given in the paper and software documentation: for D = 2, use p = 8; for D = 3, use p = 6; set \rho_x = 2.5; start with K = \sqrt{N} and double K until the error tolerance is met. When this failed to meet the tolerance, we resorted to additional trial and error by hand. The costs of parameter selection for these methods, in both computer and human time, are not included in the table.
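As a structural illustration of the recursion just listed, here is a runnable skeleton (ours; a deliberately simplified sketch, not the paper's C++ implementation). The try_approximate hook stands in for the whole p/c selection logic above and returns True when it has handled the pair by approximation; the one-sided recursions are an implementation detail added here for trees whose leaves occur at different depths.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    pts: np.ndarray                    # (n, D) points stored in this node
    idx: np.ndarray                    # their indices in the global query array
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    def is_leaf(self):
        return self.left is None

def dfgt(Q, R, h, G, try_approximate):
    """Four-way dual-tree recursion skeleton for the Gaussian summation."""
    if try_approximate(Q, R, G):       # DH / DL / HL approximation, if admissible and cheapest
        return
    if Q.is_leaf() and R.is_leaf():    # base case: exhaustive evaluation on the point pairs
        d2 = ((Q.pts[:, None, :] - R.pts[None, :, :]) ** 2).sum(-1)
        G[Q.idx] += np.exp(-d2 / h**2).sum(axis=1)
    elif Q.is_leaf():
        dfgt(Q, R.left, h, G, try_approximate)
        dfgt(Q, R.right, h, G, try_approximate)
    elif R.is_leaf():
        dfgt(Q.left, R, h, G, try_approximate)
        dfgt(Q.right, R, h, G, try_approximate)
    else:
        for qc in (Q.left, Q.right):
            for rc in (R.left, R.right):
                dfgt(qc, rc, h, G, try_approximate)

# With try_approximate = lambda Q, R, G: False, this reduces to the naive algorithm.
```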

[Table 1: CPU times (in seconds) for the five algorithms at bandwidth scales ranging from three orders of magnitude below to three orders above the optimal h*, on five datasets of N = 50000 points each: sj (astronomy: positions, D = 2), colors50k (astronomy: colors, D = 2), edsgc-radec-rnd (astronomy: angles, D = 2), mockgalaxy-d-m-rnd (cosmology: positions, D = 3), and bio5-rnd (biology: drug activity, D = 5). "out of RAM" marks runs that exhausted memory; "> Naive" marks runs that failed to beat twice the naive time. Qualitatively: the FGT runs out of RAM at the smaller bandwidth scales on every dataset (and at every scale on bio5-rnd) and exceeds the naive time at several larger scales; the IFGT is "> Naive" at every scale on every dataset; DFD, DFGT, and DFGTH succeed at all scales on the D = 2 and D = 3 datasets; on bio5-rnd only DFD succeeds, while DFGT and DFGTH are "> Naive" at every scale.]

Discussion. The experiments indicate that the DFGTH method is able to achieve reasonable performance across all bandwidth scales. Unfortunately, none of the series-approximation-based methods does well on the 5-dimensional data, as expected, highlighting the main weakness of the approach presented. Pursuing corrections to the error bounds necessary to use the intriguing series form of [14] may allow an increase in dimensionality.

References

[1] A. W. Appel. An Efficient Program for Many-Body Simulations. SIAM Journal on Scientific and Statistical Computing, 6(1):85-103, 1985.
[2] J. Barnes and P. Hut. A Hierarchical O(N log N) Force-Calculation Algorithm. Nature, 324, 1986.
[3] B. Baxter and G. Roussos. A New Error Estimate of the Fast Gauss Transform. SIAM Journal on Scientific Computing, 24(1):257-259, 2002.
[4] P. Callahan and S. Kosaraju. A Decomposition of Multidimensional Point Sets with Applications to k-Nearest-Neighbors and n-Body Potential Fields. Journal of the ACM, 42(1):67-90, January 1995.
[5] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001.
[6] A. G. Gray. Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003.
[7] A. G. Gray and A. W. Moore. Rapid Evaluation of Multiple Density Models. In Artificial Intelligence and Statistics 2003, 2003.
[8] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 1987.
[9] L. Greengard and J. Strain. The Fast Gauss Transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79-94, 1991.
[10] L. Greengard and X. Sun. A New Version of the Fast Gauss Transform. Documenta Mathematica, Extra Volume ICM(III), 1998.
[11] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
[12] J. Strain. The Fast Gauss Transform with Variable Scales. SIAM Journal on Scientific and Statistical Computing, 12:1131-1139, 1991.
[13] O. Szász. On the Relative Extrema of the Hermite Orthogonal Functions. J. Indian Math. Soc., 25:129-134, 1951.
[14] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis. Improved Fast Gauss Transform and Efficient Kernel Density Estimation. In International Conference on Computer Vision, 2003.
