Stable Parameterization Schemes for Gaussians


1 Stable Parameterization Schemes for Gaussians
Michael McCourt
Department of Mathematical and Statistical Sciences, University of Colorado Denver
ICOSAHOM 2014, Salt Lake City, June 24, 2014

2 Special Thanks
I would like to offer my thanks to Rodrigo Platte for the invitation to this session, and to all of you in attendance for your attention.

3 Introduction
The shape parameter ε associated with some RBFs plays an important role in the quality of associated approximations.
Example
[Figure: Gaussian support vector machine decision contours for ε = 10 (too erratic), ε = 1 (nice), and ε = 0.1 (insensitive).]
Comparison with the true pattern allows for judgment. How can the quality be judged without the true pattern?

4 Kernel-based Interpolation
Basic strategies for parameterizing kernels are based on:
Error bounds ([9], [10])
Stationary interpolation strategies (e.g., [13] or [7])
Trial and error (almost everyone)
Today, we focus on statistical approaches to optimizing ε, generally based on the interpretation of scattered data through spatial statistics or data assimilation. We restrict our discussion to Gaussian kernels
K(x, z) = e^{-ε² ‖x - z‖²};
most results apply to other positive definite kernels as well.
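
To make the role of ε concrete, here is a minimal NumPy sketch (the node set, test function, and names are ours, purely for illustration) that forms the standard-basis interpolant s(x) = k(x)^T K^{-1} y for a few shape parameters; the growth of cond(K) as ε shrinks is the ill-conditioning the rest of the talk works around.

```python
import numpy as np

def gauss_kernel(X, Z, ep):
    """Gaussian kernel matrix K[i, j] = exp(-ep^2 ||X[i] - Z[j]||^2)."""
    sq = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-ep ** 2 * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))               # scattered nodes
y = np.cos(3 * X[:, 0]) + X[:, 0] ** 2             # data to interpolate
xx = np.linspace(-1, 1, 200)[:, None]              # evaluation grid
f_true = np.cos(3 * xx[:, 0]) + xx[:, 0] ** 2

for ep in (10.0, 1.0, 0.1):
    K = gauss_kernel(X, X, ep)
    s = gauss_kernel(xx, X, ep) @ np.linalg.solve(K, y)   # s(x) = k(x)^T K^{-1} y
    print(f"ep = {ep:5.2f}  cond(K) = {np.linalg.cond(K):9.2e}  "
          f"max error = {np.max(np.abs(s - f_true)):.2e}")
```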

5 Statistical Parameterization Schemes
By using spatial statistics to analyze data, we open up avenues:
Kriging variance, a.k.a. the power function
Golomb-Weinberger bound
Maximum likelihood estimation
Maximum a posteriori estimation
k-fold cross-validation (not discussed here)
Generalized cross-validation (also not discussed)
Goal
In this talk, we will look at how these quantities can be computed for small ε values using a stable basis for Gaussians [3].

6 Hilbert-Schmidt Eigenexpansion
The term Hilbert-Schmidt SVD (HS-SVD) was introduced in [2], though the idea originated in [5]. We recall the Hilbert-Schmidt eigenvalue problem
∫ K(x, z) φ_n(z) ρ(z) dz = λ_n φ_n(x),
which allows us to write a positive definite kernel as
K(x, z) = Σ_{‖n‖=d}^∞ λ_n φ_n(x) φ_n(z),
where n is a multiindex. By exploiting this series, the standard basis k(x)^T = ( K(x, x_1) ··· K(x, x_N) ) can be expressed using a stable basis ψ(x)^T.

7 Hilbert-Schmidt Eigenexpansion for Gaussians
Gaussians in d dimensions have the eigenexpansion
λ_n = α^d ε^{2(‖n‖-d)} (ε² + δ² + α²)^{-(2‖n‖-d)/2},   φ_n(x) = γ_n e^{-δ²‖x‖²} H_{n-1}(βαx),
where α is a free parameter (another one...), and β, γ_n, δ² are defined by ε and α; ignore them.
Important Limits
For small ε, the eigenvalues decay quickly and the eigenfunctions approach Hermite polynomials:
λ_n ∼ ε^{2(‖n‖-d)} and φ_n(x) → H_{n-1}(αx) as ε → 0.
Designer Kernels
Work such as [15] and [4] suggests constructing kernels by choosing eigenfunctions to satisfy certain properties.
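
For reference, a sketch of these eigenpairs in code for d = 1. The normalization of β, δ², and γ_n follows our reading of [3] and should be treated as an assumption; the final check only passes if that reading is right.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermval

def gauss_eigenpairs(x, M, ep, alpha):
    """First M eigenvalues and eigenfunction columns of the 1D Gaussian kernel."""
    beta = (1.0 + (2.0 * ep / alpha) ** 2) ** 0.25
    delta2 = 0.5 * alpha ** 2 * (beta ** 2 - 1.0)
    denom = alpha ** 2 + delta2 + ep ** 2
    lam = np.sqrt(alpha ** 2 / denom) * (ep ** 2 / denom) ** np.arange(M)
    Phi = np.empty((x.size, M))
    for n in range(M):                                 # column n holds phi_{n+1}
        gamma_n = np.sqrt(beta / (2.0 ** n * factorial(n)))
        Hn = hermval(alpha * beta * x, np.eye(n + 1)[n])   # Hermite polynomial H_n
        Phi[:, n] = gamma_n * np.exp(-delta2 * x ** 2) * Hn
    return Phi, lam

# Sanity check: a truncated Phi diag(lam) Phi^T should reproduce K.
x = np.linspace(-1, 1, 20)
ep, alpha, M = 0.5, 1.0, 40
Phi, lam = gauss_eigenpairs(x, M, ep, alpha)
K = np.exp(-ep ** 2 * (x[:, None] - x[None, :]) ** 2)
print(np.max(np.abs(Phi @ (lam[:, None] * Phi.T) - K)))   # small if the normalization above is right
```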

8 Eigenfunction Decompositions and Matrix Manipulations
The N×N kernel matrix K can be written as the infinite matrix product
K(x, z) = Σ_{‖n‖=d}^∞ λ_n φ_n(x) φ_n(z)   ⟹   K = Φ Λ Φ^T = ( Φ_1  Φ_2 ) ( Λ_1  0 ; 0  Λ_2 ) ( Φ_1^T ; Φ_2^T ),
where Φ_1, Λ_1 are N×N and invertible. The Hilbert-Schmidt eigenvalues appear in nonincreasing order on the diagonal of Λ. We may write
K = ( Φ_1  Φ_2 ) ( Λ_1 Φ_1^T ; Λ_2 Φ_2^T ) = ( Φ_1  Φ_2 ) ( I_N ; Λ_2 Φ_2^T Φ_1^{-T} Λ_1^{-1} ) Λ_1 Φ_1^T = Ψ Λ_1 Φ_1^T.
K = Ψ Λ_1 Φ_1^T is the Hilbert-Schmidt SVD, or just HS-SVD. Ψ is made of the eigenfunctions plus a small correction:
Ψ = ( Φ_1  Φ_2 ) ( I_N ; Λ_2 Φ_2^T Φ_1^{-T} Λ_1^{-1} ) = Φ_1 + Φ_2 Λ_2 Φ_2^T Φ_1^{-T} Λ_1^{-1}.

9 Stable Interpolation Through the Hilbert-Schmidt SVD
Forgoing the details, the standard basis and stable basis are related by
k(x)^T = ψ(x)^T Λ_1 Φ_1^T,   K = Ψ Λ_1 Φ_1^T,
where k(x_i)^T and ψ(x_i)^T are the i-th rows of K and Ψ respectively. Evaluating the interpolant s at the point x can now be done stably:
s(x) = k(x)^T K^{-1} y = ψ(x)^T Λ_1 Φ_1^T Φ_1^{-T} Λ_1^{-1} Ψ^{-1} y = ψ(x)^T Ψ^{-1} y.
This new basis ψ is more stable than k because the Λ_1^{-1} term is eliminated analytically; for small ε, the condition of Λ_1 becomes untenable.
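
Given eigenfunction values, forming Ψ and evaluating s(x) = ψ(x)^T Ψ^{-1} y is only a few lines. This sketch assumes Phi and lam come from the gauss_eigenpairs helper above (with M comfortably larger than N) and that Phi_eval holds the same eigenfunctions at the evaluation points; it illustrates the idea rather than reproducing the RBF-QR implementation of [3].

```python
import numpy as np

def hs_svd_interpolant(Phi, lam, Phi_eval, y):
    """Interpolate through the stable basis: s(x) = psi(x)^T Psi^{-1} y.

    Phi      : (N, M) eigenfunctions at the data sites, M > N
    lam      : (M,)   eigenvalues in nonincreasing order
    Phi_eval : (P, M) eigenfunctions at the evaluation points
    y        : (N,)   data values
    """
    N = Phi.shape[0]
    Phi1, Phi2 = Phi[:, :N], Phi[:, N:]
    lam1, lam2 = lam[:N], lam[N:]
    # Correction Lambda_2 Phi_2^T Phi_1^{-T} Lambda_1^{-1}: only the ratios
    # lam2 / lam1 appear, which is why Lambda_1 is never inverted on its own.
    Corr = (lam2[:, None] / lam1[None, :]) * np.linalg.solve(Phi1, Phi2).T
    Psi = Phi1 + Phi2 @ Corr
    Psi_eval = Phi_eval[:, :N] + Phi_eval[:, N:] @ Corr
    b = np.linalg.solve(Psi, y)                  # stable coefficients, Psi b = y
    return Psi_eval @ b, Psi, b
```

The returned Psi and b are the ingredients reused by the norm, likelihood, and cross-validation sketches later in the talk.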

10 More About the HS-SVD
Notes
This strategy is often referred to as RBF-QR when Φ_2^T Φ_1^{-T} is computed through a QR factorization.
The vector k(x)^T K^{-1} = ψ(x)^T Ψ^{-1} contains the cardinal functions.
We are confident in the existence of an HS-SVD. Uniqueness is another question...
Strategy
We want to study the role that this idea plays in the objective functions associated with various parameterization schemes. Most of our work today involves exploiting the HS-SVD to try to open up the range of viable ε values for our Gaussians.

11 HS-SVD of K is NOT a Symmetric Decomposition
Interesting Point
Despite the SPD nature of K, Ψ Λ_1 Φ_1^T is NOT a symmetric decomposition, i.e., Ψ ≠ Φ_1. While surprising, this does not severely hinder analysis.
To get some practice working with the HS-SVD, let us study Ψ^{-1} Φ_1 and carefully apply inverses:
Ψ^{-1} Φ_1 = (Φ_1 + Φ_2 Λ_2 Φ_2^T Φ_1^{-T} Λ_1^{-1})^{-1} Φ_1
           = (I_N + Φ_1^{-1} Φ_2 Λ_2 Φ_2^T Φ_1^{-T} Λ_1^{-1})^{-1}
           = Λ_1 (Λ_1 + Φ_1^{-1} Φ_2 Λ_2 Φ_2^T Φ_1^{-T})^{-1}
           = Λ_1 (Λ_1 + G^T G)^{-1},   where G = Λ_2^{1/2} Φ_2^T Φ_1^{-T}.
Clearly, Ψ^{-1} Φ_1 is symmetric, since it is a product of symmetric matrices.

12 HS-SVD of K Approaches a Symmetric Decomposition as ε → 0
We know that Ψ^{-1} Φ_1 = Λ_1 (Λ_1 + G^T G)^{-1}, with G = Λ_2^{1/2} Φ_2^T Φ_1^{-T}. We can study the magnitude of these components:
lim_{ε→0} ‖Λ_1‖_2 = 1,
lim_{ε→0} ‖G^T G‖_2 = lim_{ε→0} ‖Λ_2^{1/2} Φ_2^T Φ_1^{-T}‖_2² = O(ε^{2N^{1/d}}).
This suggests that (Λ_1 + G^T G)^{-1} → Λ_1^{-1} and Ψ^{-1} Φ_1 → Λ_1 Λ_1^{-1} = I_N as ε → 0.
Applied to Parameterization Schemes
Enough practice... let's see how this really works.

13 Kernel-based Interpolation as Kriging
Two strategies for studying scattered data {x_i, y_i}_{i=1}^N ≡ {X, y}:
Numerical Analysis
Data was generated by a function from the RKHS induced by K, H_K:
s(x) = k(x)^T K^{-1} y,   |y(x) - s(x)| ≤ ‖y‖_{H_K} √( K(x, x) - k(x)^T K^{-1} k(x) ).
Spatial Statistics
Data was realized from a zero-mean Gaussian random field Y with covariance K:
Y_x | {X, y} ∼ N( k(x)^T K^{-1} y,  K(x, x) - k(x)^T K^{-1} k(x) ).
Note that V(x) = P(x)² = K(x, x) - k(x)^T K^{-1} k(x).

14 Kriging Variance
Minimizing V(x) would, in a sense, yield the most trustworthy predictions. Its computation can be subject to ill-conditioning as ε → 0:
V(x) = K(x, x) - k(x)^T K^{-1} k(x) = K(x, x) - ψ(x)^T Ψ^{-1} k(x).
Recall the cardinal functions: k(x)^T K^{-1} = ψ(x)^T Ψ^{-1}. Better stability, but numerical cancelation is unavoidable.
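
A sketch of this partially stabilized computation (argument names are ours; Psi and Psi_eval are assumed to come from the HS-SVD sketch earlier):

```python
import numpy as np

def kriging_variance(Kdiag, Keval, Psi, Psi_eval):
    """Kriging variance V(x) = K(x, x) - psi(x)^T Psi^{-1} k(x) at P points.

    Kdiag    : (P,)   K(x, x) at the evaluation points (all ones for Gaussians)
    Keval    : (P, N) rows k(x)^T = (K(x, x_1), ..., K(x, x_N))
    Psi      : (N, N) stable-basis matrix from the HS-SVD sketch above
    Psi_eval : (P, N) rows psi(x)^T at the evaluation points
    """
    # cardinal functions: each row is k(x)^T K^{-1} = psi(x)^T Psi^{-1}
    cardinal = np.linalg.solve(Psi.T, Psi_eval.T).T
    # the subtraction still suffers cancelation, so accuracy bottoms out
    # near machine precision even though Lambda_1^{-1} never appears
    return Kdiag - np.sum(cardinal * Keval, axis=1)
```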

15 Golomb-Weinberger Bound
The standard native space error bound can be augmented into the Golomb-Weinberger bound [6]:
|y(x) - s(x)| ≤ ‖y‖_{H_K} √V(x) ≤ (1 + δ) ‖s‖_{H_K} √V(x),
for some small δ value related to the quality of the interpolation. Ignoring the fact that √V(x) is still limited by cancelation, we must also compute ‖s‖_{H_K} successfully to evaluate this bound. Recall that, through the reproducing property,
‖s‖²_{H_K} = ⟨s, s⟩_{H_K} = ⟨y^T K^{-1} k(·), k(·)^T K^{-1} y⟩_{H_K} = y^T K^{-1} ⟨k(·), k(·)^T⟩_{H_K} K^{-1} y = y^T K^{-1} K K^{-1} y = y^T K^{-1} y.
We will study this using the HS-SVD.

16 The Hilbert Space Norm (Mahalanobis Distance) y^T K^{-1} y
Define Ψ b = y, the system for the stable basis interpolation coefficients. Using this and K = Ψ Λ_1 Φ_1^T in y^T K^{-1} y gives
y^T K^{-1} y = b^T Ψ^T Φ_1^{-T} Λ_1^{-1} Ψ^{-1} Ψ b = b^T (Ψ^{-1} Φ_1)^{-T} Λ_1^{-1} b.
Our old friend Ψ^{-1} Φ_1 = (I_N + Φ_1^{-1} Φ_2 Λ_2 Φ_2^T Φ_1^{-T} Λ_1^{-1})^{-1} returns! Canceling the double inverse and applying the transpose gives
y^T K^{-1} y = b^T (I_N + Λ_1^{-1} Φ_1^{-1} Φ_2 Λ_2 Φ_2^T Φ_1^{-T}) Λ_1^{-1} b
            = b^T Λ_1^{-1} b + b^T Λ_1^{-1} Φ_1^{-1} Φ_2 Λ_2 Φ_2^T Φ_1^{-T} Λ_1^{-1} b
            = b^T Λ_1^{-1} b + (Λ_2^{1/2} Φ_2^T Φ_1^{-T} Λ_1^{-1} b)^T (Λ_2^{1/2} Φ_2^T Φ_1^{-T} Λ_1^{-1} b)
            ≥ b^T Λ_1^{-1} b.
Computing this quantity is feasible, although it grows quickly. The bound in the last line may be rather tight for small ε.
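
In code the split into the main term and its correction is direct; this sketch assumes b, Phi1, Phi2, lam1, and lam2 are the quantities from the HS-SVD sketch above, with names chosen here for illustration.

```python
import numpy as np

def native_norm_sq(b, Phi1, Phi2, lam1, lam2):
    """y^T K^{-1} y via the HS-SVD, returned as (full value, lower bound).

    Full value: b^T Lambda_1^{-1} b + ||Lambda_2^{1/2} Phi_2^T Phi_1^{-T} Lambda_1^{-1} b||^2
    Lower bound: b^T Lambda_1^{-1} b
    """
    main = np.sum(b ** 2 / lam1)                               # b^T Lambda_1^{-1} b
    w = np.sqrt(lam2) * (np.linalg.solve(Phi1, Phi2).T @ (b / lam1))
    return main + w @ w, main
```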

17 The Hilbert Space Norm - A Tight Bound
Again considering Gaussians with small ε,
lim_{ε→0} ‖Λ_2^{1/2} Φ_2^T Φ_1^{-T} Λ_1^{-1} b‖_2 ≤ lim_{ε→0} ‖Λ_2^{1/2} Φ_2^T Φ_1^{-T}‖_2 ‖Λ_1^{-1}‖_2 ‖b‖_2 ∼ ε^{N^{1/d}} ε^{-2(N^{1/d}-1)} ‖b‖_2 = ε^{2-N^{1/d}} ‖b‖_2.
This lets us say that
‖Λ_2^{1/2} Φ_2^T Φ_1^{-T} Λ_1^{-1} b‖_2² ∼ ε^{4-2N^{1/d}} ‖b‖_2²,
as compared to
b^T Λ_1^{-1} b = ‖Λ_1^{-1/2} b‖_2² ∼ ‖Λ_1^{-1/2}‖_2² ‖b‖_2² = ε^{2-2N^{1/d}} ‖b‖_2².
Thus we may compute, or at least bound, ‖s‖_{H_K} = √(y^T K^{-1} y).

18 Maximum Likelihood Estimation
Even though cancelation in the kriging variance prevents us from fully leveraging y^T K^{-1} y in the Golomb-Weinberger bound, it can be used in the maximum likelihood estimator [11]. Saving some time, we write that the MLE ε is the ε which minimizes
C_MLE(ε) = N log(y^T K^{-1} y) + log det K.
Using the HS-SVD, we know that
log det K = log det(Ψ Λ_1 Φ_1^T) = log det Ψ + log det Λ_1 + log det Φ_1.
This gives us a tool for parameterizing Gaussians with small ε.
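
A sketch of the resulting objective, again assuming the HS-SVD factors and norm_sq = y^T K^{-1} y come from the earlier sketches; one would recompute those quantities for each candidate ε and minimize this value, e.g., over a logarithmic grid.

```python
import numpy as np

def mle_criterion(Psi, Phi1, lam1, norm_sq, N):
    """C_MLE(ep) = N log(y^T K^{-1} y) + log det K, with
    log det K = log det Psi + log det Lambda_1 + log det Phi_1."""
    _, logdet_Psi = np.linalg.slogdet(Psi)       # log |det Psi|
    _, logdet_Phi1 = np.linalg.slogdet(Phi1)     # log |det Phi_1|
    # det K > 0, so working with the absolute values of the factors suffices
    logdet_K = logdet_Psi + np.sum(np.log(lam1)) + logdet_Phi1
    return N * np.log(norm_sq) + logdet_K
```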

19 MLE Example
Example: N = 169 Halton points fit to log(3 + x + y) + 2(1 + x + y) cos(2xy) on [-1, 1]².
[Figure: MLE-related quantities versus ε for this example, showing the primary bound b^T Λ_1^{-1} b and the correction ‖Λ_2^{1/2} Φ_2^T Φ_1^{-T} Λ_1^{-1} b‖_2².]

20 Maximum A Posteriori Estimation
When prior beliefs regarding an optimal ε value are available, they can supplement the likelihood function. This idea is discussed in general in, e.g., [8], and in the context of Gaussian processes in [12] and [14].

21 Conclusion
Results
Repeating the strategy for interpolation, the HS-SVD may provide opportunities to study Gaussian parameterization schemes for small ε.
The Hilbert space norm, MLE, and cross-validation (not discussed) are viable for small ε.
Other PD kernels will work as well if their eigenexpansion is available or can be approximated.
To Be Accomplished
Kriging variance is better but still troubled.
Reduce computational cost.
Consider incorporation with matrix-free approaches, e.g., [1].
Tests on applications.

22 Appendix: k-fold Cross-Validation
Cross-Validation Strategy
Using subsets of your data X_I, predict values at omitted points X_O and use the residual to measure the quality of that parameterization.
If we block up our kernel system K c = y as
K = ( K_II  K_IO ; K_OI  K_OO ),   y = ( y_I ; y_O ),   c = ( c_I ; c_O ),
the residual at X_O is y_O - K_OI K_II^{-1} y_I. Summing all the residuals over k disjoint omitted point sets O = {X_O^{(1)}, ..., X_O^{(k)}} would produce
C_CV(ε; O) = Σ_{X_O ∈ O} ‖y_O - K_OI K_II^{-1} y_I‖.
This is what we want to minimize, but K_II^{-1} may be dangerous.

23 Appendix: k-fold Cross-Validation with HS-SVD
Rewriting the blocks of K with the HS-SVD gives
K = ( K_II  K_IO ; K_OI  K_OO ) = ( Ψ_II Λ_I Φ_I^T   Ψ_IO Λ_O Φ_O^T ; Ψ_OI Λ_I Φ_I^T   Ψ_OO Λ_O Φ_O^T ).
Notably, this says K_II = Ψ_II Λ_I Φ_I^T and K_OI = Ψ_OI Λ_I Φ_I^T. Substituting into the residual y_O - K_OI K_II^{-1} y_I,
K_OI K_II^{-1} = Ψ_OI Λ_I Φ_I^T Φ_I^{-T} Λ_I^{-1} Ψ_II^{-1} = Ψ_OI Ψ_II^{-1},
gives the stable residual (unsurprisingly) as y_O - Ψ_OI Ψ_II^{-1} y_I.

24 Appendix: Cross-Validation, Rippa Trick
Define the blocked kernel system K c = y, so that c = A y with A = K^{-1}:
K = ( K_II  K_IO ; K_OI  K_OO ),   A = ( A_II  A_IO ; A_OI  A_OO ),   y = ( y_I ; y_O ),   c = ( c_I ; c_O ).
This says that c_O = A_OI y_I + A_OO y_O, and A K = I gives us A_OI = -A_OO K_IO^T K_II^{-1}, so
c_O = -A_OO K_OI K_II^{-1} y_I + A_OO y_O,   A_OO^{-1} c_O = -K_OI K_II^{-1} y_I + y_O.
Thus, the residual y_O - K_OI K_II^{-1} y_I = A_OO^{-1} c_O.

25 Appendix: Cross-Validation, Rippa Trick for HS-SVD
Rewriting the blocks of K with the HS-SVD gives
K = ( K_II  K_IO ; K_OI  K_OO ) = ( Ψ_II  Ψ_IO ; Ψ_OI  Ψ_OO ) ( Λ_I Φ_I^T  0 ; 0  Λ_O Φ_O^T ),
so, by absorbing the ΛΦ^T matrix into c, K c = y becomes
( Ψ_II  Ψ_IO ; Ψ_OI  Ψ_OO ) ( b_I ; b_O ) = ( y_I ; y_O ),   ( b_I ; b_O ) = ( B_II  B_IO ; B_OI  B_OO ) ( y_I ; y_O ),
where B Ψ = I. The absorption gives us c_O = Φ_O^{-T} Λ_O^{-1} b_O and, if you stare long enough, you can show A_OO = Φ_O^{-T} Λ_O^{-1} B_OO, so
A_OO^{-1} c_O = B_OO^{-1} b_O.
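
As a sketch (the fold handling and names are ours), this identity lets every fold's residual be read off from a single factorization of Ψ, or of K itself when conditioning allows. With singleton folds it reduces to Rippa's leave-one-out rule, residual_i = c_i / A_ii.

```python
import numpy as np

def cv_residuals(Psi, y, folds):
    """k-fold cross-validation residuals via the blockwise identity.

    For each omitted index set O, the residual y_O - K_OI K_II^{-1} y_I
    equals B_OO^{-1} b_O, where B = Psi^{-1} and Psi b = y.
    """
    B = np.linalg.inv(Psi)
    b = B @ y
    # e.g. folds = np.array_split(permuted_indices, k) for some index permutation
    return [np.linalg.solve(B[np.ix_(O, O)], b[O]) for O in folds]

# A CV objective is then, e.g., the sum of squared residual norms over folds:
# C_CV = sum(np.sum(r ** 2) for r in cv_residuals(Psi, y, folds))
```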

26 Appendix: Kriging Variance - A Deeper Exploration
The HS-SVD helps avoid instability, but it cannot circumvent numerical cancelation, hence the accuracy limit near machine precision. The earlier computation showed the basic idea behind using the HS-SVD; here is another approach, aimed at computing V(x) more efficiently. Using k(x)^T = ψ(x)^T Λ_1 Φ_1^T and K = Ψ Λ_1 Φ_1^T in k(x)^T K^{-1} k(x) gives
k(x)^T K^{-1} k(x) = ψ(x)^T Λ_1 Φ_1^T ( Ψ Λ_1 Φ_1^T )^{-1} Φ_1 Λ_1 ψ(x) = ψ(x)^T Ψ^{-1} Φ_1 Λ_1 ψ(x).
Aha! We recognize Ψ^{-1} Φ_1 from our earlier practice (practice makes perfect).

27 Appendix: Kriging Variance - Fun With Matrices
We use the result Ψ^{-1} Φ_1 = Λ_1 (Λ_1 + Φ_1^{-1} Φ_2 Λ_2 Φ_2^T Φ_1^{-T})^{-1} in our current expression:
k(x)^T K^{-1} k(x) = ψ(x)^T Ψ^{-1} Φ_1 Λ_1 ψ(x) = ψ(x)^T Λ_1 (Λ_1 + Φ_1^{-1} Φ_2 Λ_2 Φ_2^T Φ_1^{-T})^{-1} Λ_1 ψ(x).
We can pull Λ_1^{1/2} from both sides of the inverse to write this as
k(x)^T K^{-1} k(x) = ψ(x)^T Λ_1^{1/2} (I_N + Λ_1^{-1/2} Φ_1^{-1} Φ_2 Λ_2 Φ_2^T Φ_1^{-T} Λ_1^{-1/2})^{-1} Λ_1^{1/2} ψ(x)
                   = ψ(x)^T Λ_1^{1/2} (I_N + L L^T)^{-1} Λ_1^{1/2} ψ(x),
where L = Λ_1^{-1/2} Φ_1^{-1} Φ_2 Λ_2^{1/2}. As ε → 0, for Gaussians,
‖L L^T‖_2 ≤ ‖L‖_2² ≤ ‖Λ_2^{1/2} Φ_2^T Φ_1^{-T}‖_2² ‖Λ_1^{-1/2}‖_2² ∼ ε^{2N^{1/d}} ε^{-2(N^{1/d}-1)} = ε².
So for small ε, a von Neumann series allows us to write
k(x)^T K^{-1} k(x) ≈ ψ(x)^T Λ_1 ψ(x) - ψ(x)^T Λ_1^{1/2} L L^T Λ_1^{1/2} ψ(x) + ...

28 Appendix: Kriging Variance - Reduced Computation
So, for small ε, we can write
k(x)^T K^{-1} k(x) ≈ ψ(x)^T Λ_1 ψ(x) - ψ(x)^T Λ_1^{1/2} L L^T Λ_1^{1/2} ψ(x) + ...
This allows us to evaluate the kriging variance for small ε as
V(x) = K(x, x) - k(x)^T K^{-1} k(x) ≈ K(x, x) - ψ(x)^T Λ_1 ψ(x) + ψ(x)^T Λ_1^{1/2} L L^T Λ_1^{1/2} ψ(x) - ...
After forming the stable basis ψ, no further matrix inverses (such as ψ(x)^T Ψ^{-1}) need be computed. Of course, this will not bypass the numerical cancelation, but it can reduce the cost.
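
A sketch of the reduced-cost evaluation; lam1 and L = Λ_1^{-1/2} Φ_1^{-1} Φ_2 Λ_2^{1/2} are assumed to be precomputed from the HS-SVD factors, and only the first two terms of the series are kept.

```python
import numpy as np

def kriging_variance_series(Kdiag, Psi_eval, lam1, L):
    """Small-ep approximation of V(x) using the truncated von Neumann series.

    Kdiag    : (P,)      K(x, x) at the evaluation points
    Psi_eval : (P, N)    rows psi(x)^T
    lam1     : (N,)      diagonal of Lambda_1
    L        : (N, M-N)  Lambda_1^{-1/2} Phi_1^{-1} Phi_2 Lambda_2^{1/2}
    """
    t = Psi_eval * np.sqrt(lam1)[None, :]       # rows psi(x)^T Lambda_1^{1/2}
    quad = np.sum(t ** 2, axis=1)               # psi^T Lambda_1 psi
    corr = np.sum((t @ L) ** 2, axis=1)         # psi^T Lam1^{1/2} L L^T Lam1^{1/2} psi
    return Kdiag - quad + corr                  # no linear solves required
```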

References
[1] M. Anitescu, J. Chen, and L. Wang. A matrix-free approach for solving the parametric Gaussian process maximum likelihood problem. SIAM Journal on Scientific Computing, 34(1):A240-A262, 2012.
[2] R. Cavoretto, G. E. Fasshauer, and M. McCourt. An introduction to the Hilbert-Schmidt SVD using iterated Brownian bridge kernels. Numerical Algorithms, pages 1-30.
[3] G. E. Fasshauer and M. J. McCourt. Stable evaluation of Gaussian radial basis function interpolants. SIAM J. Sci. Comput., 34(2):A737-A762, 2012.
[4] G. E. Fasshauer and Q. Ye. Reproducing kernels of generalized Sobolev spaces via a Green function approach with distributional operators. Numerische Mathematik, 119, 2011.
[5] B. Fornberg and C. Piret. A stable algorithm for flat radial basis functions on a sphere. SIAM J. Sci. Comput., 30(1):60-80, 2007.
[6] M. Golomb and H. F. Weinberger. Optimal approximation and error bounds. In R. E. Langer, editor, On Numerical Approximation. University of Wisconsin Press, 1959.
[7] A. Iske. Multiresolution Methods in Scattered Data Modelling, volume 37 of Lecture Notes in Computational Science and Engineering. Springer, Berlin, 2004.
[8] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001.
[9] L.-T. Luh. The shape parameter in the Gaussian function. Computers & Mathematics with Applications, 63(3), 2012.
[10] W. R. Madych. Miscellaneous error bounds for multiquadric and related interpolators. Computers & Mathematics with Applications, 24(12), 1992.
[11] K. V. Mardia and R. J. Marshall. Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika, 71(1), 1984.
[12] B. Nagy, J. L. Loeppky, and W. J. Welch. Fast Bayesian inference for Gaussian process models. Technical Report 230, Dept. Statistics, Univ. British Columbia.
[13] R. Schaback and H. Wendland. Kernel techniques: From machine learning to meshless methods. Acta Numerica, 15, 2006.
[14] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1998.
[15] B. Zwicknagl and R. Schaback. Interpolation and approximation in Taylor spaces. Journal of Approximation Theory, 171:65-83, 2013.
