APPROXIMATING GAUSSIAN PROCESSES


APPROXIMATING GAUSSIAN PROCESSES WITH H^2-MATRICES

Steffen Börm (Christian-Albrechts-Universität zu Kiel)
Jochen Garcke (Universität Bonn and Fraunhofer SCAI)

OUTLINE

1. Gaussian processes
2. Hierarchical matrices
3. H^2-matrix
4. Results

GAUSSIAN PROCESSES

given: a set of data S = {(x_i, y_i)}_{i=1}^N ⊂ R^d × R

assume a Gaussian process prior on f(x), i.e. the values f(x) are Gaussian distributed with zero mean and covariance matrix K

the kernel (or covariance) function k(·,·) defines K via K_{ij} = k(x_i, x_j); a typical kernel is the Gaussian RBF k(x, y) = e^{-||x-y||^2 / w}

the representer theorem gives the solution f(x) as

    f(x) = Σ_{i=1}^N α_i k(x_i, x)

where the coefficient vector α is the solution of the linear system

    (K + σ^2 I) α = y

the full N × N matrix leads to O(N^2) complexity, infeasible for large data

an approximation is needed, e.g.
- use a subset of size M in the computational core: O(M^2 N)
- use an iterative solver with an approximation of the matrix-vector product Kα (see the sketch below)
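A minimal dense sketch of this setup (the data, sizes, and solver choice are illustrative, not the paper's implementation): it assembles the full kernel matrix, which is exactly the O(N^2) bottleneck, and solves (K + σ^2 I)α = y with an iterative method that only needs matrix-vector products, the operation an H^2-matrix can approximate.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
N, d, w, sigma = 500, 3, 2.0, 0.1
X = rng.uniform(size=(N, d))                    # training inputs x_i
y = np.sin(X.sum(axis=1)) + 0.01 * rng.standard_normal(N)  # noisy targets

def gauss_rbf(X, Y, w):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / w)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / w)

K = gauss_rbf(X, X, w)                          # full N x N matrix: O(N^2)

# iterative solver: only matrix-vector products with (K + sigma^2 I) are
# needed, exactly the operation an H^2-matrix approximates; the slides use
# GMRES, but CG also works here since K + sigma^2 I is s.p.d.
A = LinearOperator((N, N), matvec=lambda v: K @ v + sigma**2 * v, dtype=float)
alpha, info = cg(A, y)

x_test = rng.uniform(size=(5, d))
f_test = gauss_rbf(x_test, X, w) @ alpha        # f(x) = sum_i alpha_i k(x_i, x)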

HIERARCHICAL MATRICES

data-sparse approximation of the kernel matrix: O(Nm log N) storage, where the (local) rank m controls the accuracy

operations like the matrix-vector product, matrix multiplication, or inversion can now be computed efficiently

an efficient computation of the H-matrix approximation is needed

the H-matrix approach was developed for the efficient treatment of dense matrices arising from the discretization of integral operators; efficient constructions for 2D and 3D problems exist

strongly related to fast multipole methods, panel clustering, and the fast Gauss transform

1D MODEL PROBLEM

in the following we present the underlying ideas in one dimension

we look at blocks in the (permuted) matrix whose corresponding subregions τ and ϱ have a certain distance in 1D

we employ a Taylor expansion to approximate the kernel; the Taylor expansion is only used for the explanation, not in the algorithm

PANEL CLUSTERING

DEGENERATE APPROXIMATION: If k is sufficiently smooth in a subdomain τ × ϱ, we can approximate it by a Taylor series:

    k̃(x, y) := Σ_{ν=0}^{m-1} (x - x_τ)^ν / ν! · ∂^ν k / ∂x^ν (x_τ, y)    (x ∈ τ, y ∈ ϱ)

FACTORIZATION: For i, j ∈ I with x_i ∈ τ and x_j ∈ ϱ we find

    K_{ij} = k(x_i, x_j) ≈ k̃(x_i, x_j) = Σ_{ν=0}^{m-1} [(x_i - x_τ)^ν / ν!] · [∂^ν k / ∂x^ν (x_τ, x_j)]
           = Σ_{ν=0}^{m-1} (A_{τ,ϱ})_{iν} (B_{τ,ϱ})_{jν}

    with (A_{τ,ϱ})_{iν} := (x_i - x_τ)^ν / ν!  and  (B_{τ,ϱ})_{jν} := ∂^ν k / ∂x^ν (x_τ, x_j)

FACTORIZATION IN BLOCK FORM: For the index sets τ̂ := {i : x_i ∈ τ} and ϱ̂ := {j : x_j ∈ ϱ} we find

    K|_{τ̂ × ϱ̂} ≈ A_{τ,ϱ} B_{τ,ϱ}^T

Storage: m(#τ̂ + #ϱ̂) units instead of (#τ̂)(#ϱ̂).

RESULT: Significant reduction of the storage requirements if m ≪ #τ̂, #ϱ̂. (A numerical sketch follows below.)
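The factorization can be checked numerically. A sketch for the 1D Gaussian kernel k(x, y) = e^{-(x-y)^2/w} on two separated clusters, where the derivatives come from the Rodrigues formula d^ν/dt^ν e^{-t^2} = (-1)^ν H_ν(t) e^{-t^2}; the point sets and the order m are illustrative choices.

import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermval

w = 1.0
m = 8                                   # expansion order / local rank

# two well-separated 1D clusters tau and rho
xi = np.linspace(0.0, 1.0, 50)          # points in tau
xj = np.linspace(3.0, 4.0, 60)          # points in rho
x_tau = xi.mean()                       # expansion point of tau

def dkernel_dx(nu, x, y, w):
    """nu-th x-derivative of k(x, y) = exp(-(x - y)^2 / w) via the
    Rodrigues formula, with t = (x - y) / sqrt(w)."""
    t = (x - y) / np.sqrt(w)
    e = np.zeros(nu + 1); e[nu] = 1.0   # coefficient vector selecting H_nu
    return (-1 / np.sqrt(w))**nu * hermval(t, e) * np.exp(-t**2)

A = np.array([[(x - x_tau)**nu / factorial(nu) for nu in range(m)] for x in xi])
B = np.array([[dkernel_dx(nu, x_tau, y, w) for nu in range(m)] for y in xj])

K = np.exp(-(xi[:, None] - xj[None, :])**2 / w)   # exact block
err = np.linalg.norm(K - A @ B.T, 2) / np.linalg.norm(K, 2)
print(f"rank-{m} Taylor block, relative spectral error: {err:.2e}")

The error decays rapidly with m because the clusters are well separated; for touching clusters the Taylor series converges too slowly to be useful, which is exactly what the admissibility condition below rules out.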

CLUSTER TREE AND BLOCK PARTITION

GOAL: Split Ω × Ω into subdomains satisfying the admissibility condition

    diam(τ) ≤ 2 dist(τ, ϱ)

(the factor 2 is chosen for demonstration purposes)

Start with τ = ϱ = Ω: nothing is admissible. τ and ϱ are subdivided: still nothing is admissible. We split the intervals again and find a first admissible block; on the next level we find six admissible blocks. The procedure is repeated until only a small subdomain remains (up to a small remainder).

RESULT: The domain Ω × Ω is partitioned into blocks τ × ϱ. The clusters τ, ϱ ⊆ Ω are organized in a cluster tree. (See the sketch below.)
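A compact sketch of this construction (the interval endpoints, the leaf size, and all names are illustrative): it recursively bisects τ and ϱ and emits a block as soon as it becomes admissible, keeping small inadmissible blocks as dense remainders.

import numpy as np

def build_block_partition(a_t, b_t, a_r, b_r, leaf_size, blocks):
    """Recursively partition [a_t,b_t] x [a_r,b_r] (intervals of tau and rho)
    into blocks satisfying diam(tau) <= 2 dist(tau, rho)."""
    diam = b_t - a_t
    dist = max(a_r - b_t, a_t - b_r, 0.0)    # distance of the two intervals
    if diam <= 2.0 * dist:
        blocks.append(((a_t, b_t), (a_r, b_r), "admissible"))
    elif diam <= leaf_size:                  # small remainder: store densely
        blocks.append(((a_t, b_t), (a_r, b_r), "dense"))
    else:                                    # bisect both clusters
        m_t, m_r = (a_t + b_t) / 2, (a_r + b_r) / 2
        for t in ((a_t, m_t), (m_t, b_t)):
            for r in ((a_r, m_r), (m_r, b_r)):
                build_block_partition(*t, *r, leaf_size, blocks)

blocks = []
build_block_partition(0.0, 1.0, 0.0, 1.0, leaf_size=1/32, blocks=blocks)
print(len(blocks), "blocks,",
      sum(b[2] == "admissible" for b in blocks), "admissible")

With the factor 2, the first admissible blocks appear after two subdivisions, matching the build-up on the slides.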

HIERARCHICAL MATRIX

IDEA: Use a low-rank approximation in all admissible blocks τ̂ × ϱ̂.

The standard representation of the original matrix K requires N^2 units of storage. Replace each admissible block K|_{τ̂ × ϱ̂} by the low-rank approximation A_{τ,ϱ} B_{τ,ϱ}^T and leave the inadmissible blocks unchanged.

RESULT: Hierarchical matrix approximation K̃ of K.

STORAGE REQUIREMENTS: One row of K̃ is represented by only O(m log N) units of storage; the total storage requirement is O(Nm log N).

SECOND APPROACH: CROSS APPROXIMATION

OBSERVATION: If M is a rank-1 matrix and we have pivot indices i*, j* with M_{i*,j*} ≠ 0, we get the representation

    M = a b^T,    a_i := M_{i,j*} / M_{i*,j*},    b_j := M_{i*,j}

IDEA: If M can be approximated by a rank-1 matrix, we can still find i*, j* with M_{i*,j*} ≠ 0 and M ≈ a b^T.

HIGHER RANK: Repeating the procedure for the error matrix yields a rank-m approximation of arbitrary accuracy.

EFFICIENT: If the pivot indices are known, only m rows and columns of M are required to construct a rank-m approximation.

PROBLEM: Selection of the pivot indices; efficient strategies are needed. Provable in certain settings; in our case it works in practice (but we do not yet have a proof). (A sketch follows below.)
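A sketch of the procedure with full pivoting, which scans the whole error matrix to pick i*, j*; the practical algorithm uses partial pivoting so that only m rows and columns are ever touched. The tolerance and test data are illustrative.

import numpy as np

def aca_full_pivot(M, tol=1e-8, max_rank=50):
    """Cross approximation with full pivoting: M ~ A @ B.T."""
    R = M.copy()                       # error matrix
    rows, cols = [], []
    for _ in range(max_rank):
        i, j = np.unravel_index(np.abs(R).argmax(), R.shape)
        if abs(R[i, j]) <= tol:
            break
        a = R[:, j] / R[i, j]          # a_k := R_{k,j*} / R_{i*,j*}
        b = R[i, :].copy()             # b_k := R_{i*,k}
        R -= np.outer(a, b)            # repeat on the error matrix
        rows.append(a); cols.append(b)
    return np.column_stack(rows), np.column_stack(cols)

# admissible block of the 1D Gaussian kernel from before
x = np.linspace(0.0, 1.0, 80); y = np.linspace(3.0, 4.0, 90)
M = np.exp(-(x[:, None] - y[None, :])**2)
A, B = aca_full_pivot(M)
print("rank:", A.shape[1],
      "rel. error:", np.linalg.norm(M - A @ B.T) / np.linalg.norm(M))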

UNIFORM HIERARCHICAL MATRIX

GOAL: Reduce the storage requirements.

APPROACH: Expansion in both variables,

    k(x, y) ≈ Σ_{ν+µ<m} ∂^{ν+µ} k / ∂x^ν ∂y^µ (x_τ, y_ϱ) · (x - x_τ)^ν / ν! · (y - y_ϱ)^µ / µ!,

yields the low-rank factorization

    K|_{τ̂ × ϱ̂} ≈ V_τ S_{τ,ϱ} V_ϱ^T

    with (V_τ)_{iν} := (x_i - x_τ)^ν / ν!  and  (S_{τ,ϱ})_{νµ} := ∂^{ν+µ} k / ∂x^ν ∂y^µ (x_τ, y_ϱ).

IMPORTANT: V_τ depends only on one cluster (τ). Only the small matrix S_{τ,ϱ} ∈ R^{m×m} depends on both clusters. (See the sketch below.)
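A sketch of the three-term structure for the 1D Gaussian kernel. It reuses the monomial cluster bases (V_τ)_{iν} = (x_i - x_τ)^ν/ν! from above but, to keep the example short, fits the small coupling matrix S by least squares instead of evaluating the mixed Taylor derivatives; this substitution is for illustration only, not the expansion itself.

import numpy as np
from math import factorial

m = 8
x = np.linspace(0.0, 1.0, 80); y = np.linspace(3.0, 4.0, 90)
x_tau, y_rho = x.mean(), y.mean()

# cluster bases: (V_tau)_{i,nu} = (x_i - x_tau)^nu / nu!
V_tau = np.array([[(xi - x_tau)**nu / factorial(nu) for nu in range(m)] for xi in x])
V_rho = np.array([[(yj - y_rho)**mu / factorial(mu) for mu in range(m)] for yj in y])

K = np.exp(-(x[:, None] - y[None, :])**2)        # exact admissible block

# fit the m x m coupling matrix S by least squares (illustrative stand-in
# for the mixed Taylor derivatives at (x_tau, y_rho))
S = np.linalg.pinv(V_tau) @ K @ np.linalg.pinv(V_rho).T
err = np.linalg.norm(K - V_tau @ S @ V_rho.T) / np.linalg.norm(K)
print("S shape:", S.shape, "rel. error:", f"{err:.2e}")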

H^2-MATRIX

IDEA: Use the three-term factorization in all admissible blocks τ̂ × ϱ̂.

The standard representation of the original matrix K requires N^2 units of storage. Replace each admissible block K|_{τ̂ × ϱ̂} by the low-rank approximation V_τ S_{τ,ϱ} V_ϱ^T and leave the inadmissible blocks unchanged.

Use a nested representation for the cluster basis: transfer matrices T_{τ'} ∈ R^{m×m} with

    V_τ|_{τ̂'} = V_{τ'} T_{τ'}    for all sons τ' ∈ sons(τ)

allow the cluster basis (V_τ)_τ to be handled efficiently (see the sketch below).

RESULT: H^2-matrix approximation K̃ of K.

STORAGE REQUIREMENTS: One row of K̃ is represented by only O(m) units of storage; the total storage requirement is O(Nm).
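For the monomial basis above, the transfer matrices can be written down explicitly via the binomial shift of the expansion center: (x - x_τ)^ν/ν! = Σ_{µ≤ν} (x - x_{τ'})^µ/µ! · (x_{τ'} - x_τ)^{ν-µ}/(ν-µ)!. A sketch with illustrative cluster sizes and points:

import numpy as np
from math import factorial

m = 6
pts = np.linspace(0.0, 1.0, 16)          # points of parent cluster tau
sons = [pts[:8], pts[8:]]                # two sons from binary splitting
x_tau = pts.mean()

def basis(points, center):
    """Monomial cluster basis (V)_{i,nu} = (x_i - center)^nu / nu!."""
    return np.array([[(x - center)**nu / factorial(nu) for nu in range(m)]
                     for x in points])

V_tau = basis(pts, x_tau)
for s_idx, son in enumerate(sons):
    c_son = son.mean()
    # transfer matrix from the binomial shift of the expansion center:
    # (T)_{mu,nu} = (c_son - x_tau)^(nu-mu) / (nu-mu)!  for mu <= nu
    T = np.zeros((m, m))
    for nu in range(m):
        for mu in range(nu + 1):
            T[mu, nu] = (c_son - x_tau)**(nu - mu) / factorial(nu - mu)
    V_son = basis(son, c_son)
    rows = slice(8 * s_idx, 8 * (s_idx + 1))
    print("max error:", np.abs(V_tau[rows] - V_son @ T).max())

Because every son reuses its own basis plus a small m × m transfer matrix, each cluster basis is stored only once, which is what brings the storage from O(Nm log N) down to O(Nm).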

OVERALL ALGORITHM

1: build the hierarchy of clusters t by binary space partitioning
2: build the blocks t × s using the admissibility criterion
3: for all admissible blocks t × s do
4:     use ACA to compute a low-rank approximation K|_{t×s} ≈ A B^T
5: end for
6: for all remaining blocks t × s do
7:     compute the standard representation K̃|_{t×s} = K|_{t×s}
8: end for
9: coarsen the block structure adaptively with blockwise SVDs
10: convert the H-matrix into an H^2-matrix

(a sketch of the recompression used in step 9 follows below)
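Step 9 relies on recompressing low-rank blocks. A sketch of the blockwise SVD truncation of a factorization A B^T via thin QR factorizations; the tolerance and test data are illustrative.

import numpy as np

def recompress(A, B, eps=1e-6):
    """Truncate a low-rank factorization A @ B.T by a blockwise SVD."""
    QA, RA = np.linalg.qr(A)                 # A = QA RA
    QB, RB = np.linalg.qr(B)                 # B = QB RB
    U, s, Vt = np.linalg.svd(RA @ RB.T)      # small (rank x rank) SVD
    k = max(1, int((s > eps * s[0]).sum()))  # adaptive rank
    A_new = QA @ (U[:, :k] * s[:k])          # absorb singular values
    B_new = QB @ Vt[:k].T
    return A_new, B_new

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 20)); B = rng.standard_normal((150, 20))
A[:, 10:] *= 1e-9                            # product is numerically rank 10
A2, B2 = recompress(A, B)
print("rank:", A2.shape[1],
      "error:", np.linalg.norm(A @ B.T - A2 @ B2.T))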

NUMERICAL RESULTS

data sets:
- network of simple sensor motes (Intel Lab Data): predict the temperature at a mote from the measurements of neighbouring ones; mote22 consists of training / 2500 test data from 2 other motes, mote47 of training / 2000 test data from 3 nearby motes
- helicopter flight project: predict the yaw rate in the next timestep based on 3 measurements; heliyaw has training / 4000 test data in 3 dimensions

using the Gaussian RBF kernel e^{-||x-y||^2/w}; the hyperparameters w and σ were found using a 2:1 split of the training data for each data set size

note: one H^2-matrix approximation can be used for several σ

disregard the actual runtimes, only compare them (numbers are from 2007)

NUMERICAL RESULTS (QUALITY, SPEEDUP)

[Table: time and error for the stored matrix, on-the-fly evaluation, and the H^2-matrix on the mote22, mote47, and heliyaw data sets; columns: data set, #data, time, error, KB/N]

the full matrix can (barely) be stored for the largest N

speedups of two orders of magnitude for large data sets; twice to ten times the speedup of related work

storage reduction of more than one order of magnitude; for the helicopter data set down to 6.6 KB/N, in total from about 6 GB to about 250 MB

MOTE22: H^2-MATRIX APPROXIMATION FOR 5000 DATA

[Figure: block structure of the H^2-matrix approximation]

the difference between the full matrix and the H^2-matrix is measured in the spectral norm

MOTE22: STUDY SCALING OF APPROACHES

using w = 2^9 and σ = 2^5 for different data set sizes; compare the runtime per iteration for the different values of N

on-the-fly computation shows the expected O(N^2) scaling; the stored matrix scales even worse than O(N^2) for larger data sizes; the H^2-matrix scales nearly like O(Nm)

[Table: time, iterations, and time per iteration for the H^2-matrix, the stored matrix, and on-the-fly computation at the different N]

MOTE22: DIFFERENT DATA SET SIZES USING OPTIMAL PARAMETERS w/σ

[Table: optimal w/σ, runtimes for the stored matrix and on-the-fly evaluation, error, and KB/N of the H^2-matrix at different #data]

the optimal w/σ were found via a 2:1 split of the training data; a different optimal w/σ is found for each data set size, hence the need for parameter tuning on the large data set

the runtime of the H^2-matrix starts to improve on the stored matrix after 5000 data points

more data is useful for better results

EFFECT OF σ ON THE NUMBER OF ITERATIONS

using the data of mote22; results are from the 2:1 split using w = 2^8 and different σ, i.e. the matrix size is fixed

observe how the number of GMRES iterations depends on σ

[Table: MAE and number of iterations for the different σ]

note: with smaller w the number of iterations usually grows as well

HOW CAN ONE GO INTO HIGHER DIMENSIONS?

question: how many dimensions can be handled?

the related recent fast Gauss transform can give insights; we refer to the approach and study of M. Griebel and D. Wissel, Fast approximation of the discrete Gauss transform in higher dimensions, Journal of Scientific Computing, 2013

their study compares:
- FGT (Greengard and Strain, 1991)
- Improved FGT (Yang, Duraiswami, and Gumerov, 2003)
- Dual-Tree FGT (Lee, Gray, and Moore, 2005)
- Optimized FGT (Griebel and Wissel, 2013)

they study the dependence of the algorithms on dimension and bandwidth

OBSERVATIONS FROM GRIEBEL AND WISSEL I

uniform data

[Plot: runtime over the bandwidth h for N = 200,000; curves dc 5, dc 10, dc 20 and of 5, of 10, of 20]

OBSERVATIONS FROM GRIEBEL AND WISSEL II

real-life data is not uniform

[Plot: dc and of curves for the data sets cmo3, cte4, and bio6]

OBSERVATIONS FROM GRIEBEL AND WISSEL III

real-life data is not uniform

[Plot: dc and of curves for the data sets shu9, gal21, and aco50]

CONCLUSIONS

H^2-matrices for approximating Gaussian processes: time O(Nm log N), storage O(Nm); handling of large data sets

open question: how to efficiently build the H^2-matrix in higher dimensions

the H2Lib software library is available online

References (and see the references therein):
- Börm and Garcke, Approximating Gaussian Processes with H2-matrices. ECML 2007, pages 42-53, 2007.
- Griebel and Wissel, Fast approximation of the discrete Gauss transform in higher dimensions. Journal of Scientific Computing, 55(1), 2013.
