Fast Direct Methods for Gaussian Processes


1 Fast Direct Methods for Gaussian Processes Mike O'Neil, Department of Mathematics, New York University. oneil@cims.nyu.edu. December 12,

2 Collaborators This is joint work with: Siva Ambikasaran (ICTS, Tata Institute), Dan Foreman-Mackey (UW Astronomy), Leslie Greengard (NYU Math, Simons Fnd.), David Hogg (NYU Physics), and Sunli Tang (NYU Math).

3 The short story: Results I will present a fast, direct (not iterative) algorithm for rapidly evaluating likelihood functions of the form: L(θ) ∝ (det C)^{-1/2} e^{-y^t C^{-1} y / 2}. These likelihood functions occur universally when modeling under Gaussian process priors.
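
As a concrete baseline, here is a minimal dense-algebra sketch of evaluating this log-likelihood via a Cholesky factorization; this is the O(n^3) computation that the fast algorithm presented below replaces. The kernel, noise level, and data are illustrative assumptions, not taken from the talk:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_log_likelihood(y, C):
    """log L = -y^t C^{-1} y / 2 - (log det C) / 2 - (n/2) log(2 pi)."""
    cf = cho_factor(C, lower=True)                 # dense O(n^3) Cholesky
    alpha = cho_solve(cf, y)                       # C^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(cf[0])))  # log det C from the factor
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(y) * np.log(2.0 * np.pi)

# Illustrative data: noisy observations, squared-exponential kernel plus noise
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(200)
C = np.exp(-(x[:, None] - x[None, :]) ** 2) + 0.01 * np.eye(200)
print(gp_log_likelihood(y, C))
```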

4 Outline
- Probability review
- Gaussian process overview
- Applications: regression & prediction
- Computational bottlenecks
- A fast algorithm
- Numerical results
- Other and future work

5 Multivariate normal The n-dimensional normal distribution has the following properties: P(X ∈ Ω) = ∫_Ω (2π)^{-n/2} (det C)^{-1/2} e^{-(y - μ)^t C^{-1} (y - μ)/2} dy, where C is a positive-definite covariance matrix with C_ij = Cov(X_i, X_j).
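
A quick numerical check of this density using scipy (the mean and covariance below are arbitrary examples, chosen only to be positive-definite):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(3)
C = np.array([[2.0, 0.5, 0.2],     # an arbitrary positive-definite covariance
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.5]])

rv = multivariate_normal(mean=mu, cov=C)
print(rv.pdf([0.1, -0.2, 0.3]))    # density at a point
print(rv.rvs(size=5))              # five draws from N(mu, C)
```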

6 Stochastic processes Stochastic processes are simply the generalization of random variables to random functions.

7 Gaussian processes are stochastic processes.

8 Gaussian processes Random functions with structured covariance.

9 Gaussian processes We say that f(x) ~ GP(m(x), k(x, x')) if and only if, for any points x_1, ..., x_n, (f(x_1), ..., f(x_n)) ~ N(m, C), where m = (m(x_1), ..., m(x_n))^t and C_ij = k(x_i, x_j). The function k is the covariance kernel.
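
A short sketch of this definition in code: sampling f at finitely many points is just sampling a multivariate normal with C_ij = k(x_i, x_j). The zero mean and squared-exponential kernel are assumptions for illustration, and a small jitter is added for numerical stability:

```python
import numpy as np

def sample_gp_prior(x, kernel, n_samples=3, jitter=1e-8):
    """Draw realizations of f ~ GP(0, k) at the points x."""
    C = kernel(x[:, None], x[None, :]) + jitter * np.eye(len(x))  # C_ij = k(x_i, x_j)
    L = np.linalg.cholesky(C)                   # C = L L^t
    return L @ np.random.randn(len(x), n_samples)

k = lambda a, b: np.exp(-(a - b) ** 2 / 0.5)    # assumed squared-exponential kernel
x = np.linspace(0.0, 5.0, 100)
f = sample_gp_prior(x, k)                       # each column is one random function
```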

10 Gaussian processes This means that each realization of a Gaussian process is a function: the corresponding probability distribution is over functions. The function sampled at a selection of points behaves like a multivariate normal distribution, and each marginal distribution is normal: f(x_i) ~ N(m(x_i), C_ii).

11 Role of the covariance kernel Admissible covariance kernels must give rise to positive-definite covariance matrices: y^t K(x, x) y > 0 for all choices of x, y. Common choices are the exponential kernel k(x, x') = e^{-|x - x'|/a}, the squared-exponential (Gaussian) kernel k(x, x') = e^{-(x - x')^2/b}, and Matérn-type kernels of the form k(x, x') = (√2 |x - x'|/c)^ν K_ν(√2 |x - x'|/c), where K_ν is a modified Bessel function.
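
These choices are straightforward to code and sanity-check. The sketch below builds each kernel matrix on random points and inspects the smallest eigenvalue; the parameter values are arbitrary, and the Matérn-type form follows the reconstruction above (normalized so that k(0) = 1):

```python
import numpy as np
from scipy.special import gamma, kv

def k_exponential(r, a=1.0):
    return np.exp(-np.abs(r) / a)

def k_gaussian(r, b=1.0):
    return np.exp(-r ** 2 / b)

def k_matern(r, nu=1.5, c=1.0):
    """Matern-type kernel; kv is the modified Bessel function K_nu."""
    z = np.sqrt(2.0) * np.abs(r) / c
    z = np.where(z == 0.0, 1e-12, z)            # avoid 0 * inf at r = 0
    return z ** nu * kv(nu, z) / (2 ** (nu - 1) * gamma(nu))

x = np.sort(np.random.uniform(-3, 3, 15))
r = x[:, None] - x[None, :]
for k in (k_exponential, k_gaussian, k_matern):
    print(k.__name__, np.linalg.eigvalsh(k(r)).min())  # positive, up to roundoff
```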

12 Many options for the covariance kernel (Figure: K_ν(x) for 0.1 ≤ x ≤ 5 and 0 ≤ ν ≤ 4; (c) NIST Handbook.)

13 Role of the covariance kernel The covariance kernel controls the regularity and the variability of the Gaussian process. Compare k(x, x') = e^{-(x - x')^2/20} with k(x, x') = e^{-(x - x')^2/0.2}. (Figure: 95% conditional confidence intervals are shaded.)

14 GPs in action: Prediction Imagine we have data {x, y}, and assume a model of the form y_i = f(x_i) + ε_i. Furthermore, assume as priors that f is a Gaussian process and that the ε_i are i.i.d. Gaussian noise: f ~ GP(m_θ, k_θ), ε_i ~ N(0, σ^2). We've assumed that f can depend on some other set of parameters θ, which we may fit later.

15 Prediction This implies that y is also a Gaussian process: y(x) ~ GP(m_θ(x), σ^2 δ(x - x') + k_θ(x, x')). The new covariance matrix is C = σ^2 I + K.

16 Prediction Assume m(x) = 0. Then the joint distribution of y and a new predicted value y* is: (y, y*) ~ N(0, [[C, k(x, x*)], [k(x*, x), k(x*, x*)]]).

17 Prediction The conditional distribution (posterior) of the predicted value can then be calculated as: y* | x, y, x* ~ N(μ*, σ*^2), where μ* = k(x*, x) C^{-1} y and σ*^2 = k(x*, x*) - k(x*, x) C^{-1} k(x, x*), the Schur complement of C.
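
These two formulas translate directly into a few lines of numpy; the kernel, noise level, and data below are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(x, y, xs, kernel, sigma2=0.01):
    """Posterior mean and variance at test points xs, assuming m(x) = 0."""
    C = kernel(x[:, None], x[None, :]) + sigma2 * np.eye(len(x))  # C = sigma^2 I + K
    Ks = kernel(xs[:, None], x[None, :])        # k(x*, x)
    Kss = kernel(xs[:, None], xs[None, :])      # k(x*, x*)
    cf = cho_factor(C, lower=True)
    mu = Ks @ cho_solve(cf, y)                  # k(x*, x) C^{-1} y
    cov = Kss - Ks @ cho_solve(cf, Ks.T)        # Schur complement of C
    return mu, np.diag(cov)

k = lambda a, b: np.exp(-(a - b) ** 2 / 0.5)    # assumed kernel
x = np.random.uniform(-3, 3, 50)
y = np.sin(x)
mu, var = gp_predict(x, y, np.linspace(-3, 3, 200), k)
```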

18 What about the extra parameters? In general, k = k(x, x'; θ_1) and m = m(x; θ_2). The parameters have to be fit to the data (inference), or marginalized away (integrated out).

19 Computational bottlenecks There are three main computational bottlenecks in dealing with large covariance matrices: computation/application of C^{-1}; computation of log det C; and the symmetric factorization C = W^t W. The likelihood is L(θ) = e^{-y^t C(θ)^{-1} y / 2} / ((2π)^{n/2} (det C(θ))^{1/2}).

20 Alternative existing methods To avoid large-scale dense linear algebra, various techniques are often used to approximate the inverse and determinant:
- Element thresholding (banded/sparse matrices)
- Special-case kernels (e.g., exponential)
- Subsampling
- Iterative methods
In many cases, these methods sacrifice the fidelity of the underlying model.

21 Hierarchical matrix compression We want a data-sparse factorization of the covariance matrix which allows for fast inversion: A = A_1 A_2 ⋯ A_m means that A^{-1} = A_m^{-1} A_{m-1}^{-1} ⋯ A_1^{-1} and det A = det A_1 det A_2 ⋯ det A_m. Many factorizations can be used: HODLR, HBS, HSS, etc.

22 Hierarchical matrix compression What properties of the Gaussian process covariance matrix allow for it to be rapidly factored? For example, most covariance kernels yield matrices which are Hierarchically Off-Diagonal Low-Rank (HODLR). (Figure: full-rank diagonal blocks; low-rank off-diagonal blocks.)

23 Hierarchical matrix compression The resulting (numerical) factorization will have the form K = K^{(3)} K_3 K_2 K_1 K_0, where K^{(3)} is block diagonal with full-rank blocks and each K_ℓ is a low-rank update to the identity. (Figure legend: full-rank, low-rank, identity, and zero blocks.) Physical interpretation?

24 Motivation: Physics F = G m_1 m_2 / r^2

25 Motivation: Physics Calculations required for all pairwise forces F = G m_1 m_2 / r^2: naively O(N^2), but (in FINITE arithmetic) O(N) via Fast Multipole Methods.

26 Motivation: Physics Similar algorithms exist for heat flow, where the distribution looks like a Gaussian. (Figure: temperature vs. distance.)

27 Connection with statistics Time series data occurring close in time has a stronger relationship than data occurring far apart. Often this covariance is modeled using the heat kernel. (Figure: temperature vs. distance.)

28 A 1-level scheme - in pictures We can easily show how the factorization works for a 1-level scheme. (Figure: the matrix equals a block-diagonal full-rank factor times a factor that is the identity plus low-rank off-diagonal blocks.)

29 A 1-level scheme - in formulae We can easily show how the factorization works for a 1-level scheme:
[[A_11, U V^t], [V U^t, A_22]] = [[A_11, 0], [0, A_22]] × [[I, A_11^{-1} U V^t], [A_22^{-1} V U^t, I]]
Assume we have the factors U and V for now.
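
A numerical check of this one-level identity on random blocks (the block sizes, conditioning, and rank are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
A11 = np.eye(n) + 0.1 * rng.random((n, n))   # invertible diagonal blocks
A22 = np.eye(n) + 0.1 * rng.random((n, n))
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))

A = np.block([[A11, U @ V.T], [V @ U.T, A22]])
D = np.block([[A11, np.zeros((n, n))], [np.zeros((n, n)), A22]])
S = np.block([[np.eye(n), np.linalg.solve(A11, U @ V.T)],
              [np.linalg.solve(A22, V @ U.T), np.eye(n)]])
print(np.allclose(A, D @ S))                 # True: block diagonal x (I + low-rank)
```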

30 A 2-level scheme - in formulae A 2-level factorization is slightly more complicated. Write A_1 = [[A_11, U_1 V_1^t], [V_1 U_1^t, A_22]] and A_2 = [[A_33, U_2 V_2^t], [V_2 U_2^t, A_44]]. Then
[[A_1, U V^t], [V U^t, A_2]]
  = diag(A_11, A_22, A_33, A_44)
  × diag([[I, A_11^{-1} U_1 V_1^t], [A_22^{-1} V_1 U_1^t, I]], [[I, A_33^{-1} U_2 V_2^t], [A_44^{-1} V_2 U_2^t, I]])
  × [[I, A_1^{-1} U V^t], [A_2^{-1} V U^t, I]]

31 A 2-level scheme - in formulae How do we invert A_1 = [[A_11, U_1 V_1^t], [V_1 U_1^t, A_22]]? Write it as [[A_11, 0], [0, A_22]] + [[U_1, 0], [0, V_1]] [[0, V_1^t], [U_1^t, 0]]. We can then use the Sherman-Morrison-Woodbury formula: (A + UV)^{-1} = A^{-1} - A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}. If U, V have rank k, then I + V A^{-1} U is only a k × k matrix.
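
A sketch of the Woodbury update in numpy, illustrating that only a small k × k system need be solved once A^{-1} is available (the diagonal A and the sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 5
d = rng.random(n) + 1.0                       # A is diagonal, so A^{-1} is trivial
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))

Ainv = np.diag(1.0 / d)
cap = np.eye(k) + V @ Ainv @ U                # the small k x k "capacitance" matrix
woodbury = Ainv - Ainv @ U @ np.linalg.solve(cap, V @ Ainv)
print(np.allclose(woodbury, np.linalg.inv(np.diag(d) + U @ V)))   # True
```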

32 A log(n)-level scheme Proceeding recursively, K = K^{(3)} K_3 K_2 K_1 K_0. (Figure legend: full-rank, low-rank, identity, and zero blocks.)

33 Low-rank factorizations Two classes of techniques:
- Linear algebraic: SVD, LU, LSR, etc.
- Analytic: explicit outer-product expansions
Most linear algebraic dense methods scale as O(n^2 k). Analytic (polynomial) interpolation of the kernel can scale as O(n k^2). Special function expansions may be fastest.

34 Analytic approximation For example, the squared-exponential (Gaussian) function can be written as: e^{-(t - s)^2/ℓ} = e^{-t^2/ℓ} Σ_{n=0}^∞ (1/n!) (s/√ℓ)^n H_n(t/√ℓ), where the H_n are Hermite polynomials. Similar (approximate) expansions can be calculated for most kernels in terms of other special functions.

35 Analytic approximation Matérn kernels: k(r) = (2^{1-ν}/Γ(ν)) (√(2ν) r/ℓ)^ν K_ν(√(2ν) r/ℓ). Bessel functions obey the addition formula K_ν(w + z) = Σ_{j=-∞}^∞ (-1)^j K_{ν+j}(w) I_j(z), valid for |z| < |w|.

36 High-order interpolation When given a form of the covariance kernel, we have the ability to evaluate it anywhere, including at locations other than those supplied in the data: K(x, y) ≈ Σ_{i=1}^p Σ_{j=1}^p K(x_i, y_j) L_i(x) L_j(y), where the L_i are Lagrange interpolation polynomials. Choose the interpolation nodes to be Chebyshev nodes for stable evaluation and formulaic factorizations.
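
A sketch of this interpolation-based low-rank factorization for a well-separated block of a kernel matrix; the kernel, intervals, and rank p are assumptions chosen for illustration:

```python
import numpy as np

def cheb_nodes(a, b, p):
    """p Chebyshev nodes on the interval [a, b]."""
    t = np.cos((2 * np.arange(p) + 1) * np.pi / (2 * p))
    return 0.5 * (a + b) + 0.5 * (b - a) * t

def lagrange_basis(nodes, x):
    """Matrix with entry [i, j] = L_j(x_i), the j-th Lagrange polynomial at x_i."""
    L = np.ones((len(x), len(nodes)))
    for j, xj in enumerate(nodes):
        for m, xm in enumerate(nodes):
            if m != j:
                L[:, j] *= (x - xm) / (xj - xm)
    return L

kernel = lambda x, y: np.exp(-(x - y) ** 2)   # assumed smooth covariance kernel
x = np.random.uniform(0.0, 1.0, 300)          # source points
y = np.random.uniform(2.0, 3.0, 300)          # well-separated target points

p = 10
xn, yn = cheb_nodes(0.0, 1.0, p), cheb_nodes(2.0, 3.0, p)
U = lagrange_basis(xn, x) @ kernel(xn[:, None], yn[None, :])   # n x p factor
V = lagrange_basis(yn, y)                                      # n x p factor
err = np.abs(U @ V.T - kernel(x[:, None], y[None, :])).max()
print(err)                                    # small: the block is numerically rank p
```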

37 Computational complexity Using ACA (LU) with randomized accuracy checking, or Chebyshev interpolation, we can factor A = UV in O(np^2) time, where p is the numerical rank. Computing all low-rank factorizations costs O(np^2). On level k, we have to update k other U blocks, costing O(knp^2); summing over levels, Σ_{k=1}^{log n} O(k p^2 n) = O(p^2 n log^2 n). Inversion and determinant are then both O(p^2 n log n).

38 Fast determinant calculation Rapidly calculating the determinant of this factorization relies on Sylvester's Determinant Theorem. Theorem 4.1 (Sylvester's Determinant Theorem). If A ∈ R^{m×n} and B ∈ R^{n×m}, then det(I_m + AB) = det(I_n + BA), where I_k ∈ R^{k×k} is the identity matrix. In particular, for a rank-p update to the identity matrix, det(I_n + U_{n×p} V_{p×n}) = det(I_p + V_{p×n} U_{n×p}). The dense determinant of a p × p matrix requires O(p^3) operations.
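
The identity is easy to verify numerically (the sizes below are arbitrary); the point is that the n × n determinant reduces to a p × p one:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 4
U = rng.standard_normal((n, p))
V = rng.standard_normal((p, n))

big = np.linalg.slogdet(np.eye(n) + U @ V)     # O(n^3) the slow way
small = np.linalg.slogdet(np.eye(p) + V @ U)   # O(p^3) via Sylvester's theorem
print(np.allclose(big[1], small[1]))           # log-determinants agree
```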

39 Numerical results Timings for the one-dimensional Gaussian covariance function, with matrix entries C_ij = δ_ij + exp(-|r_i - r_j|^2), where the r_i are uniformly distributed random points in the interval [-3, 3]. (Table: assembly, factorization, solve, and determinant times in seconds, with errors, as n grows; plot: solve time vs. system size for direct and fast 1D/2D/3D solvers. Results computed on a MacBook Air.)

40 Numerical results Timings for the one-dimensional biharmonic covariance function, with matrix entries C_ij = 2δ_ij + |r_i - r_j|^2 log(|r_i - r_j|), where the r_i are uniformly distributed random points in the interval [-3, 3]. (Table: assembly, factorization, solve, and determinant times in seconds, with errors, as n grows; plot: solve time vs. system size for exponential, inverse multiquadric, and biharmonic kernels. Results computed on a MacBook Air.)

41 Versus existing Python packages

42 Random variate sampling To sample from a high-dimensional Gaussian distribution, it is necessary to apply a symmetric factor of C: if z ~ N(0, I), then x = C^{1/2} z ~ N(0, C), where C = C^{1/2} C^{1/2} = L L^t = W W^t.
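
A dense sketch of this sampling step, using the Cholesky factor as W; for large n, the hierarchical symmetric factorization on the next slide plays the same role. The kernel and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-3, 3, 400))
C = np.exp(-(x[:, None] - x[None, :]) ** 2) + 0.01 * np.eye(400)

W = np.linalg.cholesky(C)                      # one valid symmetric factor: C = W W^t
z = rng.standard_normal((400, 20000))          # z ~ N(0, I)
samples = W @ z                                # each column is a draw from N(0, C)
print(np.abs(np.cov(samples) - C).max())       # sample covariance approaches C
```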

43 Hierarchical Symmetric Factorization We wish to obtain a factorization of the form A = (A_κ A_{κ-1} A_{κ-2} ⋯ A_0)(A_0^t ⋯ A_{κ-2}^t A_{κ-1}^t A_κ^t) = W W^t, where W = A_κ A_{κ-1} ⋯ A_0. (Figure: the case κ = 3, A = A_3 A_2 A_1 A_0 A_0^t A_1^t A_2^t A_3^t.)

44 In conclusion We just presented fast direct algorithms for computing with Gaussian processes. A certain class of algorithms can be used to efficiently manipulate their covariance matrices:
- Hierarchical factorization of covariance matrices
- Inversion
- Determinant computation
- Symmetric factorization

45 Available Python software Download from:

46 Other activities
- Accelerated HSS/HBS matrix factorizations
- Custom covariances
- Efficient stable distribution calculations
- Symmetric factorizations in physics (mobility)
Even more activities:
- Fast algorithms for PDEs and integral equations in classical mathematical physics
- Electromagnetics, fluids, plasma physics, etc.

47 Thanks! Reference papers: The best Gaussian process reference: For a nice Python package: dan.iel.fm/george/ Funding: AFOSR NSSEFF Program Award FA; AIG-NYU Award #A
