Fast large scale Gaussian process regression using the improved fast Gauss transform


VIKAS CHANDRAKANT RAYKAR and RAMANI DURAISWAMI
Perceptual Interfaces and Reality Laboratory, Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD

Gaussian processes allow the treatment of non-linear non-parametric regression problems in a Bayesian framework. However, the computational cost of training such a model with $N$ examples scales as $O(N^3)$. Iterative methods for the solution of linear systems can bring this cost down to $O(N^2)$, which is still prohibitive for large data sets. In this paper we use an $\epsilon$-exact approximation technique, the improved fast Gauss transform, to reduce the computational complexity to $O(N)$ for the squared exponential covariance function. Using the theory of inexact Krylov subspace methods we show how to adaptively increase $\epsilon$ at each iteration. For prediction at $M$ points the computational complexity is reduced from $O(NM)$ to $O(M + N)$. We demonstrate the speedup achieved on large synthetic and real data sets. We also show how the hyperparameters can be chosen in $O(N)$ time. Unlike methods which rely on selecting a subset of the data set, we use all the available data and still obtain substantial speedups.

Technical report CS-TR-xxxx/UMIACS-TR-2005-xx, April 8, 2006.

Contents

1 Introduction
  1.1 Novel contributions
  1.2 Relation to previous work
2 The Gaussian process model
3 Computational and space complexity
  3.1 Training
    3.1.1 Conjugate gradient method
    3.1.2 $\epsilon$-exact matrix-vector multiplication
    3.1.3 Adaptively choosing $\epsilon$
  3.2 Mean prediction
  3.3 Predictive variances
4 Regression experiments
5 Choosing the hyperparameters
  5.1 Derivatives
  5.2 Computing the trace
6 Conclusion

1. INTRODUCTION

A crucial task in supervised learning is regression, where a predictive relationship between the inputs and outputs must be found. Given a set of $N$ i.i.d. training examples $\mathcal{T} = \{x_i \in \mathbb{R}^d,\, y_i \in \mathbb{R}\}_{i=1}^N$ drawn from an unknown distribution, the task of regression is to predict the output $f_* = f(x_*)$ for a new input value $x_* \in \mathbb{R}^d$. The function $f : \mathbb{R}^d \to \mathbb{R}$ is called the regression function. The parametric approach assumes a functional form for $f$ and then estimates the unknown parameters. However, unless the form of the function is known a priori, assuming a certain form very often leads to erroneous inference. On the other hand, nonparametric methods do not make any assumptions on the form of the underlying function. The Gaussian process model [Rasmussen and Williams 2006; Seeger 2004] provides a powerful framework to handle nonparametric regression problems in the Bayesian paradigm. The regression function is represented by an ensemble of functions, on which we place a Gaussian prior. This prior is updated in the light of the training data. As a result we obtain predictions together with valid estimates of uncertainty (see Figure 1 for an illustration).

The computational complexity of training a Gaussian process model with $N$ training examples scales as $O(N^3)$. The space complexity is $O(N^2)$. This makes training prohibitively expensive even for moderately sized datasets (typically a few thousand points). The core computational bottleneck is the inversion of the $N \times N$ Gram matrix. Iterative methods like conjugate gradient can reduce the computational complexity to $O(N^2)$. However, quadratic complexity is still prohibitive for large datasets. The dominant $O(N^2)$ cost in the conjugate-gradient procedure is due to the multiplication of the Gram matrix with a vector.

A Gaussian process is completely specified by its mean and covariance functions. Different forms of the covariance function give us the flexibility to model different kinds of generative processes. One of the most popular covariance functions is the negative squared exponential (Gaussian). For the Gaussian covariance function we accelerate the matrix-vector multiplication using the improved fast Gauss transform (IFGT) [Raykar et al. 2005; Yang et al. 2005]. This reduces the computational and storage costs to $O(N)$. The algorithm is $\epsilon$-exact in the sense that the constant hidden in $O(N)$ depends on the desired accuracy, which can be arbitrary. In fact, for machine precision accuracy there is no difference between the direct and the fast methods.

1.1 Novel contributions

The following are the novel contributions of this paper.

- Training time for Gaussian process regression is reduced to linear $O(N)$ by using the conjugate-gradient method coupled with the IFGT.
- The prediction time per test input is reduced to $O(1)$ and the time to compute the variance is reduced to $O(N)$.
- Using results from the theory of inexact Krylov subspace methods we show that the matrix-vector product may be performed in an increasingly inexact manner as the iteration progresses and still allow convergence to the solution.
- We also show how the hyperparameters can be chosen in $O(N)$ time. The dominant computation involves computing sums of a polynomial times a Gaussian, for which we develop an $\epsilon$-exact approximation algorithm based on ideas similar to the IFGT. The trace computation is done using a randomized algorithm.

1.2 Relation to previous work

Various schemes have been proposed to make Gaussian process regression computationally tractable [Williams and Seeger 2001; Smola and Bartlett 2001; Fine and Scheinberg 2001; Lawrence et al. 2003; Csato and Opper 2002; Tresp 2000; Tipping 2001]. A good review can be found in Chapter 8 of [Rasmussen and Williams 2006]. Most of the proposed schemes are based on using a representative subset of the training examples of size $m \ll N$. The training time is generally $O(m^2 N)$. The various schemes differ in how to effectively choose the subset. Unlike methods which rely on choosing a subset of the dataset, we use all the available points and still achieve $O(N)$ complexity. Also, in these methods there is no guarantee on the approximation of the kernel matrix in a deterministic sense. It should be noted that most of the above schemes can be further accelerated to $O(mN)$ using the proposed method. An approach related to our proposed method is that of [Shen et al. 2005], who use kd-trees to speed up the matrix-vector multiplication. The IFGT gives good speedup for large bandwidth kernels. For small bandwidths, dual-tree methods [Deng and Moore 1995; Gray and Moore 2003] can be used.

2. THE GAUSSIAN PROCESS MODEL

The simplest and most often used model for regression [Williams and Rasmussen 1996] is $y = f(x) + \varepsilon$, where $f(x)$ is a zero-mean Gaussian process with covariance function $K(x, x') : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and $\varepsilon$ is independent zero-mean normally distributed noise with variance $\sigma^2$, i.e., $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Therefore the observation process $y(x)$ is a zero-mean Gaussian process with covariance function $K(x, x') + \sigma^2\delta(x, x')$. Given the training data $\mathcal{T} = \{x_i, y_i\}_{i=1}^N$, the $N \times N$ covariance matrix $K$ is defined as $[K]_{ij} = K(x_i, x_j)$. If we define the vector $y = [y_1, \ldots, y_N]^T$, then $y$ is a zero-mean multivariate Gaussian with covariance matrix $K + \sigma^2 I$.

Given the training data $\mathcal{T}$ and a new input $x_*$, our task is to compute the posterior $p(f_* \mid x_*, \mathcal{T})$. Observing that the joint density $p(f_*, y)$ is a multivariate Gaussian, the posterior density $p(f_* \mid x_*, \mathcal{T})$ can be shown to be [Rasmussen and Williams 2006]

$$p(f_* \mid x_*, \mathcal{T}) \sim \mathcal{N}\!\left( k(x_*)^T (K + \sigma^2 I)^{-1} y,\;\; K(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*) \right),$$

where $k(x_*) = [K(x_*, x_1), \ldots, K(x_*, x_N)]^T$. If we define

$$\xi = (K + \sigma^2 I)^{-1} y, \qquad (1)$$

then the mean prediction is

$$E[f_*] = k(x_*)^T \xi, \qquad (2)$$

and the variance associated with the prediction is

$$\mathrm{Var}[f_*] = K(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*). \qquad (3)$$

The covariance function has to be chosen to reflect the prior information we have about the nature of the problem. For high-dimensional problems, in the absence of any prior knowledge, the isotropic negative squared exponential (Gaussian) is the most widely used covariance function.
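To make Equations (1)-(3) concrete, here is a minimal NumPy sketch of the direct posterior computation for a single test point. The function and argument names are ours, not from the paper, and a Cholesky solve is used as the $O(N^3)$ baseline that the fast machinery developed below replaces.

```python
import numpy as np

def gp_posterior(K, k_star, k_star_star, y, sigma):
    """Exact GP posterior for one test point, following Eqs. (1)-(3).

    K           : (N, N) covariance matrix of the training inputs
    k_star      : (N,)   vector [K(x_*, x_1), ..., K(x_*, x_N)]
    k_star_star : scalar K(x_*, x_*)
    y           : (N,)   training targets
    sigma       : noise standard deviation
    """
    N = K.shape[0]
    K_tilde = K + sigma**2 * np.eye(N)                    # K + sigma^2 I
    L = np.linalg.cholesky(K_tilde)                       # direct O(N^3) factorization
    xi = np.linalg.solve(L.T, np.linalg.solve(L, y))      # xi = (K + sigma^2 I)^{-1} y, Eq. (1)
    v = np.linalg.solve(L.T, np.linalg.solve(L, k_star))  # (K + sigma^2 I)^{-1} k(x_*)
    mean = k_star @ xi                                    # Eq. (2)
    var = k_star_star - k_star @ v                        # Eq. (3)
    return mean, var
```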

Fig. 1. (a) The mean prediction and the error bars obtained when a Gaussian process was used to model the data shown by the black points. A squared exponential covariance function was used. Note that the error bars increase where the data is sparse. The hyperparameters $h$ and $\sigma$ were chosen by minimizing the negative log-likelihood of the training data, the contours of which are shown in (b).

The covariance function which we use in this paper is of the form

$$K(x, x') = a\, e^{-\|x - x'\|^2/h^2}, \qquad (4)$$

where $h$ is the characteristic length scale parameter of the process. This covariance function reflects the fact that nearby inputs will have highly correlated outputs. The parameters $h$, $\sigma$, and $a$ are referred to as the hyperparameters.

3. COMPUTATIONAL AND SPACE COMPLEXITY

In this section we discuss the computational complexity of the different phases and show how it can be reduced using the conjugate-gradient method coupled with the IFGT.

3.1 Training

Given the hyperparameters $h$, $a$, and $\sigma$, the training phase consists of the evaluation of the vector $\xi = (K + \sigma^2 I)^{-1} y$, which needs the inversion of the $N \times N$ matrix $K + \sigma^2 I$. Direct computation of the inverse of a matrix (using LU decomposition or Gauss-Jordan elimination) requires $O(N^3)$ operations and $O(N^2)$ storage¹, which is impractical even for problems of moderate size (typically a few thousand points).

3.1.1 Conjugate gradient method. An effective alternative is to solve the following large scale linear system using iterative methods:

$$(K + \sigma^2 I)\xi = y. \qquad (5)$$

Since the matrix $\tilde K = K + \sigma^2 I$ is symmetric and positive definite we can use the well known conjugate-gradient method [Hestenes and Stiefel 1952] to solve the linear system $\tilde K \xi = y$.

¹ A numerically more stable and faster method is to compute the inverse using the Cholesky factorization [Rasmussen and Williams 2006], which takes $N^3/6$ time.
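As a concrete illustration of Equations (4) and (5), the sketch below builds the squared exponential Gram matrix and solves the jittered linear system for $\xi$ with a dense solve. The helper names and the toy data are ours; the dense solve is shown only as the baseline that the conjugate-gradient and IFGT machinery avoids.

```python
import numpy as np

def sq_exp_gram(X, a, h):
    """Gram matrix [K]_ij = a * exp(-||x_i - x_j||^2 / h^2), Eq. (4)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return a * np.exp(-sq_dists / h**2)

def train_direct(X, y, a, h, sigma):
    """Solve (K + sigma^2 I) xi = y, Eq. (5), by a dense O(N^3) solve."""
    N = X.shape[0]
    K_tilde = sq_exp_gram(X, a, h) + sigma**2 * np.eye(N)
    return np.linalg.solve(K_tilde, y)

# Illustrative usage on toy 1D data (values chosen for illustration only)
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 10.0, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(200)
    xi = train_direct(X, y, a=1.0, h=0.4, sigma=0.2)
```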

The method relies on the fact that the solution to the system of equations $\tilde K \xi = y$ minimizes the quadratic form $g(\xi) = \frac{1}{2}\xi^T \tilde K \xi - y^T \xi$. A good exposition of the method can be found in Chapter 2 of [Kelley 1995]. The idea of using conjugate gradient for Gaussian processes was first suggested by [MacKay and Gibbs 1997]. The iterative method generates a sequence of approximate solutions $\xi_k$ which converge to the true solution $\xi$. One of the sharpest known results for the convergence of the iterates is

$$\frac{\|\xi - \xi_k\|_{\tilde K}}{\|\xi - \xi_0\|_{\tilde K}} \le 2\left[\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right]^{2k}, \qquad (6)$$

where the $\tilde K$-norm of $w$ is defined as $\|w\|_{\tilde K} = \sqrt{w^T \tilde K w}$ [Kelley 1995]. The constant $\kappa = \lambda_{\max}/\lambda_{\min}$, the ratio of the largest to the smallest eigenvalue, is called the spectral condition number of the matrix $\tilde K$. Since $\kappa \in (1, \infty)$, Equation 6 implies that if the condition number of $\tilde K$ is close to one, then the iterates will converge very quickly. Given a tolerance parameter $0 < \eta < 1$, a practical conjugate-gradient scheme iterates until it computes a vector $\xi_k$ such that

$$\frac{\|y - \tilde K \xi_k\|_2}{\|y - \tilde K \xi_0\|_2} \le \eta, \qquad (7)$$

where $\|y - \tilde K \xi_k\|_2$ is the residual in the Euclidean norm at the end of the $k$th iteration. Most implementations start the iteration at $\xi_0 = 0$. The relative residual in the Euclidean norm is related to the relative error in the $\tilde K$-norm as follows [Kelley 1995]:

$$\frac{\|y - \tilde K \xi_k\|_2}{\|y - \tilde K \xi_0\|_2} \le \sqrt{\kappa}\,\frac{\|\xi - \xi_k\|_{\tilde K}}{\|\xi - \xi_0\|_{\tilde K}} \le 2\sqrt{\kappa}\left[\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right]^{2k}. \qquad (8)$$

This implies that for a given tolerance parameter $\eta$ the number of iterations required is

$$k \ge \frac{\ln\left[2\sqrt{\kappa}/\eta\right]}{2\ln\left[\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right]}. \qquad (9)$$

Sometimes the estimate 9 can be very pessimistic. Even if the condition number is large, the convergence is fast if the eigenvalues are clustered in a few small intervals [Kelley 1995].

The actual implementation of the conjugate gradient method requires one matrix-vector multiplication and $5N$ flops per iteration. Four vectors of length $N$ are required for storage. Hence the computational cost of the conjugate-gradient method is dominated by the matrix-vector product, which is $O(kN^2)$, where $k$ is the number of iterations required. The storage is $O(N)$ since the matrix-vector multiplication for our matrix $\tilde K$ can be computed without explicitly storing the entire matrix. For the $O(kN^2)$ cost the conjugate gradient should be scalable, i.e., the number of iterations $k$ should tend to a constant value as $N \to \infty$ and not grow as a function of $N$. From Equation 9 it can be seen that the number of iterations depends on the spectral condition number $\kappa$ of the matrix $\tilde K$.
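The key practical point of the discussion above is that CG touches $\tilde K$ only through matrix-vector products, so any fast matvec can be dropped in. The following minimal sketch (our own, not the paper's MATLAB/C++ implementation) takes the matvec as a callable and stops on the relative-residual criterion of Equation (7).

```python
import numpy as np

def conjugate_gradient(matvec, y, eta=1e-3, max_iter=1000):
    """Solve K_tilde xi = y given only a function computing K_tilde @ v.

    Stops when ||y - K_tilde xi_k|| / ||y - K_tilde xi_0|| <= eta (Eq. 7),
    with xi_0 = 0 so the initial residual is simply y.
    """
    xi = np.zeros_like(y)
    r = y.copy()                 # residual r_0 = y - K_tilde xi_0 = y
    p = r.copy()                 # initial search direction
    r0_norm = np.linalg.norm(r)
    rs_old = r @ r
    for _ in range(max_iter):
        Kp = matvec(p)           # the only O(N^2) step (O(N) with the IFGT)
        alpha = rs_old / (p @ Kp)
        xi += alpha * p
        r -= alpha * Kp
        if np.linalg.norm(r) / r0_norm <= eta:
            break
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return xi
```

With `matvec = lambda v: K @ v + sigma**2 * v` this reproduces the exact $O(N^2)$-per-iteration solver; replacing the `K @ v` part with an $\epsilon$-exact fast Gauss transform gives the $O(N)$ variant described next.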

If $\lambda$ is an eigenvalue of $K$, then $\lambda + \sigma^2$ is the corresponding eigenvalue of $K + \sigma^2 I$. The $\sigma^2$ term, also referred to as the jitter, helps to reduce the condition number of the matrix $\tilde K$: if $\lambda_{\min}$ is very small (resulting in a large condition number for $K$), the condition number of $K + \sigma^2 I$, i.e., $(\lambda_{\max} + \sigma^2)/(\lambda_{\min} + \sigma^2)$, will be small. In general, even when the estimate 9 is pessimistic, the convergence is decided by the eigenspectrum of the Gram matrix $K$. The matrix eigenvalue problem $K \hat\phi_i = \hat\lambda_i \hat\phi_i$ is the discrete approximation of the following continuous eigenvalue problem:

$$\int K(x, y)\, p(x)\, \phi_i(x)\, dx = \lambda_i \phi_i(y), \qquad (10)$$

where the weight function $p(x)$ is the underlying density function according to which $x$ is sampled and $\phi_i(x)$ is the $i$th eigenfunction with eigenvalue $\lambda_i$. A consequence of this is that the eigenvalues of the Gram matrix $K$ converge to the eigenvalues of the integral equation above as $N$ increases [Williams and Seeger 2000]. More specifically, the theory of the numerical solution of eigenvalue problems shows that $\hat\lambda_i/N$ will converge to $\lambda_i$ in the limit $N \to \infty$. This guarantees that the condition number $\kappa$, and in turn the number of iterations, converges to a constant value [also see Figure 4(a) for empirical results].

3.1.2 $\epsilon$-exact matrix-vector multiplication. The quadratic computational complexity is still too high for large datasets. The core computational step in each conjugate-gradient iteration involves the multiplication of the matrix $K$ with a vector, say $q$. The $j$th element of the matrix-vector product $Kq$ can be written as

$$(Kq)_j = \sum_{i=1}^N a\, q_i\, e^{-\|x_i - x_j\|^2/h^2}. \qquad (11)$$

In general, for each target point $\{y_j \in \mathbb{R}^d\}_{j=1,\ldots,M}$ (which in our case are the same as the source points $x_i$) this can be written as

$$G(y_j) = \sum_{i=1}^N q_i\, e^{-\|y_j - x_i\|^2/h^2}. \qquad (12)$$

Eq. 12 is referred to as the discrete Gauss transform in the scientific computing literature, where $\{q_i \in \mathbb{R}\}_{i=1,\ldots,N}$ are referred to as the source weights, $\{x_i \in \mathbb{R}^d\}_{i=1,\ldots,N}$ are the source points, i.e., the centers of the Gaussians, and $h \in \mathbb{R}^+$ is the source scale or bandwidth. In other words, $G(y_j)$ is the total contribution at $y_j$ of $N$ Gaussians centered at the $x_i$, each with bandwidth $h$ and weighted by $q_i$. The computational complexity of evaluating the discrete Gauss transform at $M$ target points is $O(MN)$.

The fast Gauss transform (FGT) is an $\epsilon$-exact approximation algorithm that reduces the computational complexity to $O(M + N)$, at the expense of reduced precision, which however can be arbitrary. The constant depends on the desired precision, the dimensionality of the problem, and the bandwidth. Given any $\epsilon > 0$, it computes an approximation $\hat G(y_j)$ to $G(y_j)$ such that the maximum absolute error relative to the total weight $Q = \sum_{i=1}^N |q_i|$ is upper bounded by $\epsilon$, i.e.,

$$\max_{y_j}\left[\frac{|\hat G(y_j) - G(y_j)|}{Q}\right] \le \epsilon. \qquad (13)$$
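For reference, here is the direct $O(MN)$ evaluation of the discrete Gauss transform in Equation (12); the IFGT computes an $\epsilon$-exact approximation of exactly this quantity in $O(M + N)$. This brute-force version is our own sketch, useful for checking the error criterion of Equation (13) on small problems.

```python
import numpy as np

def gauss_transform_direct(X, q, Y, h):
    """Direct discrete Gauss transform, Eq. (12):
        G(y_j) = sum_i q_i * exp(-||y_j - x_i||^2 / h^2),
    evaluated at all M targets Y from N weighted sources (X, q). O(MN) cost.
    """
    sq_dists = np.sum((Y[:, None, :] - X[None, :, :])**2, axis=-1)  # (M, N)
    return np.exp(-sq_dists / h**2) @ q

def gram_matvec_direct(X, q, a, h):
    """Matrix-vector product of Eq. (11): the special case Y = X with weights a*q."""
    return gauss_transform_direct(X, a * q, X, h)
```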

Table I. The dominant computational and space complexities for the different stages of Gaussian process regression using different methods. We have N training points and M test points.

                    Direct method          Conjugate gradient     Conjugate gradient + IFGT
                    Time       Space       Time       Space       Time       Space
Training phase      O(N^3)     O(N^2)      O(N^2)     O(N)        O(N)       O(N)
Mean prediction     O(MN)      O(M+N)      O(MN)      O(M+N)      O(M+N)     O(M+N)
Uncertainty         O(MN^2)    O(M+N)      O(MN^2)    O(M+N)      O(MN)      O(M+N)
Hyperparameters     O(N^3)     O(N^2)      O(N^2)     O(N)        O(N)       O(N)

The FGT was first proposed in [Greengard and Strain 1991] and applied successfully to a few lower dimensional applications in mathematics and physics. However, the performance degrades exponentially with increasing dimensionality, which makes it impractical for dimensions greater than three. A closely related algorithm (the improved fast Gauss transform, or IFGT) that is suitable for higher dimensional problems was presented in [Yang et al. 2005; Raykar et al. 2005]; it reduces the constant factor to asymptotically polynomial order. The reduction is achieved using a multivariate Taylor expansion scheme combined with efficient space subdivision using the k-center algorithm. More details of the algorithm can be found in [Raykar et al. 2005]. The IFGT, when coupled with the conjugate-gradient procedure, reduces the computational complexity to $O(N)$.

3.1.3 Adaptively choosing $\epsilon$. When the IFGT is coupled with the conjugate gradient, the question arises as to how to choose $\epsilon$ at each iteration. $\epsilon$ can be set to a convenient small value such as $10^{-3}$ or $10^{-6}$ based on the application. For a more theoretical flavor we can analyze the effect of the approximation on the conjugate gradient method. The conjugate gradient method is a Krylov subspace method adapted for symmetric positive definite matrices. Krylov subspace methods at the $k$th iteration compute an approximation to the solution of $Ax = b$ by minimizing some measure of error over the affine space $x_0 + \mathcal{K}_k$, where $x_0$ is the initial iterate and the $k$th Krylov subspace is $\mathcal{K}_k = \mathrm{span}(r_0, Ar_0, A^2 r_0, \ldots, A^{k-1} r_0)$. The residual at the $k$th iterate is $r_k = b - Ax_k$. A general framework for understanding the effect of an approximate matrix-vector product on Krylov subspace methods for the solution of symmetric and nonsymmetric linear systems of equations is given in [Simoncini and Szyld 2004]. The paper considers the case where at the $k$th iteration, instead of the exact matrix-vector multiplication $Av_k$, the product

$$\widehat{Av_k} = (A + E_k)v_k \qquad (14)$$

is computed, where $E_k$ is an error matrix which can change at every iteration.

A neat result in the paper shows how large $E_k$ can be at each step while still achieving convergence to the desired tolerance. Let $r_k$ be the residual at the end of the $k$th iteration and let $\tilde r_k$ be the corresponding residual when the approximate matrix-vector product is used. If at every iteration

$$\|E_k\| \le l_m\, \frac{1}{\|\tilde r_{k-1}\|}\, \delta, \qquad (15)$$

then at the end of $k$ iterations $\|r_k - \tilde r_k\| \le \delta$ [Simoncini and Szyld 2004]. The term $l_m$ is in general unavailable since it depends on the spectrum of the matrix. However, our empirical results, and also some experiments in [Simoncini and Szyld 2004], suggest that $l_m = 1$ is a reasonable value. This shows that the matrix-vector product may be performed in an increasingly inexact manner as the iteration progresses and still allow convergence to the solution.

For our problem, because of the $\epsilon$-exact approximation criterion (Equation 13), every element in the approximation to the vector $Kq$ is within $\pm Q\epsilon_k$ of the true value, where $Q = \sum_{i=1}^N |q_i|$ and $\epsilon_k$ is the error tolerance for the IFGT at the $k$th iteration. Hence the error matrix $E_k$ is of the form

$$E_k = \epsilon_k \begin{bmatrix} \pm e_{11} & \cdots & \pm e_{1N} \\ \vdots & & \vdots \\ \pm e_{N1} & \cdots & \pm e_{NN} \end{bmatrix}, \qquad (16)$$

where $e_{ij} = \mathrm{sign}(q_j) \in \{+1, -1\}$. It can be seen that $\|E_k\| = N\epsilon_k$. Hence Equation 15 suggests the following strategy to choose $\epsilon_k$:

$$\epsilon_k \le \frac{\delta}{N}\, \frac{\|y - \tilde K\xi_0\|_2}{\|\tilde r_{k-1}\|_2}. \qquad (17)$$

This guarantees that

$$\frac{\|y - \tilde K\xi_k\|_2}{\|y - \tilde K\xi_0\|_2} \le \eta + \delta. \qquad (18)$$

Figure 2 shows the $\epsilon_k$ selected at each iteration for the 1D regression problem discussed in Section 4. As the iteration progresses, the required $\epsilon_k$ increases; the larger the $\epsilon_k$, the better the speedup achieved by the IFGT. In practice we update $\epsilon_k$ only when the residual decreases below the current minimum residual (see the sketch following Section 3.2 below). When the IFGT is coupled with the conjugate gradient there will be an increase in the number of iterations required because of the $\delta$ term [also see Figure 4(b) for empirical results].

3.2 Mean prediction

Once $\xi$ is computed, for any new $x_*$ the mean prediction is given by

$$E[f_*] = k(x_*)^T\xi = \sum_{i=1}^N \xi_i\, K(x_i, x_*). \qquad (19)$$

Predicting at $M$ points is again a matrix-vector multiplication operation. Direct computation of $E[f_*]$ at $M$ test points due to the $N$ training examples is $O(NM)$. Using the IFGT this computational cost can be reduced to $O(N + M)$.
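Putting the CG loop of Section 3.1.1 together with the adaptive tolerance rule of Equation (17) gives the following sketch. This is our own illustration, not the paper's implementation: `ifgt_matvec` is a placeholder for an $\epsilon$-exact IFGT routine returning an approximation of $Kv$ with maximum absolute error at most $\epsilon\sum_i |v_i|$ per element, and the cap `eps_max` is an arbitrary practical safeguard.

```python
import numpy as np

def inexact_cg(ifgt_matvec, y, sigma, eta=1e-3, delta=1e-3,
               eps_max=1e-1, max_iter=1000):
    """CG with an epsilon-exact matvec whose tolerance follows Eq. (17):
        eps_k <= (delta / N) * ||y - K_tilde xi_0|| / ||r_{k-1}||.
    ifgt_matvec(v, eps) approximates K @ v; the sigma^2 * v term is exact.
    """
    N = y.shape[0]
    xi = np.zeros_like(y)
    r = y.copy()                     # xi_0 = 0, so r_0 = y
    p = r.copy()
    r0_norm = np.linalg.norm(r)
    rs_old = r @ r
    for _ in range(max_iter):
        # adaptive IFGT tolerance for this iteration, Eq. (17)
        eps_k = min(eps_max, (delta / N) * r0_norm / np.linalg.norm(r))
        Kp = ifgt_matvec(p, eps_k) + sigma**2 * p
        alpha = rs_old / (p @ Kp)
        xi += alpha * p
        r -= alpha * Kp
        if np.linalg.norm(r) / r0_norm <= eta:   # overall tolerance eta (+ delta), Eq. (18)
            break
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return xi
```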

Fig. 2. The IFGT error tolerance $\epsilon_k$ selected at each iteration for a sample 1D regression problem. The error tolerance for the CG was set to $\eta = 10^{-3}$.

3.3 Predictive variances

The variance for each prediction is given by

$$\mathrm{Var}[f_*] = K(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*). \qquad (20)$$

With direct methods we first need to compute the inverse of the matrix during the training phase. Once the inverse is computed, for each $x_*$ the computation of the uncertainty is $O(N^2)$; for $M$ points it is $O(MN^2)$. Using the conjugate gradient method and the IFGT we can compute $\tilde K^{-1} k(x_*)$ in $O(N)$ time, so for $M$ points we need $O(MN)$ time. Table I compares the computational and space complexities for the different stages of Gaussian process regression using the different methods.

4. REGRESSION EXPERIMENTS

The algorithms were programmed in MATLAB, with the core computational tasks of matrix-vector multiplication and the IFGT written in C++. The experiments were run on a 1.6 GHz Pentium M processor with 512 MB of RAM.

In order to demonstrate the scaling behavior with respect to the number of samples we generated a synthetic one dimensional regression problem. The training data were generated according to the model $y = \sin x + \varepsilon$, where $\varepsilon$ is independent zero-mean normally distributed noise with variance $\sigma^2$. Figure 3(a) shows the total running time in seconds as a function of $N$ for training using the direct method (Cholesky decomposition), conjugate gradient (CG), and the CG coupled with the IFGT. The error tolerance for CG termination was set to $10^{-3}$ and the $\epsilon_k$ were adaptively chosen at each iteration as discussed earlier. The hyperparameters were $h = 0.4$, $\sigma = 0.2$, and $a = 1.0$. The training time using the direct method scales as $O(N^3)$, using the CG as $O(N^2)$, and using the IFGT it is linear, $O(N)$. Because of the limited memory we were not able to run the direct inversion beyond a certain number of data points. Figure 3(b) shows the average mean square training error as a function of $N$ for the different methods; for all the methods the training errors are almost equal.

Figure 3(c) shows the total running time in seconds as a function of $N$ for prediction using the direct method and the IFGT. The test data were generated without any noise. The model was trained on $N$ training samples and tested on $M = N$ samples.

Fig. 3. Scaling behavior for a one dimensional Gaussian process regression problem. (a) Total running time in seconds as a function of N for training using the direct method (Cholesky decomposition), conjugate gradient (CG), and the CG with the IFGT. (b) The average mean square training error. (c) Total running time in seconds for prediction using the direct method and using the IFGT. The model was trained on N training samples and tested on M = N samples. (d) The average mean square test error.

For the direct method the prediction time scales as $O(N^2)$, while for the IFGT it is $O(N)$. Figure 3(d) shows the average mean square test error.

An important aspect to be verified is that the number of iterations needed for the CG procedure tends to a constant as $N$ increases. Figure 4(a) shows the number of iterations needed by CG as a function of $N$ for the same regression problem as discussed above. It can be seen that as $N$ increases the number of iterations reaches a constant value. Figure 4(b) shows the same for CG coupled with the IFGT. The error tolerance for CG termination was set to $10^{-3}$ and the $\delta$ for the IFGT was varied. For the CG coupled with the IFGT there is an increase in the number of iterations; the smaller the value of $\delta$, the closer the number of iterations approaches that required by the exact CG.

In the second experiment, we used four publicly available machine learning datasets².

Fig. 4. Scalability of the conjugate gradient method. (a) The number of iterations taken by the conjugate gradient as a function of N for two different termination tolerances ($\eta = 10^{-3}$ and $10^{-6}$). It can be seen that as N increases the number of iterations reaches a constant value. (b) The number of iterations required by the conjugate gradient coupled with the IFGT for two different values of $\delta$. The $\epsilon_k$ were adaptively chosen.

Table II. Speedup and accuracy for large datasets. Ten-fold training and testing accuracy and the training time in seconds for four different datasets. The bandwidth of the Gaussian covariance function was h = 0.5d. The error tolerance for CG termination was set to $10^{-3}$. The results shown are averaged over all ten folds. The direct inversion could not be run on the large datasets because of limited RAM. The inputs were scaled to lie in a unit hypercube. The outputs were centered to have zero mean. [Columns: dataset size and dimension, and for each of the direct method, CG, and CG+IFGT the training time in seconds and the training and testing errors. Rows: Abalone, comp-activ, pumadyn, census-house.]

Table II shows the results of the ten-fold cross validation experiment on all four datasets. The running times and the root mean squared errors shown for both training and testing are averaged over all ten folds. We used $h = 0.5d$ and $\sigma = 0.1$ for all the experiments. The direct inversion method could not be run for the larger datasets because of limited RAM, since it has a storage complexity of $O(N^2)$. It can be seen that the IFGT accelerated method gives a significant speedup. Also, the training error and the prediction error are almost the same for all the methods.

5. CHOOSING THE HYPERPARAMETERS

The parameters $\theta = (\theta_1, \theta_2, \theta_3)^T = (\sigma, a, h)^T$ are referred to as the hyperparameters. Various methods have been proposed to choose the hyperparameters. In this section we show how the method based on the likelihood can be computed in $O(N)$. For the Gaussian noise model the hyperparameters can be chosen based on the training data. Given a covariance function, the marginal log likelihood of the training data is given by

$$l = \log p(\mathcal{T} \mid \theta) = -\frac{1}{2} y^T \tilde K^{-1} y - \frac{1}{2}\log|\tilde K| - \frac{N}{2}\log 2\pi, \qquad (21)$$

where $\tilde K = K + \sigma^2 I$. Optimizing this with respect to $\theta$ we can obtain the maximum likelihood estimate of the hyperparameters³ (see Figure 1(b)). Initializing the hyperparameters to reasonable random values, we can use an iterative method like nonlinear conjugate gradient⁴ to search for the optimal hyperparameters. Since the number of hyperparameters is small (three in our case), a small number of iterations is sufficient for convergence.

5.1 Derivatives

Each iteration of the conjugate-gradient method requires only the computation of the derivatives and not the function itself. The partial derivatives of $l$ with respect to $\theta$ can be expressed analytically as follows:

$$\frac{\partial l}{\partial \theta_k} = \frac{1}{2} y^T \tilde K^{-1} \frac{\partial \tilde K}{\partial \theta_k} \tilde K^{-1} y - \frac{1}{2}\mathrm{tr}\left[\tilde K^{-1}\frac{\partial \tilde K}{\partial \theta_k}\right]. \qquad (22)$$

² abalone [Newman et al. 1998]: the task is to predict the age of abalone (number of rings) from physical measurements. comp-activ [Delve]: this database contains various performance measures of a multi-processor, multi-user computer system; the task is to predict the portion of time that the CPUs run in user mode. pumadyn [Delve]: a realistic simulation of the dynamics of a Puma 560 robot arm; the task is to predict the angular acceleration of one of the robot arm's links, with inputs including angular positions, velocities, and torques of the robot arm. census-house [Delve]: this dataset was constructed from the 1990 US Census; the task is to predict the median price of a house in a small survey region.
³ It should be noted that $l$ can be multimodal. A sensible starting point or multiple random restarts should alleviate this problem.
⁴ We use the Polak-Ribière formula, which often converges more quickly than the Fletcher-Reeves method. Line search is performed using the secant method, which does not need the computation of second derivatives.
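For reference, here is a direct, small-N evaluation of the marginal log likelihood of Equation (21), which is the objective the hyperparameter search maximizes. The fast method of this section never forms $\tilde K^{-1}$ or the determinant explicitly; this dense version is our own sketch, useful only for checking the fast computations on toy problems.

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma, a, h):
    """Direct O(N^3) evaluation of Eq. (21):
        l = -1/2 y^T K_tilde^{-1} y - 1/2 log|K_tilde| - N/2 log(2 pi).
    """
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K_tilde = a * np.exp(-sq_dists / h**2) + sigma**2 * np.eye(N)
    L = np.linalg.cholesky(K_tilde)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_tilde^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log|K_tilde| from the Cholesky factor
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * N * np.log(2.0 * np.pi)
```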

For the squared exponential covariance function specified in Equation 4, the derivatives of $\tilde K$ with respect to the hyperparameters are as follows:

$$\frac{\partial \tilde K}{\partial \sigma} = 2\sigma I, \qquad
\frac{\partial \tilde K}{\partial a} = K_1 \ \text{where}\ [K_1]_{ij} = e^{-\|x_i - x_j\|^2/h^2}, \qquad
\frac{\partial \tilde K}{\partial h} = K_2 \ \text{where}\ [K_2]_{ij} = \frac{2a}{h^3}\|x_i - x_j\|^2\, e^{-\|x_i - x_j\|^2/h^2}. \qquad (23)$$

Substituting these derivatives in Equation 22 and using the notation $\xi = \tilde K^{-1} y$, we have

$$\frac{\partial l}{\partial \sigma} = -\sigma\,\mathrm{tr}\left[\tilde K^{-1}\right] + \sigma\,\xi^T\xi, \qquad
\frac{\partial l}{\partial a} = -\frac{1}{2}\mathrm{tr}\left[\tilde K^{-1} K_1\right] + \frac{1}{2}\xi^T K_1 \xi, \qquad
\frac{\partial l}{\partial h} = -\frac{1}{2}\mathrm{tr}\left[\tilde K^{-1} K_2\right] + \frac{1}{2}\xi^T K_2 \xi. \qquad (24)$$

We perform the optimization in the log hyperparameter space. With respect to the log hyperparameters the derivatives can be written as

$$\frac{\partial l}{\partial \ln\theta_1} = \theta_1\frac{\partial l}{\partial \theta_1} = \sigma\frac{\partial l}{\partial \sigma}, \qquad
\frac{\partial l}{\partial \ln\theta_2} = \theta_2\frac{\partial l}{\partial \theta_2} = a\frac{\partial l}{\partial a}, \qquad
\frac{\partial l}{\partial \ln\theta_3} = \theta_3\frac{\partial l}{\partial \theta_3} = h\frac{\partial l}{\partial h}. \qquad (25)$$

Direct computation of the derivatives requires $O(N^3)$ time and $O(N^2)$ space. Note that the evaluation of $\mathrm{tr}(AB)$ does not require the explicit calculation of $AB$, since we need to evaluate only the diagonal elements to find the trace. The conjugate gradient method along with the IFGT can be used to compute $\xi = \tilde K^{-1} y$ in $O(N)$ time and $O(N)$ space. Once $\xi$ is computed, $K_1\xi$ can be computed in $O(N)$ time using the IFGT. We can use a modified form of the IFGT to compute $K_2\xi$ in $O(N)$ time and $O(N)$ space. For this we need to compute sums of the form

$$G(y_j) = \sum_{i=1}^N q_i\,\frac{\|y_j - x_i\|^2}{h^2}\, e^{-\|y_j - x_i\|^2/h^2}. \qquad (26)$$

This differs from Equation 12 in that the Gaussian is multiplied by the polynomial $\|y_j - x_i\|^2/h^2$. The algorithm is a modification of the original IFGT to handle the modified kernel. The modification requires only a slight change in the implementation; however, the error bounds are different. The complete details of the algorithm can be found in our technical report.
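The sketch below evaluates the gradients of Equation (24), expressed in the log-hyperparameter space of Equation (25), using dense linear algebra. In the fast method the $\xi$ and matrix-vector terms come from CG + IFGT and the traces from the randomized estimator of Section 5.2; the dense version here is our own reference implementation, with all names ours.

```python
import numpy as np

def log_likelihood_gradients(X, y, sigma, a, h):
    """Gradients of l w.r.t. (ln sigma, ln a, ln h), Eqs. (23)-(25),
    computed densely in O(N^3) for illustration and gradient checking."""
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K1 = np.exp(-sq_dists / h**2)                  # [K1]_ij = exp(-||x_i - x_j||^2 / h^2)
    K2 = (2.0 * a / h**3) * sq_dists * K1          # [K2]_ij = d K_tilde / d h
    K_tilde = a * K1 + sigma**2 * np.eye(N)
    K_tilde_inv = np.linalg.inv(K_tilde)           # dense inverse; avoided by the fast method
    xi = K_tilde_inv @ y                           # xi = K_tilde^{-1} y

    dl_dsigma = -sigma * np.trace(K_tilde_inv) + sigma * (xi @ xi)
    dl_da = -0.5 * np.trace(K_tilde_inv @ K1) + 0.5 * xi @ (K1 @ xi)
    dl_dh = -0.5 * np.trace(K_tilde_inv @ K2) + 0.5 * xi @ (K2 @ xi)

    # Eq. (25): derivatives with respect to the log hyperparameters
    return np.array([sigma * dl_dsigma, a * dl_da, h * dl_dh])
```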

Fig. 5. The number of trials T (Equation 30) required to compute $\mathrm{tr}[\tilde K^{-1}]$ as a function of N [for d = 1, h = 0.5, and σ = 0.1] such that 90% of the time the relative error between the actual and the estimated value of the trace is less than 5%, i.e., $Pr[|\hat\alpha - \alpha| \le 0.05\,\alpha] \ge 0.9$.

5.2 Computing the trace

In order to compute $\mathrm{tr}[\tilde K^{-1} A]$ (where $A$ is either $I$ or $K_1$), we use the following randomized algorithm, as described in [MacKay 2003]. This reduces the computational cost of computing the trace from $O(N^3)$ to $O(N^2)$; coupled with the IFGT we can reduce the cost to $O(N)$. Define

$$\tau = d^T \tilde K^{-1} A\, d, \qquad (27)$$

where $d$ is a random vector whose elements are independent Gaussians with mean zero and unit variance. It can be seen that

$$E[\tau] = \mathrm{tr}(\tilde K^{-1} A), \qquad \mathrm{Var}[\tau] = 2\,\mathrm{tr}\!\left([\tilde K^{-1} A]^2\right). \qquad (28)$$

Hence we can estimate the value of $\alpha = \mathrm{tr}(\tilde K^{-1} A)$ by averaging over several different $\tau$'s:

$$\hat\alpha = \frac{1}{T}\sum_{i=1}^T \tau_i = \frac{1}{T}\sum_{i=1}^T d_i^T \tilde K^{-1} A\, d_i = \frac{1}{T}\sum_{i=1}^T (\tilde K^{-1} d_i)^T (A\, d_i). \qquad (29)$$

$\tilde K^{-1} d_i$ can be computed in $O(N)$ time using the conjugate gradient coupled with the IFGT, and $A d_i$ can be computed in $O(N)$ time using the (modified) IFGT. In practice the number of $\tau$'s needed to get a good estimate of the trace is small. More formally, using Chebyshev's inequality we can show that for any $0 < \delta < 1$ and $\epsilon > 0$, if

$$T \ge \frac{2\,\mathrm{tr}\!\left([\tilde K^{-1} A]^2\right)}{\delta\,\epsilon^2\left(\mathrm{tr}\left[\tilde K^{-1} A\right]\right)^2}, \qquad (30)$$

then $Pr[|\hat\alpha - \alpha| \ge \epsilon\alpha] \le \delta$. Experimentally we observed that the number of trials $T$ required is quite small, especially for large values of $N$ (see Figure 5).
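A minimal sketch of the randomized trace estimator of Equations (27)-(29). Here `solve_K_tilde` and `apply_A` stand for the CG+IFGT solver and the (modified) IFGT matvec respectively and are passed in as callables; the function name and the dense check in the comments are ours, intended only for verifying the estimator on small problems.

```python
import numpy as np

def randomized_trace(solve_K_tilde, apply_A, N, T, seed=0):
    """Estimate tr(K_tilde^{-1} A) as in Eq. (29):
        alpha_hat = (1/T) sum_i (K_tilde^{-1} d_i)^T (A d_i),
    with d_i standard normal probe vectors."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(T):
        d = rng.standard_normal(N)              # probe vector, E[d d^T] = I
        total += solve_K_tilde(d) @ apply_A(d)  # one tau_i of Eq. (27)
    return total / T

# Dense usage check on a small problem (illustrative only):
#   est = randomized_trace(lambda v: np.linalg.solve(K_tilde, v),
#                          lambda v: A @ v, N, T=50)
#   exact = np.trace(np.linalg.solve(K_tilde, A))
```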

Fig. 6. Scaling behavior for hyperparameter selection. (a) Total running time in seconds as a function of N for hyperparameter selection using the direct and the fast method. (b), (c), and (d) The hyperparameters σ, h, and a chosen by both methods.

Figure 6(a) shows the total running time in seconds as a function of $N$ for hyperparameter selection using the direct and the fast method. Figures 6(b), 6(c), and 6(d) show the corresponding hyperparameters chosen.

6. CONCLUSION

We used the improved fast Gauss transform to reduce the computational complexity of Gaussian process regression to linear $O(N)$. Substantial speedups were achieved on different datasets without any loss of accuracy. In this paper we presented a fast method for the Gaussian covariance function; however, various other covariance functions, like the Matérn class of kernels, can be used, and a similar algorithm based on a Taylor series expansion of the covariance function can be developed. The techniques presented here can also be used to speed up classification [Williams and Barber 1998; Kuss and Rasmussen 2005] using a Gaussian process model.

REFERENCES

Delve datasets. delve/data/datasets.html.

Csato, L. and Opper, M. 2002. Sparse on-line Gaussian processes. Neural Computation 14, 3.
Deng, K. and Moore, A. 1995. Multiresolution instance-based learning. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco.
Fine, S. and Scheinberg, K. 2001. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research 2.
Gray, A. G. and Moore, A. W. 2003. Nonparametric density estimation: Toward computational tractability. In SIAM International Conference on Data Mining.
Greengard, L. and Strain, J. 1991. The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing 12, 1.
Hestenes, M. R. and Stiefel, E. 1952. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49.
Kelley, C. T. 1995. Iterative Methods for Linear and Nonlinear Equations. SIAM.
Kuss, M. and Rasmussen, C. E. 2005. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research 6.
Lawrence, N., Seeger, M., and Herbrich, R. 2003. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems 15. MIT Press.
MacKay, D. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
MacKay, D. and Gibbs, M. N. 1997. Efficient implementation of Gaussian processes.
Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. J. 1998. UCI repository of machine learning databases. mlearn/MLRepository.html.
Rasmussen, C. E. and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. The MIT Press.
Raykar, V. C., Yang, C., Duraiswami, R., and Gumerov, N. 2005. Fast computation of sums of Gaussians in high dimensions. Tech. Rep. CS-TR-4767, Department of Computer Science, University of Maryland, College Park.
Seeger, M. 2004. Gaussian processes for machine learning. International Journal of Neural Systems 14, 2.
Shen, Y., Ng, A. Y., and Seeger, M. 2005. Fast Gaussian process regression using kd-trees. In Advances in Neural Information Processing Systems 18.
Simoncini, V. and Szyld, D. B. 2004. Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM Journal on Scientific Computing 25, 2.
Smola, A. and Bartlett, P. 2001. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems. MIT Press.
Tipping, M. 2001. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1.
Tresp, V. 2000. A Bayesian committee machine. Neural Computation 12, 11.
Williams, C. K. I. and Rasmussen, C. E. 1996. Gaussian processes for regression. In Advances in Neural Information Processing Systems 8.
Williams, C. K. I. and Seeger, M. 2000. The effect of the input density distribution on kernel-based classifiers. In International Conference on Machine Learning 17, P. Langley, Ed. Morgan Kaufmann.
Williams, C. K. I. and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems. MIT Press.
Williams, C. K. I. and Barber, D. 1998. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 12.
Yang, C., Duraiswami, R., and Davis, L. 2005. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems 17.


More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Learning Gaussian Process Kernels via Hierarchical Bayes

Learning Gaussian Process Kernels via Hierarchical Bayes Learning Gaussian Process Kernels via Hierarchical Bayes Anton Schwaighofer Fraunhofer FIRST Intelligent Data Analysis (IDA) Kekuléstrasse 7, 12489 Berlin anton@first.fhg.de Volker Tresp, Kai Yu Siemens

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment.

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1 Introduction and Problem Formulation In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1.1 Machine Learning under Covariate

More information

Kernel Conjugate Gradient

Kernel Conjugate Gradient Kernel Conjugate Gradient Nathan Ratliff Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 ndr@andrew.cmu.edu J. Andrew Bagnell Robotics Institute Carnegie Mellon University Pittsburgh,

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Least Squares SVM Regression

Least Squares SVM Regression Least Squares SVM Regression Consider changing SVM to LS SVM by making following modifications: min (w,e) ½ w 2 + ½C Σ e(i) 2 subject to d(i) (w T Φ( x(i))+ b) = e(i), i, and C>0. Note that e(i) is error

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS

SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS Filippo Portera and Alessandro Sperduti Dipartimento di Matematica Pura ed Applicata Universit a di Padova, Padova, Italy {portera,sperduti}@math.unipd.it

More information

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II 1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector

More information

Learning to Learn and Collaborative Filtering

Learning to Learn and Collaborative Filtering Appearing in NIPS 2005 workshop Inductive Transfer: Canada, December, 2005. 10 Years Later, Whistler, Learning to Learn and Collaborative Filtering Kai Yu, Volker Tresp Siemens AG, 81739 Munich, Germany

More information

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method

More information

Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics

Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics 1 / 38 Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics Chris Williams with Kian Ming A. Chai, Stefan Klanke, Sethu Vijayakumar December 2009 Motivation 2 / 38 Examples

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS VIKAS CHANDRAKANT RAYKAR DECEMBER 5, 24 Abstract. We interpret spectral clustering algorithms in the light of unsupervised

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose

More information

Improved Fast Gauss Transform. Fast Gauss Transform (FGT)

Improved Fast Gauss Transform. Fast Gauss Transform (FGT) 10/11/011 Improved Fast Gauss Transform Based on work by Changjiang Yang (00), Vikas Raykar (005), and Vlad Morariu (007) Fast Gauss Transform (FGT) Originally proposed by Greengard and Strain (1991) to

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (6-83, F) Lecture# (Monday November ) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan Applications of Gaussian Processes (a) Inverse Kinematics

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe

More information