Fast large scale Gaussian process regression using the improved fast Gauss transform


VIKAS CHANDRAKANT RAYKAR and RAMANI DURAISWAMI
Perceptual Interfaces and Reality Laboratory, Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD

Gaussian processes allow the treatment of non-linear non-parametric regression problems in a Bayesian framework. However, the computational cost of training such a model with $N$ examples scales as $O(N^3)$. Iterative methods for the solution of linear systems can bring this cost down to $O(N^2)$, which is still prohibitive for large data sets. In this paper we use an $\epsilon$-exact approximation technique, the improved fast Gauss transform, to reduce the computational complexity to $O(N)$ for the squared exponential covariance function. Using the theory of inexact Krylov subspace methods we show how to adaptively increase $\epsilon$ at each iteration. For prediction at $M$ points the computational complexity is reduced from $O(NM)$ to $O(M + N)$. We demonstrate the speedup achieved on large synthetic and real data sets. We also show how the hyperparameters can be chosen in $O(N)$ time. Unlike methods which rely on selecting a subset of the data set, we use all the available data and still obtain substantial speedups.

Technical report CS-TR-xxxx/UMIACS-TR-2005-xx, April 8, 2006.

Contents

1 Introduction
  1.1 Novel contributions
  1.2 Relation to previous work
2 The Gaussian process model
3 Computational and space complexity
  3.1 Training
    3.1.1 Conjugate gradient method
    3.1.2 $\epsilon$-exact matrix-vector multiplication
    3.1.3 Adaptively choosing $\epsilon$
  3.2 Mean prediction
  3.3 Predictive variances
4 Regression experiments
5 Choosing the hyperparameters
  5.1 Derivatives
  5.2 Computing the trace
6 Conclusion

1. INTRODUCTION

A crucial task in supervised learning is regression, where a predictive relationship between the inputs and outputs must be found. Given a set of $N$ i.i.d. training examples $\mathcal{T} = \{x_i \in \mathbb{R}^d,\, y_i \in \mathbb{R}\}_{i=1}^N$ drawn from an unknown distribution, the task of regression is to predict the output $f_* = f(x_*)$ for a new input value $x_* \in \mathbb{R}^d$. The function $f : \mathbb{R}^d \to \mathbb{R}$ is called the regression function. The parametric approach assumes a functional form for $f$ and then estimates the unknown parameters. However, unless the form of the function is known a priori, assuming a certain form very often leads to erroneous inference. On the other hand, nonparametric methods do not make any assumptions on the form of the underlying function. The Gaussian process model [Rasmussen and Williams 2006; Seeger 2004] provides a powerful framework to handle nonparametric regression problems in the Bayesian paradigm. The regression function is represented by an ensemble of functions, on which we place a Gaussian prior. This prior is updated in the light of the training data. As a result we obtain predictions together with valid estimates of uncertainty (see Figure 1 for an illustration).

The computational complexity of training a Gaussian process model with $N$ training examples scales as $O(N^3)$. The space complexity is $O(N^2)$. This makes training prohibitively expensive even for moderately sized datasets (typically a few thousand points). The core computational bottleneck is the inversion of the $N \times N$ Gram matrix. Iterative methods like conjugate gradient can reduce the computational complexity to $O(N^2)$. However, quadratic complexity is still prohibitive for large datasets. The dominant $O(N^2)$ cost in the conjugate-gradient procedure is due to the multiplication of the Gram matrix with a vector.

A Gaussian process is completely specified by its mean and covariance functions. Different forms of the covariance function give us the flexibility to model different kinds of generative processes. One of the most popular covariance functions is the negative squared exponential (Gaussian). For the Gaussian covariance function we accelerate the matrix-vector multiplication using the improved fast Gauss transform (IFGT) [Raykar et al. 2005; Yang et al. 2005]. This reduces the computational and storage costs to $O(N)$. The algorithm is $\epsilon$-exact in the sense that the constant hidden in $O(N)$ depends on the desired accuracy, which can be arbitrary. In fact, for machine precision accuracy there is no difference between the direct and the fast methods.

1.1 Novel contributions

The following are the novel contributions of this paper.

- Training time for Gaussian process regression is reduced to linear $O(N)$ by using the conjugate-gradient method coupled with the IFGT.
- The prediction time per test input is reduced to $O(1)$ and the time to compute the variance is reduced to $O(N)$.
- Using results from the theory of inexact Krylov subspace methods we show that the matrix-vector product may be performed in an increasingly inexact manner as the iteration progresses and still allow convergence to the solution.
- We also show how the hyperparameters can be chosen in $O(N)$ time. The dominant computation involves computing sums of a polynomial times a Gaussian, for which we develop an $\epsilon$-exact approximation algorithm based on ideas similar to the IFGT. The trace computation is done using a randomized algorithm.

1.2 Relation to previous work

Various schemes have been proposed to make Gaussian process regression computationally tractable [Williams and Seeger 2001; Smola and Bartlett 2001; Fine and Scheinberg 2001; Lawrence et al. 2003; Csato and Opper 2002; Tresp 2000; Tipping 2001]. A good review can be found in Chapter 8 of [Rasmussen and Williams 2006]. Most of the proposed schemes are based on using a representative subset of the training examples of size $m \ll N$. The training time is generally $O(m^2 N)$. The various schemes differ in how to effectively choose the subset. Unlike methods which rely on choosing a subset of the dataset, we use all the available points and still achieve $O(N)$ complexity. Also, in these methods there is no guarantee on the approximation of the kernel matrix in a deterministic sense. It should be noted that most of the above schemes can be further accelerated to $O(mN)$ using the proposed method. An approach related to our proposed method is that of [Shen et al. 2005], who use kd-trees to speed up the matrix-vector multiplication. The IFGT gives good speedup for large bandwidth kernels. For small bandwidths, dual-tree methods [Deng and Moore 1995; Gray and Moore 2003] can be used.

2. THE GAUSSIAN PROCESS MODEL

The simplest and most often used model for regression [Williams and Rasmussen 1996] is $y = f(x) + \varepsilon$, where $f(x)$ is a zero-mean Gaussian process with covariance function $K(x, x') : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and $\varepsilon$ is independent zero-mean normally distributed noise with variance $\sigma^2$, i.e., $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Therefore the observation process $y(x)$ is a zero-mean Gaussian process with covariance function $K(x, x') + \sigma^2\delta(x, x')$. Given the training data $\mathcal{T} = \{x_i, y_i\}_{i=1}^N$, the $N \times N$ covariance matrix $K$ is defined as $[K]_{ij} = K(x_i, x_j)$. If we define the vector $y = [y_1, \ldots, y_N]^T$, then $y$ is a zero-mean multivariate Gaussian with covariance matrix $K + \sigma^2 I$.

Given the training data $\mathcal{T}$ and a new input $x_*$, our task is to compute the posterior $p(f_* \mid x_*, \mathcal{T})$. Observing that the joint density $p(f_*, y)$ is a multivariate Gaussian, the posterior density $p(f_* \mid x_*, \mathcal{T})$ can be shown to be [Rasmussen and Williams 2006]

$$p(f_* \mid x_*, \mathcal{T}) \sim \mathcal{N}\!\left( k(x_*)^T (K + \sigma^2 I)^{-1} y,\;\; K(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*) \right),$$

where $k(x_*) = [K(x_*, x_1), \ldots, K(x_*, x_N)]^T$. If we define

$$\xi = (K + \sigma^2 I)^{-1} y, \qquad (1)$$

then the mean prediction is

$$E[f_*] = k(x_*)^T \xi, \qquad (2)$$

and the variance associated with the prediction is

$$\mathrm{Var}[f_*] = K(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*). \qquad (3)$$

The covariance function has to be chosen to reflect the prior information we have about the nature of the problem. For high-dimensional problems, in the absence of any prior knowledge, the isotropic negative squared exponential (Gaussian) is the most widely used covariance function.
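To make Equations (1)-(3) concrete, here is a minimal NumPy sketch of the direct posterior computation for a single test point. The function and argument names are ours, not from the paper, and a Cholesky solve is used as the $O(N^3)$ baseline that the fast machinery developed below replaces.

```python
import numpy as np

def gp_posterior(K, k_star, k_star_star, y, sigma):
    """Exact GP posterior for one test point, following Eqs. (1)-(3).

    K           : (N, N) covariance matrix of the training inputs
    k_star      : (N,)   vector [K(x_*, x_1), ..., K(x_*, x_N)]
    k_star_star : scalar K(x_*, x_*)
    y           : (N,)   training targets
    sigma       : noise standard deviation
    """
    N = K.shape[0]
    K_tilde = K + sigma**2 * np.eye(N)                    # K + sigma^2 I
    L = np.linalg.cholesky(K_tilde)                       # direct O(N^3) factorization
    xi = np.linalg.solve(L.T, np.linalg.solve(L, y))      # xi = (K + sigma^2 I)^{-1} y, Eq. (1)
    v = np.linalg.solve(L.T, np.linalg.solve(L, k_star))  # (K + sigma^2 I)^{-1} k(x_*)
    mean = k_star @ xi                                    # Eq. (2)
    var = k_star_star - k_star @ v                        # Eq. (3)
    return mean, var
```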

Fig. 1. (a) The mean prediction and the error bars obtained when a Gaussian process was used to model the data shown by the black points. A squared exponential covariance function was used. Note that the error bars increase where the data is sparse. The hyperparameters $h$ and $\sigma$ were chosen by minimizing the negative log-likelihood of the training data, the contours of which are shown in (b).

The covariance function which we use in this paper is of the form

$$K(x, x') = a\, e^{-\|x - x'\|^2/h^2}, \qquad (4)$$

where $h$ is the characteristic length scale parameter of the process. This covariance function reflects the fact that nearby inputs will have highly correlated outputs. The parameters $h$, $\sigma$, and $a$ are referred to as the hyperparameters.

3. COMPUTATIONAL AND SPACE COMPLEXITY

In this section we discuss the computational complexity of the different phases and show how it can be reduced using the conjugate-gradient method coupled with the IFGT.

3.1 Training

Given the hyperparameters $h$, $a$, and $\sigma$, the training phase consists of the evaluation of the vector $\xi = (K + \sigma^2 I)^{-1} y$, which needs the inversion of the $N \times N$ matrix $K + \sigma^2 I$. Direct computation of the inverse of a matrix (using LU decomposition or Gauss-Jordan elimination) requires $O(N^3)$ operations and $O(N^2)$ storage¹, which is impractical even for problems of moderate size (typically a few thousand points).

3.1.1 Conjugate gradient method. An effective alternative is to solve the following large scale linear system using iterative methods:

$$(K + \sigma^2 I)\xi = y. \qquad (5)$$

Since the matrix $\tilde K = K + \sigma^2 I$ is symmetric and positive definite we can use the well known conjugate-gradient method [Hestenes and Stiefel 1952] to solve the linear system $\tilde K \xi = y$.

¹ A numerically more stable and faster method is to compute the inverse using the Cholesky factorization [Rasmussen and Williams 2006], which takes $N^3/6$ time.
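As a concrete illustration of Equations (4) and (5), the sketch below builds the squared exponential Gram matrix and solves the jittered linear system for $\xi$ with a dense solve. The helper names and the toy data are ours; the dense solve is shown only as the baseline that the conjugate-gradient and IFGT machinery avoids.

```python
import numpy as np

def sq_exp_gram(X, a, h):
    """Gram matrix [K]_ij = a * exp(-||x_i - x_j||^2 / h^2), Eq. (4)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return a * np.exp(-sq_dists / h**2)

def train_direct(X, y, a, h, sigma):
    """Solve (K + sigma^2 I) xi = y, Eq. (5), by a dense O(N^3) solve."""
    N = X.shape[0]
    K_tilde = sq_exp_gram(X, a, h) + sigma**2 * np.eye(N)
    return np.linalg.solve(K_tilde, y)

# Illustrative usage on toy 1D data (values chosen for illustration only)
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 10.0, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(200)
    xi = train_direct(X, y, a=1.0, h=0.4, sigma=0.2)
```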

The method relies on the fact that the solution to the system of equations $\tilde K \xi = y$ minimizes the quadratic form $g(\xi) = \frac{1}{2}\xi^T \tilde K \xi - y^T \xi$. A good exposition of the method can be found in Chapter 2 of [Kelley 1995]. The idea of using conjugate gradient for Gaussian processes was first suggested by [MacKay and Gibbs 1997]. The iterative method generates a sequence of approximate solutions $\xi_k$ which converge to the true solution $\xi$. One of the sharpest known results for the convergence of the iterates is

$$\frac{\|\xi - \xi_k\|_{\tilde K}}{\|\xi - \xi_0\|_{\tilde K}} \le 2\left[\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right]^{2k}, \qquad (6)$$

where the $\tilde K$-norm of $w$ is defined as $\|w\|_{\tilde K} = \sqrt{w^T \tilde K w}$ [Kelley 1995]. The constant $\kappa = \lambda_{\max}/\lambda_{\min}$, the ratio of the largest to the smallest eigenvalue, is called the spectral condition number of the matrix $\tilde K$. Since $\kappa \in (1, \infty)$, Equation 6 implies that if the condition number of $\tilde K$ is close to one, then the iterates will converge very quickly. Given a tolerance parameter $0 < \eta < 1$, a practical conjugate-gradient scheme iterates until it computes a vector $\xi_k$ such that

$$\frac{\|y - \tilde K \xi_k\|_2}{\|y - \tilde K \xi_0\|_2} \le \eta, \qquad (7)$$

where $\|y - \tilde K \xi_k\|_2$ is the residual in the Euclidean norm at the end of the $k$th iteration. Most implementations start the iteration at $\xi_0 = 0$. The relative residual in the Euclidean norm is related to the relative error in the $\tilde K$-norm as follows [Kelley 1995]:

$$\frac{\|y - \tilde K \xi_k\|_2}{\|y - \tilde K \xi_0\|_2} \le \sqrt{\kappa}\,\frac{\|\xi - \xi_k\|_{\tilde K}}{\|\xi - \xi_0\|_{\tilde K}} \le 2\sqrt{\kappa}\left[\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right]^{2k}. \qquad (8)$$

This implies that for a given tolerance parameter $\eta$ the number of iterations required is

$$k \ge \frac{\ln\left[2\sqrt{\kappa}/\eta\right]}{2\ln\left[\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right]}. \qquad (9)$$

Sometimes the estimate 9 can be very pessimistic. Even if the condition number is large, the convergence is fast if the eigenvalues are clustered in a few small intervals [Kelley 1995].

The actual implementation of the conjugate gradient method requires one matrix-vector multiplication and $5N$ flops per iteration. Four vectors of length $N$ are required for storage. Hence the computational cost of the conjugate-gradient method is dominated by the matrix-vector product, which is $O(kN^2)$, where $k$ is the number of iterations required. The storage is $O(N)$ since the matrix-vector multiplication for our matrix $\tilde K$ can be computed without explicitly storing the entire matrix. For the $O(kN^2)$ cost the conjugate gradient should be scalable, i.e., the number of iterations $k$ should tend to a constant value as $N \to \infty$ and not grow as a function of $N$. From Equation 9 it can be seen that the number of iterations depends on the spectral condition number $\kappa$ of the matrix $\tilde K$.
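The key practical point of the discussion above is that CG touches $\tilde K$ only through matrix-vector products, so any fast matvec can be dropped in. The following minimal sketch (our own, not the paper's MATLAB/C++ implementation) takes the matvec as a callable and stops on the relative-residual criterion of Equation (7).

```python
import numpy as np

def conjugate_gradient(matvec, y, eta=1e-3, max_iter=1000):
    """Solve K_tilde xi = y given only a function computing K_tilde @ v.

    Stops when ||y - K_tilde xi_k|| / ||y - K_tilde xi_0|| <= eta (Eq. 7),
    with xi_0 = 0 so the initial residual is simply y.
    """
    xi = np.zeros_like(y)
    r = y.copy()                 # residual r_0 = y - K_tilde xi_0 = y
    p = r.copy()                 # initial search direction
    r0_norm = np.linalg.norm(r)
    rs_old = r @ r
    for _ in range(max_iter):
        Kp = matvec(p)           # the only O(N^2) step (O(N) with the IFGT)
        alpha = rs_old / (p @ Kp)
        xi += alpha * p
        r -= alpha * Kp
        if np.linalg.norm(r) / r0_norm <= eta:
            break
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return xi
```

With `matvec = lambda v: K @ v + sigma**2 * v` this reproduces the exact $O(N^2)$-per-iteration solver; replacing the `K @ v` part with an $\epsilon$-exact fast Gauss transform gives the $O(N)$ variant described next.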

If $\lambda$ is an eigenvalue of $K$, then $\lambda + \sigma^2$ is the corresponding eigenvalue of $K + \sigma^2 I$. The $\sigma^2$ term, also referred to as the jitter, helps to reduce the condition number of the matrix $\tilde K$: if $\lambda_{\min}$ is very small (resulting in a large condition number for $K$), the condition number of $K + \sigma^2 I$, i.e., $(\lambda_{\max} + \sigma^2)/(\lambda_{\min} + \sigma^2)$, will be small. In general, even when the estimate 9 is pessimistic, the convergence is decided by the eigenspectrum of the Gram matrix $K$. The matrix eigenvalue problem $K \hat\phi_i = \hat\lambda_i \hat\phi_i$ is the discrete approximation of the following continuous eigenvalue problem:

$$\int K(x, y)\, p(x)\, \phi_i(x)\, dx = \lambda_i \phi_i(y), \qquad (10)$$

where the weight function $p(x)$ is the underlying density function according to which $x$ is sampled and $\phi_i(x)$ is the $i$th eigenfunction with eigenvalue $\lambda_i$. A consequence of this is that the eigenvalues of the Gram matrix $K$ converge to the eigenvalues of the integral equation above as $N$ increases [Williams and Seeger 2000]. More specifically, the theory of the numerical solution of eigenvalue problems shows that $\hat\lambda_i/N$ will converge to $\lambda_i$ in the limit $N \to \infty$. This guarantees that the condition number $\kappa$, and in turn the number of iterations, converges to a constant value [also see Figure 4(a) for empirical results].

3.1.2 $\epsilon$-exact matrix-vector multiplication. The quadratic computational complexity is still too high for large datasets. The core computational step in each conjugate-gradient iteration involves the multiplication of the matrix $K$ with a vector, say $q$. The $j$th element of the matrix-vector product $Kq$ can be written as

$$(Kq)_j = \sum_{i=1}^N a\, q_i\, e^{-\|x_i - x_j\|^2/h^2}. \qquad (11)$$

In general, for each target point $\{y_j \in \mathbb{R}^d\}_{j=1,\ldots,M}$ (which in our case are the same as the source points $x_i$) this can be written as

$$G(y_j) = \sum_{i=1}^N q_i\, e^{-\|y_j - x_i\|^2/h^2}. \qquad (12)$$

Eq. 12 is referred to as the discrete Gauss transform in the scientific computing literature, where $\{q_i \in \mathbb{R}\}_{i=1,\ldots,N}$ are referred to as the source weights, $\{x_i \in \mathbb{R}^d\}_{i=1,\ldots,N}$ are the source points, i.e., the centers of the Gaussians, and $h \in \mathbb{R}^+$ is the source scale or bandwidth. In other words, $G(y_j)$ is the total contribution at $y_j$ of $N$ Gaussians centered at the $x_i$, each with bandwidth $h$ and weighted by $q_i$. The computational complexity of evaluating the discrete Gauss transform at $M$ target points is $O(MN)$.

The fast Gauss transform (FGT) is an $\epsilon$-exact approximation algorithm that reduces the computational complexity to $O(M + N)$, at the expense of reduced precision, which however can be arbitrary. The constant depends on the desired precision, the dimensionality of the problem, and the bandwidth. Given any $\epsilon > 0$, it computes an approximation $\hat G(y_j)$ to $G(y_j)$ such that the maximum absolute error relative to the total weight $Q = \sum_{i=1}^N |q_i|$ is upper bounded by $\epsilon$, i.e.,

$$\max_{y_j}\left[\frac{|\hat G(y_j) - G(y_j)|}{Q}\right] \le \epsilon. \qquad (13)$$
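For reference, here is the direct $O(MN)$ evaluation of the discrete Gauss transform in Equation (12); the IFGT computes an $\epsilon$-exact approximation of exactly this quantity in $O(M + N)$. This brute-force version is our own sketch, useful for checking the error criterion of Equation (13) on small problems.

```python
import numpy as np

def gauss_transform_direct(X, q, Y, h):
    """Direct discrete Gauss transform, Eq. (12):
        G(y_j) = sum_i q_i * exp(-||y_j - x_i||^2 / h^2),
    evaluated at all M targets Y from N weighted sources (X, q). O(MN) cost.
    """
    sq_dists = np.sum((Y[:, None, :] - X[None, :, :])**2, axis=-1)  # (M, N)
    return np.exp(-sq_dists / h**2) @ q

def gram_matvec_direct(X, q, a, h):
    """Matrix-vector product of Eq. (11): the special case Y = X with weights a*q."""
    return gauss_transform_direct(X, a * q, X, h)
```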

Table I. The dominant computational and space complexities for the different stages of Gaussian process regression using different methods. We have N training points and M test points.

                    Direct method          Conjugate gradient     Conjugate gradient + IFGT
                    Time       Space       Time       Space       Time       Space
Training phase      O(N^3)     O(N^2)      O(N^2)     O(N)        O(N)       O(N)
Mean prediction     O(MN)      O(M+N)      O(MN)      O(M+N)      O(M+N)     O(M+N)
Uncertainty         O(MN^2)    O(M+N)      O(MN^2)    O(M+N)      O(MN)      O(M+N)
Hyperparameters     O(N^3)     O(N^2)      O(N^2)     O(N)        O(N)       O(N)

The FGT was first proposed in [Greengard and Strain 1991] and applied successfully to a few lower dimensional applications in mathematics and physics. However, the performance degrades exponentially with increasing dimensionality, which makes it impractical for dimensions greater than three. A closely related algorithm (the improved fast Gauss transform, or IFGT) that is suitable for higher dimensional problems was presented in [Yang et al. 2005; Raykar et al. 2005]; it reduces the constant factor to asymptotically polynomial order. The reduction is achieved using a multivariate Taylor expansion scheme combined with efficient space subdivision using the k-center algorithm. More details of the algorithm can be found in [Raykar et al. 2005]. The IFGT, when coupled with the conjugate-gradient procedure, reduces the computational complexity to $O(N)$.

3.1.3 Adaptively choosing $\epsilon$. When the IFGT is coupled with the conjugate gradient, the question arises as to how to choose $\epsilon$ at each iteration. $\epsilon$ can be set to a convenient small value such as $10^{-3}$ or $10^{-6}$ based on the application. For a more theoretical flavor we can analyze the effect of the approximation on the conjugate gradient method. The conjugate gradient method is a Krylov subspace method adapted for symmetric positive definite matrices. Krylov subspace methods at the $k$th iteration compute an approximation to the solution of $Ax = b$ by minimizing some measure of error over the affine space $x_0 + \mathcal{K}_k$, where $x_0$ is the initial iterate and the $k$th Krylov subspace is $\mathcal{K}_k = \mathrm{span}(r_0, Ar_0, A^2 r_0, \ldots, A^{k-1} r_0)$. The residual at the $k$th iterate is $r_k = b - Ax_k$. A general framework for understanding the effect of an approximate matrix-vector product on Krylov subspace methods for the solution of symmetric and nonsymmetric linear systems of equations is given in [Simoncini and Szyld 2004]. The paper considers the case where at the $k$th iteration, instead of the exact matrix-vector multiplication $Av_k$, the product

$$\widehat{Av_k} = (A + E_k)v_k \qquad (14)$$

is computed, where $E_k$ is an error matrix which can change at every iteration.

A neat result in the paper shows how large $E_k$ can be at each step while still achieving convergence to the desired tolerance. Let $r_k$ be the residual at the end of the $k$th iteration and let $\tilde r_k$ be the corresponding residual when the approximate matrix-vector product is used. If at every iteration

$$\|E_k\| \le l_m\, \frac{1}{\|\tilde r_{k-1}\|}\, \delta, \qquad (15)$$

then at the end of $k$ iterations $\|r_k - \tilde r_k\| \le \delta$ [Simoncini and Szyld 2004]. The term $l_m$ is in general unavailable since it depends on the spectrum of the matrix. However, our empirical results, and also some experiments in [Simoncini and Szyld 2004], suggest that $l_m = 1$ is a reasonable value. This shows that the matrix-vector product may be performed in an increasingly inexact manner as the iteration progresses and still allow convergence to the solution.

For our problem, because of the $\epsilon$-exact approximation criterion (Equation 13), every element in the approximation to the vector $Kq$ is within $\pm Q\epsilon_k$ of the true value, where $Q = \sum_{i=1}^N |q_i|$ and $\epsilon_k$ is the error tolerance for the IFGT at the $k$th iteration. Hence the error matrix $E_k$ is of the form

$$E_k = \epsilon_k \begin{bmatrix} \pm e_{11} & \cdots & \pm e_{1N} \\ \vdots & & \vdots \\ \pm e_{N1} & \cdots & \pm e_{NN} \end{bmatrix}, \qquad (16)$$

where $e_{ij} = \mathrm{sign}(q_j) \in \{+1, -1\}$. It can be seen that $\|E_k\| = N\epsilon_k$. Hence Equation 15 suggests the following strategy to choose $\epsilon_k$:

$$\epsilon_k \le \frac{\delta}{N}\, \frac{\|y - \tilde K\xi_0\|_2}{\|\tilde r_{k-1}\|_2}. \qquad (17)$$

This guarantees that

$$\frac{\|y - \tilde K\xi_k\|_2}{\|y - \tilde K\xi_0\|_2} \le \eta + \delta. \qquad (18)$$

Figure 2 shows the $\epsilon_k$ selected at each iteration for the 1D regression problem discussed in Section 4. As the iteration progresses, the required $\epsilon_k$ increases; the larger the $\epsilon_k$, the better the speedup achieved by the IFGT. In practice we update $\epsilon_k$ only when the residual decreases below the current minimum residual (see the sketch following Section 3.2 below). When the IFGT is coupled with the conjugate gradient there will be an increase in the number of iterations required because of the $\delta$ term [also see Figure 4(b) for empirical results].

3.2 Mean prediction

Once $\xi$ is computed, for any new $x_*$ the mean prediction is given by

$$E[f_*] = k(x_*)^T\xi = \sum_{i=1}^N \xi_i\, K(x_i, x_*). \qquad (19)$$

Predicting at $M$ points is again a matrix-vector multiplication operation. Direct computation of $E[f_*]$ at $M$ test points due to the $N$ training examples is $O(NM)$. Using the IFGT this computational cost can be reduced to $O(N + M)$.
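Putting the CG loop of Section 3.1.1 together with the adaptive tolerance rule of Equation (17) gives the following sketch. This is our own illustration, not the paper's implementation: `ifgt_matvec` is a placeholder for an $\epsilon$-exact IFGT routine returning an approximation of $Kv$ with maximum absolute error at most $\epsilon\sum_i |v_i|$ per element, and the cap `eps_max` is an arbitrary practical safeguard.

```python
import numpy as np

def inexact_cg(ifgt_matvec, y, sigma, eta=1e-3, delta=1e-3,
               eps_max=1e-1, max_iter=1000):
    """CG with an epsilon-exact matvec whose tolerance follows Eq. (17):
        eps_k <= (delta / N) * ||y - K_tilde xi_0|| / ||r_{k-1}||.
    ifgt_matvec(v, eps) approximates K @ v; the sigma^2 * v term is exact.
    """
    N = y.shape[0]
    xi = np.zeros_like(y)
    r = y.copy()                     # xi_0 = 0, so r_0 = y
    p = r.copy()
    r0_norm = np.linalg.norm(r)
    rs_old = r @ r
    for _ in range(max_iter):
        # adaptive IFGT tolerance for this iteration, Eq. (17)
        eps_k = min(eps_max, (delta / N) * r0_norm / np.linalg.norm(r))
        Kp = ifgt_matvec(p, eps_k) + sigma**2 * p
        alpha = rs_old / (p @ Kp)
        xi += alpha * p
        r -= alpha * Kp
        if np.linalg.norm(r) / r0_norm <= eta:   # overall tolerance eta (+ delta), Eq. (18)
            break
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return xi
```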

Fig. 2. The IFGT error tolerance $\epsilon_k$ selected at each iteration for a sample 1D regression problem. The error tolerance for the CG was set to $\eta = 10^{-3}$.

3.3 Predictive variances

The variance for each prediction is given by

$$\mathrm{Var}[f_*] = K(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*). \qquad (20)$$

With direct methods we first need to compute the inverse of the matrix during the training phase. Once the inverse is computed, for each $x_*$ the computation of the uncertainty is $O(N^2)$; for $M$ points it is $O(MN^2)$. Using the conjugate gradient method and the IFGT we can compute $\tilde K^{-1} k(x_*)$ in $O(N)$ time, so for $M$ points we need $O(MN)$ time. Table I compares the computational and space complexities for the different stages of Gaussian process regression using the different methods.

4. REGRESSION EXPERIMENTS

The algorithms were programmed in MATLAB, with the core computational tasks of matrix-vector multiplication and the IFGT written in C++. The experiments were run on a 1.6 GHz Pentium M processor with 512 MB of RAM.

In order to demonstrate the scaling behavior with respect to the number of samples we generated a synthetic one dimensional regression problem. The training data were generated according to the model $y = \sin x + \varepsilon$, where $\varepsilon$ is independent zero-mean normally distributed noise with variance $\sigma^2$. Figure 3(a) shows the total running time in seconds as a function of $N$ for training using the direct method (Cholesky decomposition), conjugate gradient (CG), and the CG coupled with the IFGT. The error tolerance for CG termination was set to $10^{-3}$ and the $\epsilon_k$ were adaptively chosen at each iteration as discussed earlier. The hyperparameters were $h = 0.4$, $\sigma = 0.2$, and $a = 1.0$. The training time using the direct method scales as $O(N^3)$, using the CG as $O(N^2)$, and using the IFGT it is linear, $O(N)$. Because of the limited memory we were not able to run the direct inversion beyond a certain number of data points. Figure 3(b) shows the average mean square training error as a function of $N$ for the different methods; for all the methods the training errors are almost equal.

Figure 3(c) shows the total running time in seconds as a function of $N$ for prediction using the direct method and the IFGT. The test data were generated without any noise. The model was trained on $N$ training samples and tested on $M = N$ samples.

Fig. 3. Scaling behavior for a one dimensional Gaussian process regression problem. (a) Total running time in seconds as a function of N for training using the direct method (Cholesky decomposition), conjugate gradient (CG), and the CG with the IFGT. (b) The average mean square training error. (c) Total running time in seconds for prediction using the direct method and using the IFGT. The model was trained on N training samples and tested on M = N samples. (d) The average mean square test error.

For the direct method the prediction time scales as $O(N^2)$, while for the IFGT it is $O(N)$. Figure 3(d) shows the average mean square test error.

An important aspect to be verified is that the number of iterations needed for the CG procedure tends to a constant as $N$ increases. Figure 4(a) shows the number of iterations needed by CG as a function of $N$ for the same regression problem as discussed above. It can be seen that as $N$ increases the number of iterations reaches a constant value. Figure 4(b) shows the same for CG coupled with the IFGT. The error tolerance for CG termination was set to $10^{-3}$ and the $\delta$ for the IFGT was varied. For the CG coupled with the IFGT there is an increase in the number of iterations; the smaller the value of $\delta$, the closer the number of iterations approaches that required by the exact CG.

In the second experiment, we used four publicly available machine learning datasets².

Fig. 4. Scalability of the conjugate gradient method. (a) The number of iterations taken by the conjugate gradient as a function of N for two different termination tolerances ($\eta = 10^{-3}$ and $10^{-6}$). It can be seen that as N increases the number of iterations reaches a constant value. (b) The number of iterations required by the conjugate gradient coupled with the IFGT for two different values of $\delta$. The $\epsilon_k$ were adaptively chosen.

Table II. Speedup and accuracy for large datasets. Ten-fold training and testing accuracy and the training time in seconds for four different datasets. The bandwidth of the Gaussian covariance function was h = 0.5d. The error tolerance for CG termination was set to $10^{-3}$. The results shown are averaged over all ten folds. The direct inversion could not be run on the large datasets because of limited RAM. The inputs were scaled to lie in a unit hypercube. The outputs were centered to have zero mean. [Columns: dataset size and dimension, and for each of the direct method, CG, and CG+IFGT the training time in seconds and the training and testing errors. Rows: Abalone, comp-activ, pumadyn, census-house.]

Table II shows the results of the ten-fold cross validation experiment on all four datasets. The running times and the root mean squared errors shown for both training and testing are averaged over all ten folds. We used $h = 0.5d$ and $\sigma = 0.1$ for all the experiments. The direct inversion method could not be run for the larger datasets because of limited RAM, since it has a storage complexity of $O(N^2)$. It can be seen that the IFGT accelerated method gives a significant speedup. Also, the training error and the prediction error are almost the same for all the methods.

5. CHOOSING THE HYPERPARAMETERS

The parameters $\theta = (\theta_1, \theta_2, \theta_3)^T = (\sigma, a, h)^T$ are referred to as the hyperparameters. Various methods have been proposed to choose the hyperparameters. In this section we show how the method based on the likelihood can be computed in $O(N)$. For the Gaussian noise model the hyperparameters can be chosen based on the training data. Given a covariance function, the marginal log likelihood of the training data is given by

$$l = \log p(\mathcal{T} \mid \theta) = -\frac{1}{2} y^T \tilde K^{-1} y - \frac{1}{2}\log|\tilde K| - \frac{N}{2}\log 2\pi, \qquad (21)$$

where $\tilde K = K + \sigma^2 I$. Optimizing this with respect to $\theta$ we can obtain the maximum likelihood estimate of the hyperparameters³ (see Figure 1(b)). Initializing the hyperparameters to reasonable random values, we can use an iterative method like nonlinear conjugate gradient⁴ to search for the optimal hyperparameters. Since the number of hyperparameters is small (three in our case), a small number of iterations is sufficient for convergence.

5.1 Derivatives

Each iteration of the conjugate-gradient method requires only the computation of the derivatives and not the function itself. The partial derivatives of $l$ with respect to $\theta$ can be expressed analytically as follows:

$$\frac{\partial l}{\partial \theta_k} = \frac{1}{2} y^T \tilde K^{-1} \frac{\partial \tilde K}{\partial \theta_k} \tilde K^{-1} y - \frac{1}{2}\mathrm{tr}\left[\tilde K^{-1}\frac{\partial \tilde K}{\partial \theta_k}\right]. \qquad (22)$$

² abalone [Newman et al. 1998]: the task is to predict the age of abalone (number of rings) from physical measurements. comp-activ [Delve]: this database contains various performance measures of a multi-processor, multi-user computer system; the task is to predict the portion of time that the CPUs run in user mode. pumadyn [Delve]: a realistic simulation of the dynamics of a Puma 560 robot arm; the task is to predict the angular acceleration of one of the robot arm's links, with inputs including angular positions, velocities, and torques of the robot arm. census-house [Delve]: this dataset was constructed from the 1990 US Census; the task is to predict the median price of a house in a small survey region.
³ It should be noted that $l$ can be multimodal. A sensible starting point or multiple random restarts should alleviate this problem.
⁴ We use the Polak-Ribière formula, which often converges more quickly than the Fletcher-Reeves method. Line search is performed using the secant method, which does not need the computation of second derivatives.
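For reference, here is a direct, small-N evaluation of the marginal log likelihood of Equation (21), which is the objective the hyperparameter search maximizes. The fast method of this section never forms $\tilde K^{-1}$ or the determinant explicitly; this dense version is our own sketch, useful only for checking the fast computations on toy problems.

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma, a, h):
    """Direct O(N^3) evaluation of Eq. (21):
        l = -1/2 y^T K_tilde^{-1} y - 1/2 log|K_tilde| - N/2 log(2 pi).
    """
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K_tilde = a * np.exp(-sq_dists / h**2) + sigma**2 * np.eye(N)
    L = np.linalg.cholesky(K_tilde)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_tilde^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log|K_tilde| from the Cholesky factor
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * N * np.log(2.0 * np.pi)
```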

For the squared exponential covariance function specified in Equation 4, the derivatives of $\tilde K$ with respect to the hyperparameters are as follows:

$$\frac{\partial \tilde K}{\partial \sigma} = 2\sigma I, \qquad
\frac{\partial \tilde K}{\partial a} = K_1 \ \text{where}\ [K_1]_{ij} = e^{-\|x_i - x_j\|^2/h^2}, \qquad
\frac{\partial \tilde K}{\partial h} = K_2 \ \text{where}\ [K_2]_{ij} = \frac{2a}{h^3}\|x_i - x_j\|^2\, e^{-\|x_i - x_j\|^2/h^2}. \qquad (23)$$

Substituting these derivatives in Equation 22 and using the notation $\xi = \tilde K^{-1} y$, we have

$$\frac{\partial l}{\partial \sigma} = -\sigma\,\mathrm{tr}\left[\tilde K^{-1}\right] + \sigma\,\xi^T\xi, \qquad
\frac{\partial l}{\partial a} = -\frac{1}{2}\mathrm{tr}\left[\tilde K^{-1} K_1\right] + \frac{1}{2}\xi^T K_1 \xi, \qquad
\frac{\partial l}{\partial h} = -\frac{1}{2}\mathrm{tr}\left[\tilde K^{-1} K_2\right] + \frac{1}{2}\xi^T K_2 \xi. \qquad (24)$$

We perform the optimization in the log hyperparameter space. With respect to the log hyperparameters the derivatives can be written as

$$\frac{\partial l}{\partial \ln\theta_1} = \theta_1\frac{\partial l}{\partial \theta_1} = \sigma\frac{\partial l}{\partial \sigma}, \qquad
\frac{\partial l}{\partial \ln\theta_2} = \theta_2\frac{\partial l}{\partial \theta_2} = a\frac{\partial l}{\partial a}, \qquad
\frac{\partial l}{\partial \ln\theta_3} = \theta_3\frac{\partial l}{\partial \theta_3} = h\frac{\partial l}{\partial h}. \qquad (25)$$

Direct computation of the derivatives requires $O(N^3)$ time and $O(N^2)$ space. Note that the evaluation of $\mathrm{tr}(AB)$ does not require the explicit calculation of $AB$, since we need to evaluate only the diagonal elements to find the trace. The conjugate gradient method along with the IFGT can be used to compute $\xi = \tilde K^{-1} y$ in $O(N)$ time and $O(N)$ space. Once $\xi$ is computed, $K_1\xi$ can be computed in $O(N)$ time using the IFGT. We can use a modified form of the IFGT to compute $K_2\xi$ in $O(N)$ time and $O(N)$ space. For this we need to compute sums of the form

$$G(y_j) = \sum_{i=1}^N q_i\,\frac{\|y_j - x_i\|^2}{h^2}\, e^{-\|y_j - x_i\|^2/h^2}. \qquad (26)$$

This differs from Equation 12 in that the Gaussian is multiplied by the polynomial $\|y_j - x_i\|^2/h^2$. The algorithm is a modification of the original IFGT to handle the modified kernel. The modification requires only a slight change in the implementation; however, the error bounds are different. The complete details of the algorithm can be found in our technical report.
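The sketch below evaluates the gradients of Equation (24), expressed in the log-hyperparameter space of Equation (25), using dense linear algebra. In the fast method the $\xi$ and matrix-vector terms come from CG + IFGT and the traces from the randomized estimator of Section 5.2; the dense version here is our own reference implementation, with all names ours.

```python
import numpy as np

def log_likelihood_gradients(X, y, sigma, a, h):
    """Gradients of l w.r.t. (ln sigma, ln a, ln h), Eqs. (23)-(25),
    computed densely in O(N^3) for illustration and gradient checking."""
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K1 = np.exp(-sq_dists / h**2)                  # [K1]_ij = exp(-||x_i - x_j||^2 / h^2)
    K2 = (2.0 * a / h**3) * sq_dists * K1          # [K2]_ij = d K_tilde / d h
    K_tilde = a * K1 + sigma**2 * np.eye(N)
    K_tilde_inv = np.linalg.inv(K_tilde)           # dense inverse; avoided by the fast method
    xi = K_tilde_inv @ y                           # xi = K_tilde^{-1} y

    dl_dsigma = -sigma * np.trace(K_tilde_inv) + sigma * (xi @ xi)
    dl_da = -0.5 * np.trace(K_tilde_inv @ K1) + 0.5 * xi @ (K1 @ xi)
    dl_dh = -0.5 * np.trace(K_tilde_inv @ K2) + 0.5 * xi @ (K2 @ xi)

    # Eq. (25): derivatives with respect to the log hyperparameters
    return np.array([sigma * dl_dsigma, a * dl_da, h * dl_dh])
```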

Fig. 5. The number of trials T (Equation 30) required to compute $\mathrm{tr}[\tilde K^{-1}]$ as a function of N [for d = 1, h = 0.5, and σ = 0.1] such that 90% of the time the relative error between the actual and the estimated value of the trace is less than 5%, i.e., $Pr[|\hat\alpha - \alpha| \le 0.05\,\alpha] \ge 0.9$.

5.2 Computing the trace

In order to compute $\mathrm{tr}[\tilde K^{-1} A]$ (where $A$ is either $I$ or $K_1$), we use the following randomized algorithm, as described in [MacKay 2003]. This reduces the computational cost of computing the trace from $O(N^3)$ to $O(N^2)$; coupled with the IFGT we can reduce the cost to $O(N)$. Define

$$\tau = d^T \tilde K^{-1} A\, d, \qquad (27)$$

where $d$ is a random vector whose elements are independent Gaussians with mean zero and unit variance. It can be seen that

$$E[\tau] = \mathrm{tr}(\tilde K^{-1} A), \qquad \mathrm{Var}[\tau] = 2\,\mathrm{tr}\!\left([\tilde K^{-1} A]^2\right). \qquad (28)$$

Hence we can estimate the value of $\alpha = \mathrm{tr}(\tilde K^{-1} A)$ by averaging over several different $\tau$'s:

$$\hat\alpha = \frac{1}{T}\sum_{i=1}^T \tau_i = \frac{1}{T}\sum_{i=1}^T d_i^T \tilde K^{-1} A\, d_i = \frac{1}{T}\sum_{i=1}^T (\tilde K^{-1} d_i)^T (A\, d_i). \qquad (29)$$

$\tilde K^{-1} d_i$ can be computed in $O(N)$ time using the conjugate gradient coupled with the IFGT, and $A d_i$ can be computed in $O(N)$ time using the (modified) IFGT. In practice the number of $\tau$'s needed to get a good estimate of the trace is small. More formally, using Chebyshev's inequality we can show that for any $0 < \delta < 1$ and $\epsilon > 0$, if

$$T \ge \frac{2\,\mathrm{tr}\!\left([\tilde K^{-1} A]^2\right)}{\delta\,\epsilon^2\left(\mathrm{tr}\left[\tilde K^{-1} A\right]\right)^2}, \qquad (30)$$

then $Pr[|\hat\alpha - \alpha| \ge \epsilon\alpha] \le \delta$. Experimentally we observed that the number of trials $T$ required is quite small, especially for large values of $N$ (see Figure 5).
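A minimal sketch of the randomized trace estimator of Equations (27)-(29). Here `solve_K_tilde` and `apply_A` stand for the CG+IFGT solver and the (modified) IFGT matvec respectively and are passed in as callables; the function name and the dense check in the comments are ours, intended only for verifying the estimator on small problems.

```python
import numpy as np

def randomized_trace(solve_K_tilde, apply_A, N, T, seed=0):
    """Estimate tr(K_tilde^{-1} A) as in Eq. (29):
        alpha_hat = (1/T) sum_i (K_tilde^{-1} d_i)^T (A d_i),
    with d_i standard normal probe vectors."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(T):
        d = rng.standard_normal(N)              # probe vector, E[d d^T] = I
        total += solve_K_tilde(d) @ apply_A(d)  # one tau_i of Eq. (27)
    return total / T

# Dense usage check on a small problem (illustrative only):
#   est = randomized_trace(lambda v: np.linalg.solve(K_tilde, v),
#                          lambda v: A @ v, N, T=50)
#   exact = np.trace(np.linalg.solve(K_tilde, A))
```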

Fig. 6. Scaling behavior for hyperparameter selection. (a) Total running time in seconds as a function of N for hyperparameter selection using the direct and the fast method. (b), (c), and (d) The hyperparameters σ, h, and a chosen by both methods.

Figure 6(a) shows the total running time in seconds as a function of $N$ for hyperparameter selection using the direct and the fast method. Figures 6(b), 6(c), and 6(d) show the corresponding hyperparameters chosen.

6. CONCLUSION

We used the improved fast Gauss transform to reduce the computational complexity of Gaussian process regression to linear $O(N)$. Substantial speedups were achieved on different datasets without any loss of accuracy. In this paper we presented a fast method for the Gaussian covariance function; however, various other covariance functions, like the Matérn class of kernels, can be used, and a similar algorithm based on a Taylor series expansion of the covariance function can be developed. The techniques presented here can also be used to speed up classification [Williams and Barber 1998; Kuss and Rasmussen 2005] using a Gaussian process model.

REFERENCES

Delve datasets. delve/data/datasets.html.

Csato, L. and Opper, M. 2002. Sparse on-line Gaussian processes. Neural Computation 14, 3.
Deng, K. and Moore, A. 1995. Multiresolution instance-based learning. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco.
Fine, S. and Scheinberg, K. 2001. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research 2.
Gray, A. G. and Moore, A. W. 2003. Nonparametric density estimation: Toward computational tractability. In SIAM International Conference on Data Mining.
Greengard, L. and Strain, J. 1991. The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing 12, 1.
Hestenes, M. R. and Stiefel, E. 1952. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49.
Kelley, C. T. 1995. Iterative Methods for Linear and Nonlinear Equations. SIAM.
Kuss, M. and Rasmussen, C. E. 2005. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research 6.
Lawrence, N., Seeger, M., and Herbrich, R. 2003. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems 15. MIT Press.
MacKay, D. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
MacKay, D. and Gibbs, M. N. 1997. Efficient implementation of Gaussian processes.
Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. J. 1998. UCI repository of machine learning databases. mlearn/MLRepository.html.
Rasmussen, C. E. and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. The MIT Press.
Raykar, V. C., Yang, C., Duraiswami, R., and Gumerov, N. 2005. Fast computation of sums of Gaussians in high dimensions. Tech. Rep. CS-TR-4767, Department of Computer Science, University of Maryland, College Park.
Seeger, M. 2004. Gaussian processes for machine learning. International Journal of Neural Systems 14, 2.
Shen, Y., Ng, A. Y., and Seeger, M. 2005. Fast Gaussian process regression using kd-trees. In Advances in Neural Information Processing Systems 18.
Simoncini, V. and Szyld, D. B. 2004. Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM Journal on Scientific Computing 25, 2.
Smola, A. and Bartlett, P. 2001. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems. MIT Press.
Tipping, M. 2001. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1.
Tresp, V. 2000. A Bayesian committee machine. Neural Computation 12, 11.
Williams, C. K. I. and Rasmussen, C. E. 1996. Gaussian processes for regression. In Advances in Neural Information Processing Systems 8.
Williams, C. K. I. and Seeger, M. 2000. The effect of the input density distribution on kernel-based classifiers. In International Conference on Machine Learning 17, P. Langley, Ed. Morgan Kaufmann.
Williams, C. K. I. and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems. MIT Press.
Williams, C. K. I. and Barber, D. 1998. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 12.
Yang, C., Duraiswami, R., and Davis, L. 2005. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems 17.


More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Learning Gaussian Process Kernels via Hierarchical Bayes

Learning Gaussian Process Kernels via Hierarchical Bayes Learning Gaussian Process Kernels via Hierarchical Bayes Anton Schwaighofer Fraunhofer FIRST Intelligent Data Analysis (IDA) Kekuléstrasse 7, 12489 Berlin anton@first.fhg.de Volker Tresp, Kai Yu Siemens

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment.

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1 Introduction and Problem Formulation In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1.1 Machine Learning under Covariate

More information

Kernel Conjugate Gradient

Kernel Conjugate Gradient Kernel Conjugate Gradient Nathan Ratliff Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 ndr@andrew.cmu.edu J. Andrew Bagnell Robotics Institute Carnegie Mellon University Pittsburgh,

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Least Squares SVM Regression

Least Squares SVM Regression Least Squares SVM Regression Consider changing SVM to LS SVM by making following modifications: min (w,e) ½ w 2 + ½C Σ e(i) 2 subject to d(i) (w T Φ( x(i))+ b) = e(i), i, and C>0. Note that e(i) is error

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS

SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS Filippo Portera and Alessandro Sperduti Dipartimento di Matematica Pura ed Applicata Universit a di Padova, Padova, Italy {portera,sperduti}@math.unipd.it

More information

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II 1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector

More information

Learning to Learn and Collaborative Filtering

Learning to Learn and Collaborative Filtering Appearing in NIPS 2005 workshop Inductive Transfer: Canada, December, 2005. 10 Years Later, Whistler, Learning to Learn and Collaborative Filtering Kai Yu, Volker Tresp Siemens AG, 81739 Munich, Germany

More information

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method

More information

Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics

Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics 1 / 38 Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics Chris Williams with Kian Ming A. Chai, Stefan Klanke, Sethu Vijayakumar December 2009 Motivation 2 / 38 Examples

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS VIKAS CHANDRAKANT RAYKAR DECEMBER 5, 24 Abstract. We interpret spectral clustering algorithms in the light of unsupervised

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose

More information

Improved Fast Gauss Transform. Fast Gauss Transform (FGT)

Improved Fast Gauss Transform. Fast Gauss Transform (FGT) 10/11/011 Improved Fast Gauss Transform Based on work by Changjiang Yang (00), Vikas Raykar (005), and Vlad Morariu (007) Fast Gauss Transform (FGT) Originally proposed by Greengard and Strain (1991) to

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (6-83, F) Lecture# (Monday November ) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan Applications of Gaussian Processes (a) Inverse Kinematics

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe

More information