EigenGP: Gaussian Process Models with Adaptive Eigenfunctions

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

EigenGP: Gaussian Process Models with Adaptive Eigenfunctions

Hao Peng, Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA, pengh@purdue.edu
Yuan Qi, Departments of Computer Science and Statistics, Purdue University, West Lafayette, IN 47907, USA, alanqi@cs.purdue.edu

Abstract

Gaussian processes (GPs) provide a nonparametric representation of functions. However, classical GP inference suffers from high computational cost for big data. In this paper, we propose a new Bayesian approach, EigenGP, that learns both basis dictionary elements (eigenfunctions of a GP prior) and prior precisions in a sparse finite model. It is well known that, among all orthogonal basis functions, eigenfunctions can provide the most compact representation. Unlike other sparse Bayesian finite models where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space as a finite linear combination of kernel functions. We learn the dictionary elements (eigenfunctions) and the prior precisions over these elements, as well as all the other hyperparameters, from data by maximizing the model marginal likelihood. We explore computational linear algebra to simplify the gradient computation significantly. Our experimental results demonstrate improved predictive performance of EigenGP over alternative sparse GP methods as well as relevance vector machines.

1 Introduction

Gaussian processes (GPs) are powerful nonparametric Bayesian models with numerous applications in machine learning and statistics. GP inference, however, is costly. Training the exact GP regression model with N samples is expensive: it takes an O(N^2) space cost and an O(N^3) time cost. To address this issue, a variety of approximate sparse GP inference approaches have been developed [Williams and Seeger, 2001; Csató and Opper, 2002; Snelson and Ghahramani, 2006; Lázaro-Gredilla et al., 2010; Williams and Barber, 1998; Titsias, 2009; Qi et al., 2010; Higdon, 2002; Cressie and Johannesson, 2008], for example, using the Nyström method to approximate covariance matrices [Williams and Seeger, 2001], optimizing a variational bound on the marginal likelihood [Titsias, 2009], or grounding the GP on a small set of (blurred) basis points [Snelson and Ghahramani, 2006; Qi et al., 2010]. An elegant unifying view of various sparse GP regression models is given by Quiñonero-Candela and Rasmussen [2005].

Among all sparse GP regression methods, a state-of-the-art approach is to represent a function as a sparse finite linear combination of pairs of trigonometric basis functions, a sine and a cosine for each spectral point; this approach is therefore called the sparse spectrum Gaussian process (SSGP) [Lázaro-Gredilla et al., 2010]. SSGP integrates out both weights and phases of the trigonometric functions and learns all hyperparameters of the model (frequencies and amplitudes) by maximizing the marginal likelihood. Using global trigonometric basis functions, SSGP has the capability of approximating any stationary Gaussian process model and has been shown to outperform alternative sparse GP methods, including the fully independent training conditional (FITC) approximation [Snelson and Ghahramani, 2006], on benchmark datasets. Another popular sparse Bayesian finite linear model is the relevance vector machine (RVM) [Tipping, 2000; Faul and Tipping, 2001]. It uses kernel expansions over training samples as basis functions and selects the basis functions by automatic relevance determination [MacKay, 1992; Faul and Tipping, 2001].
In this paper, we propose a new sparse Bayesian approach, EigenGP, that learns both functional dictionary elements (eigenfunctions) and prior precisions in a finite linear model representation of a GP. It is well known that, among all orthogonal basis functions, eigenfunctions provide the most compact representation. Unlike SSGP or RVMs, where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space (RKHS) as a finite linear combination of kernel functions with their weights learned from data. We further marginalize out the weights over the eigenfunctions and estimate all hyperparameters, including the basis points for the eigenfunctions, the lengthscales, and the precision of the weight prior, by maximizing the model marginal likelihood (also known as evidence). To do so, we explore computational linear algebra and greatly simplify the gradient computation for optimization (thus our optimization method is totally different from RVM optimization methods). As a result of this optimization, our eigenfunctions are data dependent and make EigenGP capable of accurately modeling nonstationary data. Furthermore, by adding an additional kernel term to our model, we can turn the finite model into an infinite model to model the prediction uncertainty better; that is, it can give nonzero prediction variance when a test sample is far from the training samples.

EigenGP is computationally efficient. It takes O(NM) space and O(NM^2) time for training on N samples with M basis functions, which is the same as SSGP and more efficient than RVMs (as RVMs learn weights over N, not M, basis functions). Similar to SSGP and FITC, EigenGP focuses on predictive accuracy at low computational cost, rather than on faithfully converging towards the full GP as the number of basis functions grows. (For the latter case, please see the approach of Yan and Qi [2010], which explicitly minimizes the KL divergence between the exact and approximate GP posterior processes.)

The rest of the paper is organized as follows. Section 2 describes the background of GPs. Section 3 presents the EigenGP model and an illustrative example. Section 4 outlines the marginal likelihood maximization for learning the dictionary elements and the other hyperparameters. In Section 5, we discuss related work. Section 6 shows regression results on multiple benchmark regression datasets, demonstrating improved performance of EigenGP over the Nyström method [Williams and Seeger, 2001], RVM, SSGP, and FITC.

2 Background of Gaussian Processes

We denote N independent and identically distributed samples as D = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a D-dimensional input (i.e., explanatory variables) and y_i is a scalar output (i.e., a response), which we assume is the noisy realization of a latent function f at x_i. A Gaussian process places a prior distribution over the latent function f. Its projection f at {x_i}_{i=1}^N defines a joint Gaussian distribution p(f) = N(f | m_0, K), where, without any prior preference, the mean m_0 is set to 0 and the covariance K_{ij} = k(x_i, x_j) encodes the prior notion of smoothness. A popular choice is the anisotropic squared exponential covariance function:

$k(x, x') = a^2 \exp\left(-(x - x')^T \mathrm{diag}(\eta)\,(x - x')\right)$,

where the hyperparameters include the signal variance a^2 and the lengthscales η = {η_d}_{d=1}^D, controlling how fast the covariance decays with the distance between inputs. Using this covariance function, we can prune input dimensions by shrinking the corresponding lengthscales based on the data (when η_d = 0, the d-th dimension becomes totally irrelevant to the covariance function value). This pruning is known as automatic relevance determination (ARD), and this covariance is therefore also called the ARD squared exponential. Note that the covariance function value remains the same whenever (x - x') is the same, regardless of where x and x' are, which leads to a stationary GP model. For nonstationary data, however, a stationary GP model is a misfit. Although nonstationary GP models have been developed and applied to real-world applications, they are often limited to low-dimensional problems, such as applications in spatial statistics [Paciorek and Schervish, 2004]. Constructing general nonstationary GP models remains a challenging task.

For regression, we use a Gaussian likelihood function p(y_i | f) = N(y_i | f(x_i), σ^2), where σ^2 is the variance of the observation noise. Given the Gaussian process prior over f and the data likelihood, the exact posterior process is p(f | D) ∝ GP(f | 0, k) ∏_{i=1}^N p(y_i | f). Although the posterior process for GP regression has an analytical form, we need to store and invert an N by N matrix, which has computational complexity O(N^3), rendering GPs infeasible for big data analytics.

3 Model of EigenGP

To enable fast inference and obtain a nonstationary covariance function, our new model projects the GP prior onto an eigensubspace. Specifically, we set the latent function to

$f(x) = \sum_{j=1}^{M} \alpha_j \phi_j(x)$,   (1)

where M ≪ N and {φ_j(x)} are eigenfunctions of the GP prior.
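As a concrete reference, the ARD squared exponential covariance above translates directly into code. The following is a minimal NumPy sketch under our own naming conventions; it is an illustration, not code from the paper.

```python
import numpy as np

def ard_se_kernel(X1, X2, a2, eta):
    """ARD squared exponential kernel: k(x, x') = a2 * exp(-(x - x')^T diag(eta) (x - x')).

    X1: (n1, D) and X2: (n2, D) inputs; a2: signal variance; eta: (D,) per-dimension scales.
    Returns the (n1, n2) covariance matrix.
    """
    diff = X1[:, None, :] - X2[None, :, :]            # pairwise differences, shape (n1, n2, D)
    sq = np.einsum('ijd,d,ijd->ij', diff, eta, diff)  # weighted squared distances
    return a2 * np.exp(-sq)
```

Setting a component of eta to zero removes the corresponding input dimension from the covariance, which is exactly the ARD pruning effect described above.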
We assign a Gaussian prior over α = [α_1, ..., α_M]^T, α ~ N(0, diag(w)), so that f follows a GP prior with zero mean and the covariance function

$\tilde{k}(x, x') = \sum_{j=1}^{M} w_j \phi_j(x) \phi_j(x')$.   (2)

To compute the eigenfunctions {φ_j(x)}, we can use the Galerkin projection to approximate them by Hermite polynomials [Marzouk and Najm, 2009]. For high-dimensional problems, however, this approach requires a tensor product of univariate Hermite polynomials that dramatically increases the number of parameters. To avoid this problem, we apply the Nyström method [Williams and Seeger, 2001], which allows us to obtain an approximation to the eigenfunctions in a high-dimensional space efficiently. Specifically, given inducing inputs (i.e., basis points) B = [b_1, ..., b_M], we replace

$\int k(x, x') \phi_j(x) p(x)\, dx = \lambda_j \phi_j(x')$   (3)

by its Monte Carlo approximation

$\frac{1}{M} \sum_{i=1}^{M} k(x, b_i) \phi_j(b_i) \approx \lambda_j \phi_j(x)$.   (4)

Then, by evaluating this equation at B to estimate the values of φ_j(b_i) and λ_j, we obtain the j-th eigenfunction φ_j(x) as

$\phi_j(x) = \frac{\sqrt{M}}{\lambda_j^{(M)}}\, \mathbf{k}(x)\, \mathbf{u}_j^{(M)} = \mathbf{k}(x)\, \mathbf{u}_j$,   (5)

where k(x) ≜ [k(x, b_1), ..., k(x, b_M)], λ_j^{(M)} and u_j^{(M)} are the j-th eigenvalue and eigenvector of the covariance function evaluated at B, and u_j = √M u_j^{(M)} / λ_j^{(M)}. Note that, for simplicity, we have chosen the number of inducing inputs to be the same as the number of eigenfunctions; in practice, we can use more inducing inputs while computing only the top M eigenvectors. As shown in (5), each eigenfunction lives in an RKHS as a linear combination of the kernel functions evaluated at B with weights u_j.
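To make the construction in (3)-(5) concrete, here is a small NumPy sketch of the Nyström eigenfunction approximation. It is our own illustration and naming, assuming a kernel function such as ard_se_kernel above; it is not the authors' implementation.

```python
import numpy as np

def nystrom_eigenfunctions(B, kernel, **kernel_args):
    """Build a function phi(X) that evaluates the M Nystrom eigenfunctions of Eq. (5) at inputs X.

    B: (M, D) inducing inputs; kernel(X1, X2, ...): covariance function, e.g. ard_se_kernel.
    """
    M = B.shape[0]
    K_BB = kernel(B, B, **kernel_args)          # covariance evaluated at the basis points
    lam, U = np.linalg.eigh(K_BB)               # eigenvalues ascending; columns of U are u_j^(M)
    lam, U = lam[::-1], U[:, ::-1]              # reorder so j indexes the largest eigenvalues first
    W = np.sqrt(M) * U / lam                    # column j holds u_j = sqrt(M) u_j^(M) / lambda_j^(M)
    def phi(X):
        return kernel(X, B, **kernel_args) @ W  # (N, M) matrix with entries phi_j(x_i)
    return phi
```

In practice one would guard against very small eigenvalues of K_BB (for example by adding a small jitter to its diagonal) before dividing by them.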

Inserting (5) into (1), we obtain

$f(x) = \sum_{j=1}^{M} \alpha_j \sum_{i=1}^{M} u_{ij}\, k(x, b_i)$.   (6)

This equation reveals a two-layer structure of EigenGP. The first layer linearly combines multiple kernel functions to generate each eigenfunction φ_j. The second layer takes these eigenfunctions as the basis functions to generate the function value f. Note that f is a Bayesian linear combination of {φ_j} in which the weights α are integrated out to avoid overfitting. All the model hyperparameters are learned from data. Specifically, for the first layer, to learn the eigenfunctions {φ_j}, we estimate the inducing inputs B and the kernel hyperparameters (such as the lengthscales η for the ARD kernel) by maximizing the model marginal likelihood. For the second layer, we marginalize out α to avoid overfitting and maximize the model marginal likelihood to learn the hyperparameters w of the prior.

With the estimated hyperparameters, the prior over f is nonstationary because its covariance function (2) varies in different regions of x. This comes as no surprise, since the eigenfunctions are tied to p(x) in (3). This nonstationarity reflects the fact that our model is adaptive to the distribution of the explanatory variables. Note that, to recover the full uncertainty captured by the kernel function k, we can add the following term to the kernel function of EigenGP:

$\delta(x - x')\left(k(x, x') - \tilde{k}(x, x')\right)$,   (7)

where δ(a) = 1 if and only if a = 0. Compared to the original EigenGP model, which has a finite degree of freedom, this modified model has an infinite number of basis functions (assuming k has an infinite number of basis functions, as the ARD kernel does). Thus, this model can accurately model the uncertainty of a test point even when it is far from the training set. We derive the optimization updates of all the hyperparameters for both the original and modified models. But, according to our experiments, the modified model does not improve the prediction accuracy over the original (it even reduces the accuracy sometimes). Therefore, we focus on the original model in our presentation for its simplicity.

Before we present the details of hyperparameter optimization, let us first look at an illustrative example of its effect, in particular, the effect of optimizing the inducing inputs B.

3.1 Illustrative Example

For this example, we consider a toy dataset used by the FITC algorithm [Snelson and Ghahramani, 2006]. It contains 200 one-dimensional training samples. We use 5 basis points (M = 5) and choose the ARD kernel. We then compare the basis functions {φ_j} and the corresponding predictive distributions in two cases. For the first case, we use the kernel width η learned from the full GP model as the kernel width for EigenGP, and apply K-means to set the basis points B to the cluster centers; the idea of using K-means to set the basis points has been suggested by Zhang et al. [2008] to minimize an error bound for the Nyström approximation. For the second case, we optimize η, B and w by maximizing the marginal likelihood of EigenGP. The results are shown in Figure 1.

Figure 1: Illustration of the effect of optimizing the basis points (i.e., inducing inputs). Columns: basis points learned from K-means and basis points estimated by evidence maximization; rows: predictions and eigenfunctions. In the first row, the pink dots represent data points, the blue and red solid curves correspond to the predictive mean of EigenGP and ± two standard deviations around the mean, the black curve corresponds to the predictive mean of the full GP, and the black crosses denote the basis points. In the second row, the curves in various colors represent the five eigenfunctions of EigenGP.
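Before turning to the effect of optimizing B, it may help to see how predictions are formed in the finite model (1)-(2). The sketch below (ours, not the authors' code) computes the posterior over α for a Bayesian linear model with prior α ~ N(0, diag(w)) and Gaussian noise, and returns the predictive mean and variance at test inputs.

```python
import numpy as np

def eigengp_predict(Phi_train, Phi_test, y, w, sigma2):
    """Predictive mean/variance for f(x) = sum_j alpha_j phi_j(x) with alpha ~ N(0, diag(w)).

    Phi_train: (N, M) eigenfunctions at training inputs; Phi_test: (T, M) at test inputs;
    y: (N,) targets; w: (M,) prior variances; sigma2: observation noise variance.
    """
    # Posterior over alpha: Sigma = (diag(1/w) + Phi^T Phi / sigma2)^{-1}, mu = Sigma Phi^T y / sigma2.
    A = np.diag(1.0 / w) + Phi_train.T @ Phi_train / sigma2
    Sigma = np.linalg.inv(A)
    mu = Sigma @ (Phi_train.T @ y) / sigma2
    mean = Phi_test @ mu                                                  # predictive mean
    var = sigma2 + np.einsum('tm,mk,tk->t', Phi_test, Sigma, Phi_test)    # predictive variance
    return mean, var
```

Because only an M x M system is solved, the cost is O(NM^2) rather than the O(N^3) of exact GP regression. A test point whose eigenfunction values are all close to zero receives a predictive variance close to sigma2, which is the behavior that the correction term (7) is designed to fix.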
The first row demonstrates that, by optimizing the hyperparameters, EigenGP achieves predictions very close to those of the full GP while using only 5 basis functions. In contrast, when the basis points are set to the cluster centers found by K-means, EigenGP yields predictions significantly different from those of the full GP and fails to capture the data trend, in particular for x in (3, 5). The second row of Figure 1 shows that K-means sets the basis points almost evenly spaced over x, and accordingly the five eigenfunctions are smooth global basis functions whose shapes are not directly linked to the function they fit. Evidence maximization, by contrast, sets the basis points unevenly spaced, generating basis functions whose shapes are more localized and adaptive to the function they fit; for example, the eigenfunction represented by the red curve matches the data on the right well.

4 Learning Hyperparameters

In this section we describe how we optimize all the hyperparameters, denoted by θ, which include B in the covariance function (2) and all the kernel hyperparameters (e.g., a and η). To optimize θ, we maximize the marginal likelihood (i.e., evidence) based on a conjugate Newton method (we use the code from .../code/matlab/doc). We explore two strategies for evidence maximization. The first one is sequential optimization, which first fixes w while updating all the other hyperparameters, and then optimizes w while fixing the other hyperparameters. The second strategy is to optimize all the hyperparameters jointly. Here we skip the details of the more complicated joint optimization and describe the key gradient formulas for sequential optimization.

The key computation in our optimization is the log marginal likelihood and its gradient:

$\ln p(y \mid \theta) = -\tfrac{1}{2} \ln|C_N| - \tfrac{1}{2} y^T C_N^{-1} y - \tfrac{N}{2} \ln(2\pi)$,   (8)

$d \ln p(y \mid \theta) = -\tfrac{1}{2}\left[\mathrm{tr}(C_N^{-1}\, dC_N) - \mathrm{tr}(C_N^{-1} y y^T C_N^{-1}\, dC_N)\right]$,   (9)

where C_N = K + σ^2 I, K = Φ diag(w) Φ^T, and Φ = {φ_m(x_n)} is an N by M matrix. Because the rank of K is M, we can compute ln|C_N| and C_N^{-1} efficiently at a cost of O(NM^2) via the matrix inversion and determinant lemmas. Even with the use of the matrix inversion lemma for the low-rank computation, a naive calculation would be very costly. We apply identities from computational linear algebra [Minka, 2000; de Leeuw, 2007] to simplify the needed computation dramatically.

To compute the derivative with respect to B, we first notice that, when w is fixed, we have

$C_N = K + \sigma^2 I = K_{XB} K_{BB}^{-1} K_{BX} + \sigma^2 I$,   (10)

where K_XB is the cross-covariance matrix between the training data X and the inducing inputs B, and K_BB is the covariance matrix on B. For the ARD squared exponential kernel, utilizing the identities tr(P^T Q) = vec(P)^T vec(Q) and vec(P ∘ Q) = diag(vec(P)) vec(Q), where vec(·) vectorizes a matrix into a column vector and ∘ represents the Hadamard product, we can derive the derivative of the first trace term in (9):

$\mathrm{tr}(C_N^{-1}\, dC_N) = 4\,\mathbf{1}^T\!\left\{\left[R X^T \mathrm{diag}(\eta) - ((R\mathbf{1})\mathbf{1}^T) \circ (B^T \mathrm{diag}(\eta)) - S B^T \mathrm{diag}(\eta) + ((S\mathbf{1})\mathbf{1}^T) \circ (B^T \mathrm{diag}(\eta))\right] \circ dB^T\right\}\mathbf{1}$,   (11)

where 1 is a column vector of all ones, the columns of X and B hold the training inputs and the inducing inputs, and

$R = (K_{BB}^{-1} K_{BX} C_N^{-1}) \circ K_{BX}$,   (12)

$S = (K_{BB}^{-1} K_{BX} C_N^{-1} K_{XB} K_{BB}^{-1}) \circ K_{BB}$.   (13)

Note that we can compute K_BX C_N^{-1} efficiently via low-rank updates. Also, (R1)1^T and (S1)1^T can be implemented efficiently by first summing over the columns of R and S and then copying the resulting vectors multiple times, without any multiplication operation. To obtain tr(C_N^{-1} y y^T C_N^{-1} dC_N), we simply replace C_N^{-1} in (11), (12) and (13) by C_N^{-1} y y^T C_N^{-1}. Using similar derivations, we can obtain the derivatives with respect to the lengthscales η, a and σ, respectively.

To compute the derivative with respect to w, we can use the formula tr(P diag(w) Q) = 1^T (Q^T ∘ P) w to obtain the two trace terms in (9) as follows:

$\mathrm{tr}(C_N^{-1}\, dC_N) = \mathbf{1}^T\left(\Phi \circ (C_N^{-1} \Phi)\right) dw$,   (14)

$\mathrm{tr}(C_N^{-1} y y^T C_N^{-1}\, dC_N) = \mathbf{1}^T\left(\Phi \circ (C_N^{-1} y y^T C_N^{-1} \Phi)\right) dw$.   (15)

For either sequential or joint optimization, the overall computational complexity is O(max(M, D) N), where D is the data dimension.
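As an illustration of the O(NM^2) evidence computation described above, the following sketch evaluates (8) using the matrix inversion and determinant lemmas. It is our own NumPy illustration, not the authors' MATLAB code.

```python
import numpy as np

def eigengp_log_evidence(Phi, y, w, sigma2):
    """Log marginal likelihood ln p(y | theta), Eq. (8), for C_N = Phi diag(w) Phi^T + sigma2 I.

    Phi: (N, M) eigenfunction matrix; y: (N,) targets; w: (M,) prior variances; sigma2: noise variance.
    """
    N, M = Phi.shape
    A = np.diag(1.0 / w) + Phi.T @ Phi / sigma2    # M x M matrix from the Woodbury identity
    L = np.linalg.cholesky(A)
    v = np.linalg.solve(L, Phi.T @ y / sigma2)
    quad = y @ y / sigma2 - v @ v                  # y^T C_N^{-1} y without forming C_N
    # Matrix determinant lemma: ln|C_N| = ln|A| + sum(ln w) + N ln(sigma2).
    logdet = 2.0 * np.sum(np.log(np.diag(L))) + np.sum(np.log(w)) + N * np.log(sigma2)
    return -0.5 * (logdet + quad + N * np.log(2.0 * np.pi))
```

The gradients (11)-(15) can be accumulated from the same low-rank quantities, so no N x N matrix is ever formed.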
5 Related Work

Our work is closely related to the seminal work of Williams and Seeger [2001], but they differ in multiple aspects. First, we define a valid probabilistic model based on an eigendecomposition of the GP prior. By contrast, the previous approach [Williams and Seeger, 2001] aims at a low-rank approximation to the finite covariance/kernel matrix used in GP training from a numerical approximation perspective, and its predictive distribution is not well-formed in a probabilistic framework (e.g., it may give a negative variance of the predictive distribution). Second, while the Nyström method simply uses the first few eigenvectors, we maximize the model marginal likelihood to adjust their weights in the covariance function. Third, by exploiting the clustering property of the eigenfunctions of the Gaussian kernel, our approach can conduct semi-supervised learning, while the previous one cannot; the semi-supervised learning capability of EigenGP is investigated in another paper of ours. Fourth, the Nyström method lacks a principled way to learn the model hyperparameters, including the kernel width and the basis points, while EigenGP does not. Our work is also related to methods that use kernel principal component analysis (PCA) to speed up kernel machines [Hoegaerts et al., 2005]. However, for these methods it can be difficult, if not impossible, to learn important hyperparameters, including a kernel width for each dimension and inducing inputs (not a subset of the training samples). By contrast, EigenGP learns all these hyperparameters from data based on gradients of the model marginal likelihood.

6 Experimental Results

In this section, we compare EigenGP and alternative methods on synthetic and real benchmark datasets. The alternative methods include the sparse GP methods SSGP and FITC and the Nyström method, as well as RVMs. We implemented the Nyström method ourselves and downloaded the software implementations of the other methods from their authors' websites. For RVMs, we used the fast fixed-point algorithm [Faul and Tipping, 2001]. We used the ARD kernel for all the methods except RVMs (since they do not estimate the lengthscales in this kernel) and optimized all the hyperparameters via evidence maximization. For RVMs, we chose the squared exponential kernel with the same lengthscale for all the dimensions and applied cross-validation on the training data to select the lengthscale. On the large real data, we used the values of η, a, and σ learned from the full GP on a subset of the training data to initialize all the methods except RVMs. For the remaining configurations, we used the default settings of the downloaded software packages. For our own model, we denote the versions with sequential and joint optimization as EigenGP and EigenGP*, respectively.

To evaluate the test performance of each method, we measure the Normalized Mean Square Error (NMSE) and the Mean Negative Log Probability (MNLP), defined as

$\mathrm{NMSE} = \sum_i (y_i - \mu_i)^2 \big/ \sum_i (\mu_i - \bar{y})^2$,   (16)

$\mathrm{MNLP} = \frac{1}{2N} \sum_i \left[(y_i - \mu_i)^2 / \sigma_i^2 + \ln \sigma_i^2 + \ln 2\pi\right]$,   (17)

where y_i, μ_i and σ_i^2 are the response value, the predictive mean and the predictive variance for the i-th test point, respectively, and ȳ is the average response value of the training data.
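The two metrics in (16)-(17) are straightforward to compute; a small NumPy version (naming ours) is:

```python
import numpy as np

def nmse(y, mu, y_train_mean):
    """Normalized mean square error, following Eq. (16)."""
    return np.sum((y - mu) ** 2) / np.sum((mu - y_train_mean) ** 2)

def mnlp(y, mu, var):
    """Mean negative log probability, Eq. (17); var holds the predictive variances."""
    return 0.5 * np.mean((y - mu) ** 2 / var + np.log(var) + np.log(2.0 * np.pi))
```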

6.1 Approximation Quality on Synthetic Data

Table 1: NMSE and MNLP on the two synthetic datasets for the Nyström method, Nyström*, SSGP, FITC, EigenGP, and EigenGP* (mean ± standard error over the runs).

Figure 2: Predictions of four sparse GP methods: (a) the Nyström method, (b) SSGP, (c) FITC, and (d) EigenGP. Pink dots represent training data, blue curves are predictive means, red curves are two standard deviations above and below the mean curves, and the black crosses indicate the inducing inputs.

As in Section 3.1, we use the synthetic data from the FITC paper for this comparative study. To let all the methods have the same computational complexity, we set the number of inducing inputs to M = 7. The results are summarized in Figure 2. For the Nyström method, we used the kernel width learned from the full GP and applied K-means to choose the basis locations [Zhang et al., 2008]. Figure 2(a) shows that it does not fit well. Figure 2(b) demonstrates that the prediction of SSGP oscillates outside the range of the training samples, probably because the sinusoidal components are global and span the whole data range (increasing the number of basis functions would improve SSGP's predictive performance, but would also increase the computational cost). As shown by Figure 2(c), FITC fails to capture the turns of the data accurately, while EigenGP, shown in Figure 2(d), can.

Using the full GP predictive mean as the label (we do not have the true y values in the test data), we compute the NMSE and MNLP of all the methods. The average results over repeated runs are reported in Table 1 (dataset 1). We report two versions of the Nyström method: for the first version, the kernel width is learned from the full GP and the basis locations are chosen by K-means as before; for the second version, denoted Nyström*, the hyperparameters are learned by evidence maximization. Note that the evidence maximization algorithm for the Nyström approximation is itself new; we developed it for this comparative analysis. Table 1 shows that both EigenGP and EigenGP* approximate the mean of the full GP model more accurately than the other methods, in particular, several orders of magnitude better than the Nyström method. Furthermore, we add the difference term (7) to the kernel function and denote this version of our algorithm EigenGP+. It gives better predictive variance far from the training data, but its predictive mean is slightly worse than that of the version without this term. Thus, on the other datasets, we only use the versions without this term (EigenGP and EigenGP*) for their simplicity and effectiveness.

We also examine the performance of all these methods at a higher computational complexity, using a larger M. Again, both versions of the Nyström method give poor predictive distributions, and SSGP still produces extra wavy patterns outside the training data. FITC, EigenGP and EigenGP+ give good predictions. Again, EigenGP+ gives better predictive variance far from the training data, but with a predictive mean similar to that of EigenGP. Finally, we compare the RVM with EigenGP* on this dataset: EigenGP* is both faster and more accurate (EigenGP performs similarly).
6.2 Prediction Quality on Nonstationary Data

We then compare all the sparse GP methods on a one-dimensional nonstationary synthetic dataset. The underlying function is f(x) = x sin(x^3), where x ∈ (0, 3), and white Gaussian noise is added to the function values. This function is nonstationary in the sense that its frequency and amplitude increase as x increases from 0. We randomly generated the data multiple times and set the number of basis points (functions) to be the same for all the competing methods.
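For reference, data of this kind can be generated along the following lines. This is only a sketch: the sample size and noise level are placeholders, and the form x*sin(x^3) is inferred from the description of growing frequency and amplitude.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 3.0, size=300))   # sample size chosen for illustration only
f = x * np.sin(x ** 3)                         # nonstationary target: frequency and amplitude grow with x
y = f + 0.1 * rng.normal(size=x.shape)         # additive white Gaussian noise (illustrative level)
```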

Figure 3: Predictions on the nonstationary data: (a) SSGP, (b) FITC, (c) EigenGP. The pink dots correspond to noisy data around the true function f(x) = x sin(x^3), represented by the green curves. The blue and red solid curves correspond to the predictive means and ± two standard deviations around the means. The black crosses near the bottom represent the estimated basis points for FITC and EigenGP.

Using the true function value as the label, we compute the means and standard errors of the NMSE and MNLP, reported in Table 1 (dataset 2). For the Nyström method, marginal likelihood optimization leads to much smaller error than the K-means based approach; however, both fare poorly when compared with the alternative methods. Table 1 also shows that EigenGP and EigenGP* achieve a striking error reduction of several thousand fold compared with Nyström*, and a further substantial error reduction compared with the second best method. We also compared RVMs with EigenGP* on this dataset, averaged over the runs (EigenGP gives results similar to EigenGP*).

We further illustrate the predictive mean and standard deviation on a typical run in Figure 3. As shown in Figure 3(a), the predictive mean of SSGP contains reasonable high-frequency components in the right region but, since SSGP is a stationary GP model, these high-frequency components produce extra wavy patterns in the left region. In addition, the predictive mean on the right is smaller than the true one, probably affected by the small dynamic range of the data on the left. Figure 3(b) shows that the predictive mean of FITC in the right region has lower frequency and smaller amplitude than the true function, perhaps influenced by the low-frequency part on the left. Indeed, because of the low-frequency part, FITC learns a large kernel width η on average over the runs, and this large value affects the quality of the learned basis points (e.g., a lack of basis points in the high-frequency region on the right). By contrast, using the same initial kernel width as FITC, EigenGP learns a suitable kernel width on average and provides good predictions, as shown in Figure 3(c).

6.3 Accuracy vs. Time on Real Data

To evaluate the trade-off between prediction accuracy and computational cost, we use three large real datasets. The first dataset is California Housing [Pace and Barry, 1997]; we randomly split this 8-dimensional dataset into 10,000 training and 10,640 test points. The second dataset is Physicochemical Properties of Protein Tertiary Structures (PPPTS), obtained from Lichman [2013]; we randomly split this 9-dimensional dataset into training and test sets. The third dataset is Pole Telecomm, which was used in Lázaro-Gredilla et al. [2010]; it contains 10,000 training and 5,000 test samples, each of which has 26 features. We set M to five increasing values and fixed the same maximum number of optimization iterations for all methods. The NMSE, MNLP and training time of these methods are shown in Figure 4. In addition, we ran the Nyström method based on marginal likelihood maximization, which is better than using K-means to set the basis points. Again, the Nyström method performed orders of magnitude worse than the other methods on all three datasets across the range of M; its MNLP values are consistently large and are omitted here for simplicity.
For RVMs, we include the cross-validation in the training time because choosing an appropriate kernel width is crucial for the RVM. Since the RVM learns the number of basis functions automatically from the data, it shows a single result for each dataset in Figure 4. EigenGP achieves the lowest prediction error while using less training time. Compared with EigenGP, which is based on sequential optimization, EigenGP* achieves similar errors but takes longer, because the joint optimization is more expensive.

7 Conclusions

In this paper we have presented a simple yet effective sparse Gaussian process method, EigenGP, and applied it to regression. Despite its similarity to the Nyström method, EigenGP improves the prediction quality by several orders of magnitude. EigenGP can easily be extended to conduct online learning, either by using stochastic gradient descent to update the weights of the eigenfunctions or by applying the online variational Bayes idea for GPs [Hensman et al., 2013].

Acknowledgments

This work was supported by NSF ECCS-93, NSF CAREER award IIS-593, and the Center for Science of Information, an NSF Science and Technology Center, under grant agreement CCF.

Figure 4: NMSE and MNLP vs. training time on (a) California Housing, (b) PPPTS, and (c) Pole Telecomm. Each method (except RVMs) has five results, one for each of the five values of M. In (c), the fifth result of one of the methods lies outside the plotted range.

References

[Cressie and Johannesson, 2008] Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209-226, February 2008.

[Csató and Opper, 2002] Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14:641-668, March 2002.

[de Leeuw, 2007] Jan de Leeuw. Derivatives of generalized eigen systems with applications. In Department of Statistics Papers. Department of Statistics, UCLA, 2007.

[Faul and Tipping, 2001] Anita C. Faul and Michael E. Tipping. Analysis of sparse Bayesian learning. In Advances in Neural Information Processing Systems 14. MIT Press, 2001.

[Hensman et al., 2013] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 282-290, 2013.

[Higdon, 2002] Dave Higdon. Space and space-time modeling using process convolutions. In Clive W. Anderson, Vic Barnett, Philip C. Chatwin, and Abdel H. El-Shaarawi, editors, Quantitative Methods for Current Environmental Issues. Springer Verlag, 2002.

[Hoegaerts et al., 2005] Luc Hoegaerts, Johan A. K. Suykens, Joos Vandewalle, and Bart De Moor. Subset based least squares subspace regression in RKHS. Neurocomputing, 63:293-323, January 2005.

[Lázaro-Gredilla et al., 2010] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl E. Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865-1881, 2010.

[Lichman, 2013] Moshe Lichman. UCI Machine Learning Repository, 2013.

[MacKay, 1992] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

[Marzouk and Najm, 2009] Youssef M. Marzouk and Habib N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862-1902, 2009.

[Minka, 2000] Thomas P. Minka. Old and new matrix algebra useful for statistics. Technical report, MIT Media Lab, 2000.

[Pace and Barry, 1997] R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics and Probability Letters, 33(3):291-297, 1997.

[Paciorek and Schervish, 2004] Christopher J. Paciorek and Mark J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.

[Qi et al., 2010] Yuan Qi, Ahmed H. Abdel-Gawad, and Thomas P. Minka. Sparse-posterior Gaussian processes for general likelihoods. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

[Quiñonero-Candela and Rasmussen, 2005] Joaquin Quiñonero-Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.

[Snelson and Ghahramani, 2006] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.

[Tipping, 2000] Michael E. Tipping. The relevance vector machine. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.

[Titsias, 2009] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In The 12th International Conference on Artificial Intelligence and Statistics, pages 567-574, 2009.

[Williams and Barber, 1998] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.

[Williams and Seeger, 2001] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

[Yan and Qi, 2010] Feng Yan and Yuan Qi. Sparse Gaussian process regression via l1 penalization. In Proceedings of the 27th International Conference on Machine Learning, pages 1183-1190, 2010.

[Zhang et al., 2008] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.


More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos http://www.gaussianprocess.org/ 2 Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guesterin

More information

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge Bayesian Quadrature: Model-based Approimate Integration David Duvenaud University of Cambridge The Quadrature Problem ˆ We want to estimate an integral Z = f ()p()d ˆ Most computational problems in inference

More information

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS Oloyede I. Department of Statistics, University of Ilorin, Ilorin, Nigeria Corresponding Author: Oloyede I.,

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

CS 7140: Advanced Machine Learning

CS 7140: Advanced Machine Learning Instructor CS 714: Advanced Machine Learning Lecture 3: Gaussian Processes (17 Jan, 218) Jan-Willem van de Meent (j.vandemeent@northeastern.edu) Scribes Mo Han (han.m@husky.neu.edu) Guillem Reus Muns (reusmuns.g@husky.neu.edu)

More information

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA. Sistemi di Elaborazione dell Informazione. Regressione. Ruggero Donida Labati

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA. Sistemi di Elaborazione dell Informazione. Regressione. Ruggero Donida Labati SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA Sistemi di Elaborazione dell Informazione Regressione Ruggero Donida Labati Dipartimento di Informatica via Bramante 65, 26013 Crema (CR), Italy http://homes.di.unimi.it/donida

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan 1, M. Koval and P. Parashar 1 Applications of Gaussian

More information

Part 1: Expectation Propagation

Part 1: Expectation Propagation Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

Mathematics 309 Conic sections and their applicationsn. Chapter 2. Quadric figures. ai,j x i x j + b i x i + c =0. 1. Coordinate changes

Mathematics 309 Conic sections and their applicationsn. Chapter 2. Quadric figures. ai,j x i x j + b i x i + c =0. 1. Coordinate changes Mathematics 309 Conic sections and their applicationsn Chapter 2. Quadric figures In this chapter want to outline quickl how to decide what figure associated in 2D and 3D to quadratic equations look like.

More information

Te Whare Wananga o te Upoko o te Ika a Maui. Computer Science

Te Whare Wananga o te Upoko o te Ika a Maui. Computer Science VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui School of Mathematical and Computing Sciences Computer Science Multiple Output Gaussian Process Regression Phillip Boyle and

More information

Applications of Proper Orthogonal Decomposition for Inviscid Transonic Aerodynamics

Applications of Proper Orthogonal Decomposition for Inviscid Transonic Aerodynamics Applications of Proper Orthogonal Decomposition for Inviscid Transonic Aerodnamics Bui-Thanh Tan, Karen Willco and Murali Damodaran Abstract Two etensions to the proper orthogonal decomposition (POD) technique

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University

More information