A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets


Arman Melkumyan
Australian Centre for Field Robotics
The University of Sydney
NSW 2006, Australia

Fabio Ramos
Australian Centre for Field Robotics
The University of Sydney
NSW 2006, Australia

Abstract

Despite the success of Gaussian processes (GPs) in modelling spatial stochastic processes, dealing with large datasets is still challenging. The problem arises from the need to invert a potentially large covariance matrix during inference. In this paper we address the complexity problem by constructing a new stationary covariance function (Mercer kernel) that naturally provides a sparse covariance matrix. The sparseness of the matrix is defined by hyper-parameters optimised during learning. The new covariance function enables exact GP inference and performs comparably to the squared exponential, at a lower computational cost. This allows the application of GPs to large-scale problems such as ore grade prediction in mining or 3D surface modelling. Experiments show that, using the proposed covariance function, very sparse covariance matrices are normally obtained, which can be effectively used for faster inference and less memory usage.

1 Introduction

Gaussian processes (GPs) are a useful and powerful tool for regression in supervised machine learning [Rasmussen and Williams, 2006]. The range of applications includes geophysics, mining, hydrology, reservoir engineering and robotics. Despite their increasing popularity, modelling large-scale spatial stochastic processes is still challenging. The difficulty comes from the fact that inference in GPs is usually computationally expensive due to the need to invert a potentially large covariance matrix at inference time, which has O(N^3) cost. For problems with thousands of observations, exact inference in normal GPs is intractable and approximation algorithms are required. Most of the approximation algorithms employ a subset of points to approximate the posterior distribution at a new point given the training data and hyper-parameters. These approximations rely on heuristics to select the subset of points [Lawrence et al., 2003; Seeger et al., 2003], or use pseudo targets obtained during the optimisation of the log marginal likelihood of the model [Snelson and Ghahramani, 2006].

In this work, we address the complexity problem differently. Instead of relying on sparse GP approximations, we propose a new covariance function which provides intrinsically sparse covariance matrices. This allows exact inference in GPs using conventional methods. As the new sparse covariance function can be multiplied by any other valid covariance function and the result is still a sparse covariance matrix, a lot of flexibility is given to practitioners to accurately model their problems while preserving the sparseness properties. We call the GPs constructed using our sparse covariance function Exact Sparse Gaussian Processes (ESGPs). The main idea behind them is the formulation of a valid and smooth covariance function whose output equals zero whenever the distance between input observations is larger than a hyper-parameter. As with other hyper-parameters, this one can be estimated by maximising the marginal likelihood, to better model the properties of the data such as smoothness, characteristic length-scale, and noise. Additionally, the proposed covariance function much resembles the popular squared exponential in terms of smoothness, being four times continuously differentiable. We empirically compare ESGPs with local approximation techniques and demonstrate how other covariance functions can be integrated in the same framework.
Our method results in very sparse covariance matrices (up to 90% of the elements are zeros in in-ground grade estimation problems), which require significantly less memory while providing similar performance.

This paper is organised as follows. In Section 2 we review the basics of GP regression and introduce notation. Section 3 summarises previous work on approximate inference with GPs. Section 4 presents our new intrinsically sparse covariance function and its main properties. We evaluate the framework, providing experimental results on both artificial and real data, in Section 6. Finally, Section 7 concludes the paper and discusses further developments.

2 Gaussian Processes

In this section we briefly review Gaussian processes for regression and introduce notation. We consider the supervised learning problem where, given a training set D = {x_i, y_i}_{i=1}^N consisting of N input points x_i ∈ R^D and corresponding outputs y_i ∈ R, the objective is to compute the predictive distribution of f(x_*) at a new test point x_*.

A Gaussian process model places a multivariate Gaussian distribution over the space of function variables f(x) mapping the input to the output space. The model is specified by defining a mean function m(x) and a covariance function k(x, x'), resulting in the Gaussian process written as f(x) ~ GP(m(x), k(x, x')). Denoting groups of these points as X, f, y = {x_i}, {f_i}, {y_i}_{i=1}^N for the training set and X_*, f_*, y_* = {x_{*i}}, {f_{*i}}, {y_{*i}}_{i=1}^{N_*} for the testing points, the joint Gaussian distribution with m(x) = 0 is

[f; f_*] ~ N( 0, [K(X, X), K(X, X_*); K(X_*, X), K(X_*, X_*)] ),   (1)

where N(μ, Σ) is a multivariate Gaussian distribution with mean μ and covariance Σ, and K is used to denote the covariance matrix computed between all points in the sets. If we assume observations with Gaussian noise ε of variance σ^2, such that y = f(x) + ε, the joint distribution becomes

[y; f_*] ~ N( 0, [K(X, X) + σ^2 I, K(X, X_*); K(X_*, X), K(X_*, X_*)] ).   (2)

A popular choice for the covariance function is the squared exponential, used in this paper for comparisons in the experiment section:

k(x, x') = σ_f^2 exp( −(1/2) (x − x')^T Λ^{-1} (x − x') ),   (3)

with Λ = diag(l)^2, where l is a vector of positive numbers representing the length-scales in each dimension.

2.1 Inference for New Points

By conditioning on the observed training points, the predictive distribution can be obtained as

p(f_* | X_*, X, y) = N(μ_*, Σ_*),   (4)

where

μ_* = K(X_*, X) [K(X, X) + σ^2 I]^{-1} y,
Σ_* = K(X_*, X_*) − K(X_*, X) [K(X, X) + σ^2 I]^{-1} K(X, X_*) + σ^2 I.   (5)

From Equation (5) it can be observed that the predictive mean is a linear combination of N kernel functions, each centred on a training point: μ_* = Σ_{i=1}^N α_i k(x_i, x_*), where α = [K(X, X) + σ^2 I]^{-1} y. A GP is also a best unbiased linear estimator [Cressie, 1993; Kitanidis, 1997] in the mean squared error sense. During inference, most of the computational cost arises from computing the inversion in Equation (5), which is O(N^3) if implemented naively.

2.2 Learning Hyper-Parameters

Commonly, the covariance function k(x, x') is parametrised by a set of hyper-parameters θ, and we write k(x, x'; θ). These parameters allow for more flexibility in modelling the properties of the data. Thus, learning a GP model is equivalent to determining the hyper-parameters of the covariance function from some training dataset. In a Bayesian framework this can be performed by maximising the log of the marginal likelihood w.r.t. θ:

log p(y | X, θ) = −(1/2) y^T K_y^{-1} y − (1/2) log |K_y| − (N/2) log 2π,   (6)

where K_y = K(X, X) + σ^2 I is the covariance matrix for the targets y. The marginal likelihood has three terms; from left to right, the first accounts for the data fit, the second is a complexity penalty term encoding the Occam's razor principle, and the last is a normalisation constant. Eq. (6) is a non-convex function of the hyper-parameters θ and therefore only local maxima can be obtained. In practice this is not a major issue, since good local maxima can be obtained with gradient-based techniques using multiple starting points. However, this requires the computation of partial derivatives, resulting in

∂ log p(y | X, θ) / ∂θ_j = (1/2) y^T K^{-1} (∂K/∂θ_j) K^{-1} y − (1/2) tr( K^{-1} ∂K/∂θ_j ).   (7)

Note that this expression requires the computation of partial derivatives of the covariance function w.r.t. θ.
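As a concrete illustration of Eqs. (4)-(7), the sketch below computes the predictive distribution and the log marginal likelihood for a generic covariance function using a standard Cholesky-based implementation. This is not the authors' code; the function names (gp_predict, log_marginal) and the use of NumPy/SciPy are our own choices, and the cost is the O(N^3) of a naive dense implementation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, y, X_star, kernel, noise_var):
    """Predictive mean and covariance of Eqs. (4)-(5) for a generic kernel(A, B)."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    L = cho_factor(K, lower=True)
    alpha = cho_solve(L, y)                        # alpha = (K + sigma^2 I)^{-1} y
    K_s = kernel(X_star, X)
    mu = K_s @ alpha                               # predictive mean, Eq. (5)
    v = cho_solve(L, K_s.T)
    cov = kernel(X_star, X_star) - K_s @ v + noise_var * np.eye(len(X_star))
    return mu, cov

def log_marginal(X, y, kernel, noise_var):
    """Log marginal likelihood of Eq. (6), to be maximised w.r.t. the hyper-parameters."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    L = cho_factor(K, lower=True)
    alpha = cho_solve(L, y)
    log_det = 2.0 * np.sum(np.log(np.diag(L[0])))  # log |K_y| from the Cholesky factor
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * len(X) * np.log(2.0 * np.pi)
```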
3 Related Work

Recently, several methods have been proposed to tackle the problem of GP inference in large datasets. However, most of these approaches rely on approximation techniques. A common and simple procedure is to select a subset of data points and perform inference using only these points. This is equivalent to ignoring part of the data, which makes the selection of the subset very important. In [Lawrence et al., 2003], the selection of points is based on the differential entropy. Similarly, [Seeger et al., 2003] suggest the use of another information-theoretic quantity, the information gain.

Another interesting procedure is to select a subset of data points to act as an inducing set and project this set onto all the data points available. This is known as sparse GP approximation [Williams and Seeger, 2001; Smola and Bartlett, 2001; Candela and Rasmussen, 2005], and it usually performs better than simply selecting a subset of data points. However, the definition of the inducing set is difficult and can involve non-convex optimisations [Snelson and Ghahramani, 2006].

Local methods have been applied in geostatistics for a long time [Wackernagel, 2003]. The idea is to perform inference by evaluating the covariance function only at points in the neighbourhood of a query point. This method can be effective, but the definition of the neighbourhood is crucial. Our method is inspired by this approach, but rather than defining the neighbourhood manually, we obtain it automatically during learning. An interesting idea combining local and global methods such as the sparse Gaussian process was proposed in [Snelson and Ghahramani, 2007]. We compare our method to theirs in Section 6.

This work differs from other studies by not addressing the GP inference problem through an approximation technique. Rather, it proposes a new covariance function that naturally generates sparse covariance matrices.

This idea was used in [Wendland, 2005] with piecewise polynomials, but extension to multiple dimensions is difficult due to the need to guarantee positive definiteness. A similar formulation to ours was proposed in [Storkey, 1999]; however, there is no hyper-parameter learning and the main properties are not analysed. To the best of our knowledge, this work is the first to demonstrate with real examples how the complexity problem can be addressed through the construction of a new sparse covariance function allowing for exact GP inference in large datasets.

4 Exact Sparse Gaussian Processes

For very large datasets, the inversion or even storage of a full matrix K(X, X) + σ^2 I can be prohibitive. In geology problems, for example, it is not uncommon to have datasets with 100K points or more. To deal with such large problems while still being able to perform exact inference in the GP model, we develop the covariance function below. First, note that the mean prediction in Eq. (5) can be rewritten as a linear combination of N evaluations of the covariance function, each one centred on a training point: μ_* = Σ_{i=1}^N α_i k(x_*, x_i), where α = [K(X, X) + σ^2 I]^{-1} y. To avoid the inversion of the full matrix, we can instead develop a covariance function whose output vanishes outside some region R, so that k(x_*, x_i) = 0 when x_i is outside a region R. In this way, only a subset of α would need to be computed, which effectively means that only a few columns of [K(X, X) + σ^2 I]^{-1} need to be computed, significantly reducing the computational and storage costs as R diminishes. As we shall see, the region can be specified automatically during learning.

4.1 Intrinsically Sparse Covariance Function

The covariance function we are looking for must vanish outside some finite region R for exact sparse GP regression. It must produce smooth curves, but it should not be infinitely differentiable, so that it remains applicable to problems with some discontinuities. For our derivation the basis function g(x) = cos^2(πx) H(0.5 − |x|) was chosen, which, due to cos^2(πx) = (cos(2πx) + 1)/2, is simply the cosine function shifted up, normalised and set to zero outside the interval x ∈ (−0.5, 0.5). The cosine function was selected as the basis function for the following reasons: (1) it is analytically well tractable; (2) integrals with finite limits containing combinations of its basic form can be calculated in closed form; and (3) the cosine function usually provides good approximations to different functions, being the core element of Fourier analysis. Here and afterwards H(·) represents the Heaviside unit step function.

As it stands, the chosen basis function g(x) is smooth on the whole real axis, vanishes outside the interval x ∈ (−0.5, 0.5), and has discontinuities in the second derivative. To derive a valid covariance function we conduct calculations analogous to those presented in [Rasmussen and Williams, 2006]. Using the transfer function h(x; u) = g(x − u), the following 1D covariance function is obtained:

k(x, x') = σ ∫ h(x/l; u) h(x'/l; u) du.   (8)

Due to the chosen form of the basis functions, the integral in Eq. (8) can be analytically evaluated (see Appendix A for details), resulting in

k_1(x, x'; l, σ_0) = σ_0 [ (2 + cos(2π d/l))/3 · (1 − d/l) + (1/(2π)) sin(2π d/l) ]   if d < l,
k_1(x, x'; l, σ_0) = 0   if d ≥ l,   (9)

where σ_0 > 0 is a constant coefficient, l > 0 is a given length-scale and d is the distance between the points:

d = |x − x'|.   (10)
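As a concrete illustration of Eq. (9), the snippet below evaluates the 1D sparse covariance function together with the squared exponential it is compared against in Figure 1. This is our own sketch (the function names sparse_cov_1d and sq_exp are not from the paper), with σ_0 = 1 by default.

```python
import numpy as np

def sparse_cov_1d(x, x_prime, length_scale, sigma0=1.0):
    """Sparse covariance of Eq. (9): exactly zero whenever |x - x'| >= l."""
    d = np.abs(x - x_prime) / length_scale
    k = sigma0 * ((2.0 + np.cos(2.0 * np.pi * d)) / 3.0 * (1.0 - d)
                  + np.sin(2.0 * np.pi * d) / (2.0 * np.pi))
    return np.where(d < 1.0, k, 0.0)

def sq_exp(x, x_prime, length_scale, sigma_f=1.0):
    """Squared exponential of Eq. (3) in one dimension, for comparison."""
    return sigma_f**2 * np.exp(-0.5 * ((x - x_prime) / length_scale) ** 2)
```

Evaluating sparse_cov_1d on every pair of inputs gives the covariance matrix directly; pairs further apart than l contribute exact zeros, which is what makes the resulting matrix sparse.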
From Eq. (8) it follows that for any points x_i and any real numbers a_i, where i = 1, 2, ..., n, the inequality

Σ_{i,j=1}^n a_i a_j k(x_i, x_j) = σ ∫ ( Σ_{i=1}^n a_i h(x_i/l; u) )^2 du ≥ 0

holds, so that the constructed covariance function is positive semi-definite. Based on Eqs. (9)-(10) we calculate that

∂k_1/∂d |_{d=0} = ∂^3 k_1/∂d^3 |_{d=0} = 0,   while   ∂^5 k_1/∂d^5 |_{d→0+} ≠ 0,   (11)

which shows that the covariance function is continuous and has continuous derivatives up to the fourth order at d = 0, with a discontinuity appearing in the fifth derivative. The function k_1(Δx; l, 1) is compared with the squared exponential in Figure 1. Note that it follows the squared exponential covariance function closely but vanishes when |Δx| ≥ l.

Figure 1: Plot of the covariance function output for different values of Δx (the sparse covariance with l = 1, 2, 3 and the squared exponential with l = 0.75).

4.2 Extending to Multiple Dimensions

This covariance function can be extended to multiple dimensions in the following ways:

1. Using direct products over all axes:

k_1(x, x'; l, σ_0) = σ_0 Π_{i=1}^D k(x_i, x'_i; l_i, 1),   (12)

where D is the dimensionality of the points and l is the vector of the characteristic lengths, l = (l_1, l_2, ..., l_D)^T.

2. Using the Mahalanobis distance:

k_2(r; σ_0, Ω) = σ_0 [ (2 + cos(2πr))/3 · (1 − r) + (1/(2π)) sin(2πr) ]   if r < 1,
k_2(r; σ_0, Ω) = 0   if r ≥ 1,   (13)

where σ_0 > 0, Ω is positive semi-definite and

r = sqrt( (x − x')^T Ω (x − x') ).   (14)

After this point we will frequently use the short notation k_M for the function k_2(r; σ_0, Ω).

4.3 Important Properties of the New Covariance Function

The developed multi-dimensional covariance function, in both forms k_1 and k_2, has the following remarkable properties:

1. It vanishes outside of a finite region R:

R_1 = { r ∈ R^D : k_1(r) ≠ 0 },   (15)
R_2 = { r ∈ R^D : k_2(r) ≠ 0 }.   (16)

The sizes of the regions R_1 and R_2 can be controlled via the characteristic lengths l_i. Moreover, these sizes can be learnt from data by maximising the marginal likelihood, as is common in the GP framework.

2. All the derivatives up to and including the fourth order are continuous, which guarantees mean square differentiability of the sample curves in GPs up to the corresponding order. There are discontinuities in the fifth order gradient.

3. The region R_1 is a D-dimensional rectangle and the region R_2 is a D-dimensional ellipsoid.

4. In the one-dimensional case, k_1 and k_2 become identical.

5. The covariance function is anisotropic, i.e. it has different inner properties in different dimensions.

6. The covariance function leads to sparse covariance matrices and allows GP inference in large datasets without the need for approximations.

5 Partial Derivatives for Learning

Learning the GP requires the computation of the partial derivatives of the covariance function w.r.t. the hyper-parameters (Eq. 7). Based on Eqs. (9) and (12), the following expressions for the partial derivatives of k_1(x, x'; l, σ_0) can be calculated:

∂k_1/∂σ_0 = (1/σ_0) k_1,   (17)

∂k_1(x_i, x'_i; l_i, σ_0) / ∂l_i = (4σ_0/3) [ π (1 − d_i/l_i) cos(π d_i/l_i) + sin(π d_i/l_i) ] sin(π d_i/l_i) · d_i/l_i^2,   (18)

where i = 1, 2, ..., D. For the second case, if Ω is diagonal and positive definite, it can be expressed via the characteristic lengths as follows:

Ω = diag( 1/l_1^2, 1/l_2^2, ..., 1/l_D^2 ).   (19)

From Eqs. (14) and (19) it follows that

r = sqrt( Σ_{k=1}^D ((x_k − x'_k)/l_k)^2 ).   (20)

Based on Eqs. (13) and (19)-(20), the following gradient components of the multi-dimensional covariance function k_M can be obtained:

∂k_M/∂σ_0 = (2 + cos(2πr))/3 · (1 − r) + (1/(2π)) sin(2πr),   if r < 1,   (21)

∂k_M/∂l_j = (4σ_0 / (3 l_j^3)) [ π (1 − r) cos(πr) + sin(πr) ] sin(πr) (1/r) (x_j − x'_j)^2,   if 0 < r < 1,   (22)

grad k_M = 0,   if r ≥ 1.   (23)

In Eq. (22), r is in the denominator, so direct calculations cannot be carried out using Eq. (22) when r = 0. However, using the equality

lim_{r→0} sin(μr)/r = μ,   (24)

one can directly show that

lim_{r→0} ∂k_M(r; σ_0, Ω)/∂l_j = lim_{r→0} (4σ_0 / (3 l_j^3)) π^2 (x_j − x'_j)^2 = 0.   (25)

Based on Eq. (25) it must be taken directly that

∂k_M/∂l_j |_{r=0} = 0,   j = 1, 2, ..., D.   (26)

Eqs. (21)-(23) and (26) fully define the gradient of the new covariance function k_M at every point and can be directly used in the learning procedure.
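A minimal sketch of the Mahalanobis form of Eqs. (13)-(14) with a diagonal Ω as in Eq. (19), together with its length-scale gradient from Eqs. (22)-(23) and (26), is given below. It is our own illustration rather than the paper's Matlab implementation; in particular, the special-case handling at r = 0 implements Eq. (26).

```python
import numpy as np

def sparse_cov(x, x_prime, lengths, sigma0=1.0):
    """k_M of Eqs. (13)-(14) with diagonal Omega = diag(1/l_1^2, ..., 1/l_D^2)."""
    r = np.sqrt(np.sum(((np.asarray(x) - np.asarray(x_prime)) / np.asarray(lengths)) ** 2))
    if r >= 1.0:
        return 0.0
    return sigma0 * ((2.0 + np.cos(2.0 * np.pi * r)) / 3.0 * (1.0 - r)
                     + np.sin(2.0 * np.pi * r) / (2.0 * np.pi))

def sparse_cov_grad_lengths(x, x_prime, lengths, sigma0=1.0):
    """Gradient of k_M w.r.t. the length-scales l_j, Eqs. (22)-(23) and (26)."""
    lengths = np.asarray(lengths, dtype=float)
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    r = np.sqrt(np.sum((diff / lengths) ** 2))
    if r >= 1.0:
        return np.zeros_like(lengths)         # Eq. (23): outside the support everything vanishes
    if r == 0.0:
        return np.zeros_like(lengths)         # Eq. (26): limit of Eq. (22) as r -> 0
    common = (4.0 * sigma0 / 3.0) * (np.pi * (1.0 - r) * np.cos(np.pi * r)
                                     + np.sin(np.pi * r)) * np.sin(np.pi * r) / r
    return common * diff ** 2 / lengths ** 3  # Eq. (22), one component per dimension
```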

6 Experiments

This section provides empirical comparisons between the exact sparse GP, a conventional GP with the squared exponential covariance function, and approximation procedures.

6.1 Artificial Dataset

In this experiment we compare the exact GP with the proposed covariance function against the approach proposed in [Snelson and Ghahramani, 2007]. The data is essentially the same as presented in the experiment section of [Snelson and Ghahramani, 2007]. As can be observed in Figure 2, the sparse covariance function provides a much smoother prediction of the underlying function than the combination of local and global approximations. This example shows qualitatively that in some situations approximation methods can lead to discontinuities. The same does not occur with exact sparse GP inference.

Figure 2: Comparison between the exact sparse GP and the local and global approximation. FITC stands for fully independent training conditional approximation; details can be found in [Snelson and Ghahramani, 2007]. Note that the exact sparse GP provides a much smoother curve.

6.2 Rainfall Dataset

In this experiment we compare the exact sparse GP with the exact GP using the squared exponential covariance function, and with the covariance function obtained by multiplying the two. The dataset used is a popular dataset in geostatistics for comparing inference procedures, known as the Spatial Interpolation Comparison (SIC) dataset [Dubois et al., 2003]. The dataset consists of 467 points measuring rainfall in 2D space. We divide these points into two sets, inference and testing; the inference set contains the points used to perform inference on the testing points. For each case the experiment is repeated 15 times with randomly selected inference and testing sets.

Figure 3 shows the normalised squared error for the different covariance functions, with the standard deviation (one sigma for each part of the bar), as a function of the number of inference points. As the number of inference points increases, so does the size of the covariance matrix. The results demonstrate that very similar errors are obtained for the different covariances. However, the sparse covariance function produces sparse matrices, thus requiring far fewer floating point operations. Figure 4 shows the percentage of zeros in the covariance matrix as a function of the number of inference points. As can be observed, the percentage of zeros grows quickly as more inference points are added and reaches its limit at around 100 inference points. Although the percentage of zeros reaches its limit, the number of zeros in the covariance matrix continues to increase, because the size of the covariance matrix grows with the number of inference points. Also worth noticing is the performance of the multiplication of the two covariance functions: the error is essentially the same as for the sparse covariance function alone, but the percentage of zeros is significantly smaller. This example demonstrates the benefits of the proposed approach in reducing storage and the number of operations for similar accuracy.

Figure 3: Normalised Mean Square Error for the SIC dataset. The error is essentially the same for both covariance functions, with the exact sparse GP performing slightly worse with fewer inference points but similarly with more inference points, at a much lower computational cost.

Figure 4: Percentage of zeros in the covariance matrix as a function of the number of inference points.

6.3 Iron Ore Dataset

In this dataset the goal is to estimate iron ore grade in 3D space over a region of 2.5 cubic kilometres. The dataset is from an open-pit iron mine in Western Australia. About 17K samples were collected, with the iron concentration measured with X-ray systems. We divide the 17K dataset points into inference and testing sets. The inference set is taken arbitrarily from the dataset, and from the remaining points the testing points are arbitrarily chosen. The experiments are repeated 5 times.
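The quantities reported in Figures 3-6 are the normalised (mean) squared error on the test set and the percentage of zeros in the covariance matrix. A minimal way to compute them is sketched below; the normalisation by the target variance is our assumption, as the paper does not spell out its exact definition.

```python
import numpy as np

def normalised_mse(y_true, y_pred):
    """Mean squared error divided by the variance of the test targets (assumed normalisation)."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def percentage_of_zeros(K):
    """Percentage of covariance matrix entries that are exactly zero (Figures 4 and 6)."""
    return 100.0 * np.mean(K == 0.0)
```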
Figure 5 shows the normalised mean squared error and the standard deviation (one sigma for each part of the bar) for the squared exponential and sparse covariance functions and their product. The results demonstrate that all three lead to similar errors, with the sparse covariance function performing slightly better as the number of inference points increases. Figure 6 shows that, although they result in similar errors, the sparse covariance function and the product lead to about 48% of zeros in the covariance matrix, which is 120K to 12M cells exactly equal to zero as the number of inference points varies from 500 to 5000. This example demonstrates that the proposed method provides greater savings for bigger datasets.

Figure 5: Normalised Mean Square Error for the iron ore dataset. The performance of the sparse covariance function is equivalent to the squared exponential. Due to the computational cost of the squared exponential, we stop the experiment at 5000 inference points, although the sparse covariance function could accommodate all the 17K points.

Figure 6: Percentage of zeros for the iron ore grade estimation problem.

6.4 Speed Comparison

This experiment demonstrates the computational gains of using the proposed sparse covariance function. Synthetic datasets containing 1000, 2000, 3000 and 4000 points were generated by sampling a polynomial function with white noise.
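To illustrate where the computational gains come from, the sketch below stores the covariance matrix in a sparse format and uses a sparse solver for the exact predictive mean of Eq. (5). It is our own illustration using scipy.sparse (the paper's experiments used Matlab's sparse matrix implementation), and for brevity it still evaluates every pair of points; a practical implementation would only evaluate pairs closer than the support of the covariance function.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def sparse_gp_mean(X, y, X_star, cov_fn, noise_var):
    """Exact predictive mean of Eq. (5) with K stored as a sparse matrix.

    cov_fn(a, b) must return exact zeros beyond its support (e.g. the sparse
    covariance of Eq. (9)); with the squared exponential, K would be dense
    and there would be no gain.
    """
    N = len(X)
    K = sp.lil_matrix((N, N))
    for i in range(N):
        for j in range(N):
            v = cov_fn(X[i], X[j])
            if v != 0.0:
                K[i, j] = v
    K = K.tocsc() + noise_var * sp.eye(N, format="csc")
    alpha = spsolve(K, y)                    # sparse solve of (K + sigma^2 I) alpha = y
    K_star = np.array([[cov_fn(xs, xi) for xi in X] for xs in X_star])
    return K_star @ alpha
```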

We compare the speed of a GP with the sparse covariance function to a GP with the squared exponential covariance function for different length-scales and the corresponding number of non-zero elements in the covariance matrix. The results are presented in Figure 7. The code is implemented in Matlab and uses the sparse matrix implementation package provided. Further gains could be obtained with more efficient sparse matrix packages. As the number of points in the datasets increases, the speed-up becomes more evident. With 4000 points, the sparse covariance function is faster than the squared exponential for up to 70% of non-zero elements in the covariance matrix. After this point, the computational cost of the sparse matrix implementation becomes dominant. As the sparse covariance function in general provides much sparser covariance matrices, speed gains can be quite substantial in addition to storage gains.

7 Conclusions

This paper proposed a new covariance function, constructed upon the cosine function for analytical tractability, that naturally provides sparse covariance matrices. The sparseness is controlled by a hyper-parameter that can be learnt from data. The sparse covariance function enables exact inference in GPs even for large datasets, providing both storage and computational benefits. Although the main focus of this paper was on GPs, it is important to emphasise that the proposed covariance function is also a Mercer kernel and can therefore be applied in kernel machines such as support vector machines, kernel principal component analysis and others [Schölkopf and Smola, 2002]. The use of the sparse covariance function in other kernel methods is the objective of our future work.

Acknowledgements

We thank Edward Snelson for providing the pictures for the comparison with the local and global sparse Gaussian process approximation. This work has been supported by the Rio Tinto Centre for Mine Automation and the ARC Centre of Excellence programme, funded by the Australian Research Council (ARC) and the New South Wales State Government.

Figure 7: Normalised computational time versus the percentage of non-zero elements for the sparse covariance function in datasets of different sizes (1000, 2000, 3000 and 4000 points). The performance of the non-sparse squared exponential covariance function is also included for comparison. Note that the computational gains increase with the size of the datasets.

References

[Candela and Rasmussen, 2005] J. Quiñonero Candela and C. E. Rasmussen. A unifying view of sparse Gaussian process regression. Journal of Machine Learning Research, 6, 2005.

[Cressie, 1993] N. Cressie. Statistics for Spatial Data. Wiley, 1993.

[Dubois et al., 2003] G. Dubois, J. Malczewski, and M. De Cort. Mapping radioactivity in the environment: spatial interpolation comparison. Office for Official Publications of the European Communities, Luxembourg, 2003.

[Kitanidis, 1997] P. K. Kitanidis. Introduction to Geostatistics: Applications in Hydrogeology. Cambridge University Press, 1997.

[Lawrence et al., 2003] N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003.

[Rasmussen and Williams, 2006] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[Schölkopf and Smola, 2002] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

[Seeger et al., 2003] M. Seeger, C. K. I. Williams, and N. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In AISTATS, 2003.

[Smola and Bartlett, 2001] A. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

[Snelson and Ghahramani, 2006] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.

[Snelson and Ghahramani, 2007] E. Snelson and Z. Ghahramani. Local and global sparse Gaussian process approximations. In AISTATS, 2007.

[Storkey, 1999] A. J. Storkey. Truncated covariance matrices and Toeplitz methods in Gaussian processes. In 9th International Conference on Artificial Neural Networks, 1999.

[Wackernagel, 2003] H. Wackernagel. Multivariate Geostatistics. Springer, 2003.

[Wendland, 2005] H. Wendland. Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2005.

[Williams and Seeger, 2001] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

A Detailed Derivation

The covariance function is constructed by evaluating the integral

k(x, x') = σ ∫ g(x/l − u) g(x'/l − u) du,   (27)

where

g(x) = cos^2(πx) H(0.5 − |x|)   (28)

and H(x) is the Heaviside unit step function. From Eq. (28) it follows that g(x) = 0 if |x| ≥ 0.5, so that from Eq. (27) we have

k(x, x') = 0   if |x − x'| ≥ l.   (29)

If |x − x'| < l then the integrand of Eq. (27) is non-zero only when u ∈ ( max(x, x')/l − 0.5, min(x, x')/l + 0.5 ), and therefore

k(x, x') = σ ∫_{max(x,x')/l − 0.5}^{min(x,x')/l + 0.5} cos^2(πx/l − πu) cos^2(πx'/l − πu) du.   (30)

Using the identities cos^2(x) = (cos(2x) + 1)/2 and 2 cos(x) cos(y) = cos(x − y) + cos(x + y), the indefinite integral of the integrand of Eq. (30) can be calculated analytically:

J(u) = ∫ cos^2(πx/l − πu) cos^2(πx'/l − πu) du
     = (2 + cos(2π(x − x')/l))/8 · u + (1/(4π)) cos(π(x − x')/l) sin(2πu − π(x + x')/l) + (1/(32π)) sin(4πu − 2π(x + x')/l).   (31)
From Eqs. (31) and (30) one has that, if |x − x'| < l, then

k(x, x') = σ [ J( min(x, x')/l + 1/2 ) − J( max(x, x')/l − 1/2 ) ],   (32)

which after algebraic manipulations becomes

k(x, x') = σ_0 [ (2 + cos(2π d/l))/3 · (1 − d/l) + (1/(2π)) sin(2π d/l) ],   (33)

where d = |x − x'| and σ_0 = 3σ/8. Finally, combining Eqs. (29) and (33), we obtain Eq. (9).
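The closed form can be checked symbolically. The short SymPy script below (our own verification, not part of the paper) evaluates the overlap integral of Eq. (30) for l = 1 and σ = 1 and compares it with Eq. (33), assuming without loss of generality x = 0 and x' = d with 0 ≤ d < 1.

```python
import sympy as sp

u = sp.symbols('u', real=True)
d = sp.symbols('d', real=True, nonnegative=True)

# Overlap integral of Eq. (30) with l = 1, sigma = 1, x = 0 and x' = d (0 <= d < 1):
# the supports of the two shifted basis functions overlap on (d - 1/2, 1/2).
integrand = sp.cos(sp.pi * u)**2 * sp.cos(sp.pi * (d - u))**2
k_integral = sp.integrate(integrand, (u, d - sp.Rational(1, 2), sp.Rational(1, 2)))

# Closed form of Eqs. (9)/(33) with sigma_0 = 3*sigma/8.
k_closed = sp.Rational(3, 8) * ((2 + sp.cos(2*sp.pi*d)) / 3 * (1 - d)
                                + sp.sin(2*sp.pi*d) / (2*sp.pi))

print(sp.simplify(k_integral - k_closed))                    # should simplify to 0
for val in (0, sp.Rational(1, 4), sp.Rational(9, 10)):
    print(val, sp.N((k_integral - k_closed).subs(d, val)))   # numerical spot checks, ~0
```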
