Hyunjik Ki and Yee Whye Teh Appendix A Bayesian Inforation Criterion (BIC) The BIC is a odel selection criterion that is the arginal likelihood with a odel coplexity penalty: BIC = log p(y ˆθ) 1 2 p log(n) for observations y, nuber of observations N, axiu likelihood estiate (MLE) of odel hyperparaeters ˆθ, nuber of hyperparaeters p. It is derived as an approxiation to the log odel evidence log p(y). B Copositional Kernel Search Algorith Algorith 2: Copositional Kernel Search Algorith Input: data x 1,..., x n R D, y 1,..., y n R,base kernel set B Output: k, the resulting kernel For each base kernel on each diension, fit GP to data (i.e. optiise hyperparas by ML-II) and set k to be kernel with highest BIC. for depth=1:t (either fix T or repeat until BIC no longer increases) do Fit GP to following kernels and set k to be the one with highest BIC: (1) All kernels of for k + B where B is any base kernel on any diension (2) All kernels of for k B where B is any base kernel on any diension (3) All kernels where a base kernel in k is replaced by another base kernel C D Base Kernels LIN(x, x ) = σ 2 (x l)(x l) ( SE(x, x ) = σ 2 exp (x x ) 2 ) 2l 2 ( PER(x, x ) = σ 2 exp 2 sin2 (π(x x )/p) ) l 2 Matrix Identities Lea 1 (Woodbury s Matrix Inversion Lea). (A + UBV ) 1 = A 1 A 1 U(B 1 + V A 1 U) 1 V A 1 So setting A = σ 2 I (Nyströ) or σ 2 I + diag(k ˆK) (FIC) or σ 2 I + blockdiag(k ˆK) (PIC), U = Φ T = V, B = I, we get: (A + Φ Φ) 1 = A 1 A 1 Φ (I + ΦA 1 Φ ) 1 ΦA 1 Lea 2 (Sylvester s Deterinant Theore). det(i + AB) = det(i + BA) A R n B R n Hence: det(σ 2 I + Φ Φ) = (σ 2 ) n det(i + σ 2 Φ Φ) = (σ 2 ) n det(i + σ 2 ΦΦ ) = (σ 2 ) n det(σ 2 I + ΦΦ ) E Proof of Proposition 1 Proof. If PCG converges, the upper bound for NIP is. We showed in Section 4.1 that the convergence happened in only a few iterations. Moreover Cortes et al [7] shows that the lower bound for NIP can be rather loose in general. So it suffices to prove that the upper bound for NLD is tighter than the lower bound for NLD. Let (λ i ) N, (ˆλ i ) N be the ordered eigenvalues of K + σ 2 I, ˆK + σ 2 I respectively. Since K ˆK is positive sei-definite (e.g. [3]), we have λ i ˆλ i 2σ 2 i (using the assuption in the proposition). Now the slack in the upper bound is: 1 2 log det( ˆK + σ 2 I) ( 1 2 log det(k + σ2 I)) = 1 2 (log λ i log ˆλ i ) Hence the slack in the lower bound is: 1 2 log det(k + σ2 I) [ 1 2 log det( ˆK + σ 2 I) 1 ] Tr(K ˆK) 2σ2 = 1 (log λ i log 2 ˆλ i ) + 1 2σ 2 (λ i ˆλ i ) Now by concavity and onotonicity of log, and since ˆλ 2σ 2, we have: 1 2 log λ i log ˆλ i λ i ˆλ i 1 2σ 2 (log λ i log ˆλ i ) 1 2σ 2 (log λ i log ˆλ i ) 1 2σ 2 (λ i ˆλ i ) 1 2 (λ i ˆλ i ) (log λ i log ˆλ i )
Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes F Convergence of hyperparaeters fro optiising lower bound to optial hyperparaeters Note fro Figure 7 that the hyperparaeters found by optiising the lower bound converges to the hyperparaeters found by the GP when optiising the arginal likelihood, giving epirical evidence for the second clai in Section 3.3. G Parallelising SKC Note that SKC can be parallelised across the rando hyperparaeter initialisations, and also across the kernels at each depth for coputing the BIC intervals. In fact, SKC is even ore parallelisable with the kernel buffer: say at a certain depth, we have two kernels reaining to be optiised and evaluated before we can ove onto the next depth. If the buffer size is 5, say, then we can in fact ove on to the next depth and grow the kernel search tree on the top 3 kernels of the buffer, without having to wait for the 2 kernel evaluations to be coplete. This saves a lot of coputation tie wasted by idle cores waiting for all kernel evaluations to finish before oving on to the next depth of the kernel search tree. H Optiisation Since we wish to use the learned kernels for interpretation, it is iportant to have the hyperparaeters lie in a sensible region after the optiisation. In other words, we wish to regularise the hyperparaeters during optiisation. For exaple, we want the SE kernel to learn a globally sooth function with local variation. When naïvely optiising the lower bound, soeties the length scale and the signal variance becoes very sall, so the SE kernel explains all the variation in the signal and ends up connecting the dots. We wish to avoid this type of behaviour. This can be achieved by giving priors to hyperparaeters and optiising the energy (log prior added to the log arginal likelihood) instead, as well as using sensible initialisations. Looking at Figure 8, we see that using a strong prior with a sensible rando initialisation (see Appendix I for details) gives a sensible soothly varying function, whereas for all the three other cases, we have the length scale and signal variance shrinking to sall values, causing the GP to overfit to the data. Note that the weak prior is the default prior used in the GPstuff software [39]. Careful initialisation of hyperparaeters and inducing points is also very iportant, and can have strong influence the resulting optia. It is sensible to have the optiised hyperparaeters of the parent kernel in the search tree be inherited and used to initialise the hyperparaeters of the child. The new hyperparaeters of the child ust be initialised with rando restarts, where the variance is sall enough to ensure that they lie in a sensible region, but large enough to explore a good portion of this region. As for the inducing points, we want to spread the out to capture both local and global structure. Trying both K-eans and a rando subset of training data, we conclude that they give siilar results and resort to a rando subset. Moreover we also have the option of learning the inducing points. However, this will be considerably ore costly and show little iproveent over fixing the, as we show in Section 4. Hence we do not learn the inducing points, but fix the to a given randoly chosen set. Hence for SKC, we use axiu a posteriori (MAP) estiates instead of MLE for the hyperparaeters to calculate the BIC, since the priors have a noticeable effect on the optiisation. This is justified [23] and has been used for exaple in [11], where they argue that using the MLE to estiate the BIC for Gaussian ixture odels can fail due to singularities and degeneracies. I Hyperparaeter initialisation and priors Z N (, 1), T N(σ 2, I) is a Gaussian with ean and variance σ 2 truncated at the interval I then renoralised. Signal noise σ 2 =.1 exp(z/2) p(log σ 2 ) = N (,.2) LIN σ 2 = exp(v ) where V T N(1, [, ]), l = exp( Z 2 ) p(log σ 2 ) = logunif p(log l) = logunif SE l = exp(z/2), σ 2 =.1 exp(z/2) p(log l) = N (,.1), p(log σ 2 ) = logunif PER p in = log(1 ax(x) in(x) N ) (shortest possible period is 1 tie steps) p ax = log( ax(x) in(x) 5 ) (longest possible period is a fifth of the range of data set) l = exp(z/2), p = exp(p in + W ) or exp(p ax + U), σ 2 1 =.1 exp(z/2) w.p. 2 where W T N (.5, [, ]),
Hyunjik Ki and Yee Whye Teh Figure 7: Log arginal likelihood and hyperparaeter values after optiising the lower bound with ARD kernel on a subset of the Power plant data for different values of. This is copared against the GP values when optiising the true log arginal likelihood. Error bars show ean ± 1 standard deviation over 1 rando iterations. 1362 1361.5 W/ 2 1361 136.5 data strong prior, rand init weak prior, rand init strong prior, sall init weak prior, sall init 136 16 165 17 175 18 185 19 195 2 25 year Figure 8: GP predictions on solar data set with SE kernel for different priors and initialisations. U T N (.5, [, ])) p(log l) = t(µ =, σ 2 = 1, ν = 4), p(log p) = LN (p in.5,.25) or LN (p ax 2,.5) 1 w.p. 2 p(log σ 2 ) = logunif where LN (µ, σ 2 ) is log Gaussian, t(µ, σ 2, ν) is the student s t-distribution. J Coputation ties Look at Table 2. For all three data sets, the GP optiisation tie is uch greater than the su of the Var GP optiisation tie and the upper bound (NLD + NIP) evaluation tie for 8. Hence the savings in coputation tie for SKC is significant even for these sall data sets. Note that we show the lower bound optiisation tie against the upper bound evaluation tie instead of the evaluation ties for both, since this is what happens in SKC - the lower bound has to be optiised for each kernel, whereas the upper bound only has to be evaluated once. K Mauna and Solar plots and hyperparaeter values found by SKC Solar The solar data has 26 cycles over 285 years, which gives a periodicity of around 1.9615 years. Using SKC with = 4, we find the kernel: SE PER SE SE PER. The value of the period hyperparaeter in PER is 1.9569 years, hence SKC finds the periodicity to 3 s.f. with only 4 inducing points. The SE ter converts the global periodicity to local periodicity, with the extent of the locality governed by the length scale paraeter in SE, equal to 45. This is fairly large, but saller than the range of the doain (161-211), indicating that the periodicity spans over a long tie but isn t quite global. This is ost likely due to the static solar irradiance between the years 165-17, adding a bit of noise to the periodicities. Mauna The annual periodicity in the data and the linear trend with positive slope is clear. Linear regression gives us a slope of 1.5149. SKC with = 4 gives the kernel: SE + PER + LIN. The period hyperparaeter in PER takes value 1, hence SKC successfully finds the right periodicity. The offset l and agnitude σ 2 paraeters of LIN allow us to calculate the slope
Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes Table 2: Mean and standard deviation of coputation ties (in seconds) for full GP optiisation, Var GP optiisation(with and without learning inducing points), NLD and NIP (PCG using PIC preconditioner) upper bounds over 1 rando iterations. Solar Mauna Concrete GP 29.195 ± 5.143 164.8828 ± 58.7865 43.8233 ± 127.364 Var GP =1 7.259 ± 4.3928 6.117 ± 3.8267 5.4358 ±.7298 =2 8.3121 ± 5.4763 11.9245 ± 6.879 1.241 ± 2.519 =4 1.2263 ± 4.125 17.1479 ± 1.7898 19.6678 ± 4.3924 =8 9.6752 ± 6.5343 28.9876 ± 13.31 47.2225 ± 13.1955 =16 25.633 ± 8.7934 91.46 ± 39.849 158.9199 ± 18.1276 =32 76.3447 ± 2.3337 22.2369 ± 96.749 541.4835 ± 99.6145 NLD =1.19 ±.1.33 ±.2.113 ±.4 =2.26 ±.1.46 ±.2.166 ±.7 =4.43 ±.1.79 ±.3.286 ±.5 =8.84 ±.2.154 ±.4.554 ±.12 =16.188 ±.6.338 ±.7.1188 ±.3 =32.464 ±.32.789 ±.36.255 ±.74 NIP =1.474 ±.92.12 ±.296.2342 ±.26 =2.422 ±.13.1274 ±.674.1746 ±.45 =4.284 ±.75.846 ±.43.2345 ±.483 =8.199 ±.81.553 ±.25.2176 ±.376 =16.26 ±.53.432 ±.19.2136 ±.422 =32.25 ±.19.676 ±.668.2295 ±.433 Var GP, =1 23.4 ± 14.6 42. ± 33. 11. ± 32.5 learn IP =2 38.5 ± 17.5 62. ± 66. 7. ± 97. =4 124.7 ± 99. 32. ± 236. 37. ± 341.4 =8 268.6 ± 196.6 1935. ± 113. 666. ± 41. =16 1483.6 ± 773.8 148. ± 5991. 4786. ± 46.9 =32 2923.8 ± 1573.5 39789. ± 2387. 2596. ± 82.9 by the forula σ 2 (x l) [σ 2 (x l)(x l) + σ 2 ni] 1 y where σ 2 n is the noise variance in the learned likelihood. This forula is obtained fro the posterior ean of the GP, which is linear in the inputs for the linear kernel. This value aounts to 1.515, hence the slope found by SKC is accurate to 3 s.f. L Optiising the upper bound If the upper bound is tighter and ore robust with respect to choice of inducing points, why don t we optiise the upper bound to find hyperparaeters? If this were to be possible then we can axiise this to get an upper bound of the arginal likelihood with optial hyperparaeters. In fact this holds for any analytic upper bound whose value and gradients can be evaluated cheaply. Hence for any, we can find an interval that contains the true optiised arginal likelihood. So if this interval is doinated by an interval of another kernel, we can discard the kernel and there is no need to evaluate the bounds for bigger values of. Now we wish to use values of such that we can choose the right kernel (or kernels) at each depth of the search tree with inial coputation. This gives rise to an exploitation-exploration trade-off, whereby we want to balance between raising for tight intervals that allow us to discard unsuitable kernels whose intervals fall strictly below that of other kernels, and quickly oving on to the next depth in the search tree to search for finer structure in the data. The search algorith is highly parallelisable, and thus we ay raise siultaneously for all candidate kernels. At deeper levels of the search tree, there ay be too any candidates for siultaneous coputation, in which case we ay select the ones with the highest upper bound to get tighter intervals. Such attepts are listed below. Note the two inequalities for the NLD and NIP ters: 1 2 log det( ˆK + σ 2 I) 1 Tr(K ˆK) 2σ2 1 2 log det(k + σ2 I) 1 2 y ( ˆK + σ 2 I) 1 y 1 2 y (K+σ 2 I) 1 y 1 2 log det( ˆK + σ 2 I) (5) 1 2 y (K + (σ 2 + Tr(K ˆK))I) 1 y (6) Where the first two inequalities coe fro [3], the
Hyunjik Ki and Yee Whye Teh solar irradiance 1362 1361.8 1361.6 1361.4 1361.2 1361 136.8 136.6 to get the upper bound 1 2 log det( ˆK +σ 2 I) 1 2 y (K +(σ 2 +Tr(K ˆK))I) 1 y However we found that axiising this bound gives quite a loose upper bound unless = O(N). Hence this upper bound is not very useful. 136.4 136.2 M Rando Fourier Features CO2 level 41 4 39 38 37 36 35 34 33 32 136 16 165 17 175 18 185 19 195 2 25 year (a) Solar 31 195 196 197 198 199 2 21 22 year (b) Mauna Figure 9: Plots of sall tie series data: Solar and Mauna Rando Fourier Features (RFF) (a.k.a. Rando Kitchen Sinks) was introduced by [26] as a low rank approxiation to the kernel atrix. It uses the following theore Theore 3 (Bochner s Theore [29]). A stationary kernel k(d) is positive definite if and only if k(d) is the Fourier transfor of a non-negative easure. to give an unbiased low-rank approxiation to the Gra atrix K = E[Φ Φ] with Φ R N. A bigger lowers the variance of the estiate. Using this approxiation, one can copute deterinants and inverses in O(N 2 ) tie. In the context of kernel coposition in 2, RFFs have the nice property that saples fro the spectral density of the su or product of kernels can easily be obtained as sus or ixtures of saples of the individual kernels (see Appendix N). We use this later to give a eory-efficient upper bound on the log arginal likelihood in Appendix P. third inequality is a direct consequence of K ˆK being postive sei-definite, and the last inequality is fro Michalis Titsias lecture slides 3. Also fro 4, we have that 1 2 log det( ˆK + σ 2 I) + 1 2 α (K + σ 2 I)α α y is an upper bound α R N. Thus one idea of obtaining a cheap upper bound to the optiised arginal likelihood was to solve the following axiin optiisation proble: ax θ in 1 α R N 2 log det( ˆK+σ 2 I)+ 1 2 α (K+σ 2 I)α α y One way to solve this cheaply would be by coordinate descent, where one axiises with respect to θ fixing α, then iniises with respect to α fixing θ. However σ tends to blow up in practice. This is because the expression is O( log σ 2 + σ 2 ) for fixed α, hence axiising with respect to σ pushes it towards infinity. N Rando Features for Sus and Products of Kernels For RFF the kernel can be approxiated by the inner product of rando features given by saples fro its spectral density, in a Monte Carlo approxiation, as follows: k(x y) = e iv (x y) dp(v) R D p(v)e iv (x y) dv R D where φ(x) = = E p(v) [e iv x (e iv y ) ] = E p(v) [Re(e iv x (e iv y ) )] 1 Re(e iv x k (e iv y k ) ) k=1 = E b,v [φ(x) φ(y)] 2 (cos(v 1 x+b 1 ),..., cos(v x+b )) An alternative is to su the two upper bounds above with spectral frequencies v k iid saples fro p(v) and b k iid saples fro U[, 2π]. 3 http://www.aueb.gr/users/titsias/papers/titsiasnipsvar14.pdf Let k 1, k 2 be two stationary kernels, with respective
Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes spectral densities p 1, p 2 so that k 1 (d) = a 1 ˆp 1 (d), k 2 (d) = a 2 ˆp 2 (d), where ˆp(d) := p(v)e iv d dv. We use this convention as the Fourier R D transfor. Note a i = k i (). (k 1 + k 2 )(d) = a 1 p 1 (v)e iv d dv + a 2 p 2 (v)e iv d dv = (a 1 + a 2 ) p ˆ + (d) where p + (v) = a1 a 1+a 2 p 1 (v) + a2 a 1+a 2 p 2 (v), a ixture of p 1 and p 2. So to generate RFF for k 1 + k 2, generate v p + by generating v p 1 w.p. 1 a a 1+a 2 and v a p 2 w.p. 2 a 1+a 2 Now for the product, suppose (k 1 k 2 )(d) = a 1 a 2 ˆp 1 (d) ˆp 2 (d) = a 1 a 2 ˆp (d) Then p (d) is the inverse fourier transfor of ˆp 1 ˆp 2, which is the convolution p 1 p 2 (d) := p R D 1 (z)p 2 (d z)dz. So to generate RFF for k 1 k 2, generate v p by generating v 1 p 1, v 2 p 2 and setting v = v 1 + v 2. This is not applicable for non-stationary kernels, such as the linear kernel. We deal with this proble as follows: Suppose φ 1, φ 2 are rando features such that k 1 (x, x ) = φ 1 (x) φ 1 (x ), φ 2 (x) φ 2 (x ), φ i : R D R. It is straightforward to verify that P An upper bound to NLD using Rando Fourier Features Note that the function f(x) = log det(x) is convex on the set of positive definite atrices [6]. Hence by Jensen s inequality we have, for Φ Φ an unbiased estiate of K: 1 2 log det(k + σ2 I) = f(k + σ 2 I) = f(e[φ Φ + σ 2 I]) E[f(Φ Φ + σ 2 I)] Hence 1 2 log det(φ Φ + σ 2 I) is a stochastic upper bound to NLD that can be calculated in O(N 2 ). An exaple of such an unbiased estiator Φ is given by RFF. Q Further Plots (k 1 + k 2 )(x, x ) = φ + (x) φ + (x ) (k 1 k 2 )(x, x ) = φ (x) φ (x ) where φ + ( ) = (φ 1 ( ), φ 2 ( ) ) and φ ( ) = φ 1 ( ) φ 2 ( ). However we do not want the nuber of features to grow as we add or ultiply kernels, since it will grow exponentially. We want to keep it to be features. So we subsaple entries fro φ + (or φ ) and scale by factor 2 ( for φ ), which will still give us unbiased estiates of the kernel provided that each ter of the inner product φ + (x) φ + (x ) (or φ (x) φ (x )) is an unbiased estiate of (k 1 + k 2 )(x, x )(or (k 1 k 2 )(x, x )). This is how we generate rando features for linear kernels cobined with other stationary kernels, using the features φ(x) = σ (x,..., x). O Spectral Density for PER Fro [35], we have that the spectral density of the PER kernel is: I n (l 2 ( ) exp(l 2 ) δ v 2πn ) p n= where I is the odified Bessel function of the first kind.
Hyunjik Ki and Yee Whye Teh negative log ML energy 2 15 1 5 fullgp Var LB UB NLD 35 3 25 2 15 Nystro NLD UB RFF NLD UB NIP -5-15 -25-35 (a) Mauna: fix inducing points CG PCG Nys PCG FIC PCG PIC negative log energy ML 2 15 1 5 fullgp Var LB UB NLD 3 25 2 Nystro NLD UB RFF NLD UB NIP (b) Mauna: learn inducing points Figure 1: Sae as 1a and 1b but for Mauna Loa data. CG PCG Nys PCG FIC PCG PIC negative log ML energy -6-8 fullgp Var LB UB NLD 1 8 6 4 2 Nystro NLD UB RFF NLD UB NIP -35-45 -55 (a) Concrete: fix inducing points CG PCG Nys PCG FIC PCG PIC negative log ML energy -6-7 -8-9 fullgp Var LB UB NLD 11 1 9 8 7 6 5 4 3 Nystro NLD UB 2 RFF NLD UB NIP -35-45 CG PCG Nys -55 PCG FIC PCG PIC 1-6 (b) Concrete: learn inducing points Figure 11: Sae as 1a and 1b but for Concrete data. SE PER LIN (SE)*SE (SE)*PER (SE)+SE (PER)+SE (PER)+PER (PER)*PER (PER)+LIN (LIN)+LIN (LIN)+SE (PER)*LIN (LIN)*LIN (SE)*LIN Figure 12: Kernel search tree for SKC on solar data up to depth 2. We show the upper and lower bounds for different nubers of inducing points.
Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes Figure 13: Sae as Figure 12 but for Mauna data. Figure 14: Sae as Figure 12 but for concrete data and up to depth 1.