Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes
Hyunjik Kim and Yee Whye Teh

Appendix

A Bayesian Information Criterion (BIC)

The BIC is a model selection criterion that is the marginal likelihood with a model complexity penalty:

BIC = log p(y | θ̂) − (1/2) p log(N)

for observations y, number of observations N, maximum likelihood estimate (MLE) of the model hyperparameters θ̂, and number of hyperparameters p. It is derived as an approximation to the log model evidence log p(y).

B Compositional Kernel Search Algorithm

Algorithm 2: Compositional Kernel Search Algorithm
Input: data x_1, ..., x_n ∈ R^D, y_1, ..., y_n ∈ R, base kernel set B
Output: k, the resulting kernel
For each base kernel on each dimension, fit a GP to the data (i.e. optimise the hyperparameters by ML-II) and set k to be the kernel with the highest BIC.
for depth = 1:T (either fix T or repeat until the BIC no longer increases) do
    Fit a GP to the following kernels and set k to be the one with the highest BIC:
    (1) all kernels of the form k + B', where B' is any base kernel on any dimension
    (2) all kernels of the form k × B', where B' is any base kernel on any dimension
    (3) all kernels where a base kernel in k is replaced by another base kernel

C Base Kernels

LIN(x, x') = σ² (x − l)(x' − l)
SE(x, x') = σ² exp(−(x − x')² / (2l²))
PER(x, x') = σ² exp(−2 sin²(π(x − x')/p) / l²)

D Matrix Identities

Lemma 1 (Woodbury's Matrix Inversion Lemma).

(A + UBV)^{-1} = A^{-1} − A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}

So setting A = σ²I (Nyström), or σ²I + diag(K − K̂) (FIC), or σ²I + blockdiag(K − K̂) (PIC), together with U = Φ^T = V^T and B = I, we get:

(A + Φ^T Φ)^{-1} = A^{-1} − A^{-1} Φ^T (I + Φ A^{-1} Φ^T)^{-1} Φ A^{-1}

Lemma 2 (Sylvester's Determinant Theorem).

det(I_m + AB) = det(I_n + BA) for all A ∈ R^{m×n}, B ∈ R^{n×m}

Hence:

det(σ²I_N + Φ^T Φ) = (σ²)^N det(I_N + σ^{-2} Φ^T Φ) = (σ²)^N det(I_m + σ^{-2} Φ Φ^T) = (σ²)^{N−m} det(σ²I_m + Φ Φ^T)

E Proof of Proposition 1

Proof. If PCG converges, the upper bound for NIP is exact, and we showed in Section 4.1 that this convergence happens in only a few iterations. Moreover, Cortes et al. [7] show that the lower bound for NIP can be rather loose in general. So it suffices to prove that the upper bound for NLD is tighter than the lower bound for NLD. Let (λ_i)_{i=1}^N and (λ̂_i)_{i=1}^N be the ordered eigenvalues of K + σ²I and K̂ + σ²I respectively. Since K − K̂ is positive semi-definite (e.g. [3]), we have λ_i ≥ λ̂_i ≥ 2σ² for all i (using the assumption in the proposition). Now the slack in the upper bound is:

−(1/2) log det(K̂ + σ²I) − (−(1/2) log det(K + σ²I)) = (1/2) Σ_{i=1}^N (log λ_i − log λ̂_i)

and the slack in the lower bound is:

−(1/2) log det(K + σ²I) − [−(1/2) log det(K̂ + σ²I) − (1/(2σ²)) Tr(K − K̂)] = −(1/2) Σ_{i=1}^N (log λ_i − log λ̂_i) + (1/(2σ²)) Σ_{i=1}^N (λ_i − λ̂_i)

Now by concavity and monotonicity of log, and since λ̂_i ≥ 2σ², we have:

log λ_i − log λ̂_i ≤ (λ_i − λ̂_i)/λ̂_i ≤ (1/(2σ²)) (λ_i − λ̂_i)

Summing over i gives Σ_i (log λ_i − log λ̂_i) ≤ (1/(2σ²)) Σ_i (λ_i − λ̂_i), and hence

(1/2) Σ_i (log λ_i − log λ̂_i) ≤ −(1/2) Σ_i (log λ_i − log λ̂_i) + (1/(2σ²)) Σ_i (λ_i − λ̂_i),

i.e. the slack in the upper bound is at most the slack in the lower bound.
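For concreteness, the three base kernels in Appendix C can be written as a few NumPy functions; this is a minimal sketch for one-dimensional inputs, with argument names (sigma2, l, p) mirroring σ², l and p above.

```python
import numpy as np

def lin_kernel(x, x2, sigma2=1.0, l=0.0):
    """LIN(x, x') = sigma^2 (x - l)(x' - l)."""
    return sigma2 * np.outer(x - l, x2 - l)

def se_kernel(x, x2, sigma2=1.0, l=1.0):
    """SE(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2))."""
    d = x[:, None] - x2[None, :]
    return sigma2 * np.exp(-d**2 / (2.0 * l**2))

def per_kernel(x, x2, sigma2=1.0, l=1.0, p=1.0):
    """PER(x, x') = sigma^2 exp(-2 sin^2(pi (x - x') / p) / l^2)."""
    d = x[:, None] - x2[None, :]
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * d / p)**2 / l**2)

# Compositional kernels are built by adding and multiplying these Gram matrices,
# e.g. a locally periodic kernel: se_kernel(x, x) * per_kernel(x, x).
x = np.linspace(0.0, 4.0, 5)
K = se_kernel(x, x) * per_kernel(x, x) + lin_kernel(x, x)
```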

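The identities in Appendix D are easy to sanity-check numerically. A small sketch, assuming an arbitrary m × N feature matrix Phi and a test value for σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, sigma2 = 7, 3, 0.5
Phi = rng.standard_normal((m, N))          # low-rank feature matrix, Khat = Phi^T Phi

# Sylvester / determinant identity:
lhs = np.linalg.det(sigma2 * np.eye(N) + Phi.T @ Phi)
rhs = sigma2**(N - m) * np.linalg.det(sigma2 * np.eye(m) + Phi @ Phi.T)
assert np.isclose(lhs, rhs)

# Woodbury with A = sigma^2 I, U = Phi^T, B = I, V = Phi:
inv_direct = np.linalg.inv(sigma2 * np.eye(N) + Phi.T @ Phi)
A_inv = np.eye(N) / sigma2
inv_woodbury = A_inv - A_inv @ Phi.T @ np.linalg.inv(np.eye(m) + Phi @ A_inv @ Phi.T) @ Phi @ A_inv
assert np.allclose(inv_direct, inv_woodbury)
```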
F Convergence of hyperparameters from optimising the lower bound to the optimal hyperparameters

Note from Figure 7 that the hyperparameters found by optimising the lower bound converge to the hyperparameters found by the GP when optimising the marginal likelihood, giving empirical evidence for the second claim in Section 3.3.

G Parallelising SKC

Note that SKC can be parallelised across the random hyperparameter initialisations, and also across the kernels at each depth when computing the BIC intervals. In fact, SKC is even more parallelisable with the kernel buffer: say at a certain depth we have two kernels remaining to be optimised and evaluated before we can move on to the next depth. If the buffer size is 5, say, then we can in fact move on to the next depth and grow the kernel search tree on the top 3 kernels of the buffer, without having to wait for the 2 remaining kernel evaluations to complete. This saves a lot of computation time otherwise wasted by idle cores waiting for all kernel evaluations to finish before moving on to the next depth of the kernel search tree.

H Optimisation

Since we wish to use the learned kernels for interpretation, it is important that the hyperparameters lie in a sensible region after optimisation. In other words, we wish to regularise the hyperparameters during optimisation. For example, we want the SE kernel to learn a globally smooth function with local variation. When naïvely optimising the lower bound, sometimes the length scale and the signal variance become very small, so the SE kernel explains all the variation in the signal and ends up connecting the dots. We wish to avoid this type of behaviour. This can be achieved by giving priors to the hyperparameters and optimising the energy (the log prior added to the log marginal likelihood) instead, as well as using sensible initialisations. Looking at Figure 8, we see that using a strong prior with a sensible random initialisation (see Appendix I for details) gives a sensible, smoothly varying function, whereas in all three other cases the length scale and signal variance shrink to small values, causing the GP to overfit the data. Note that the weak prior is the default prior used in the GPstuff software [39].

Careful initialisation of the hyperparameters and inducing points is also very important, and can have a strong influence on the resulting optima. It is sensible to have the optimised hyperparameters of the parent kernel in the search tree be inherited and used to initialise the hyperparameters of the child. The new hyperparameters of the child must be initialised with random restarts, where the variance is small enough to ensure that they lie in a sensible region, but large enough to explore a good portion of this region. As for the inducing points, we want to spread them out to capture both local and global structure. Trying both K-means and a random subset of the training data, we find that they give similar results, so we resort to a random subset. We also have the option of learning the inducing points. However, this is considerably more costly and shows little improvement over fixing them, as we show in Section 4. Hence we do not learn the inducing points, but fix them to a given randomly chosen set.

Consequently, for SKC we use maximum a posteriori (MAP) estimates instead of the MLE for the hyperparameters when calculating the BIC, since the priors have a noticeable effect on the optimisation. This is justified [23] and has been used for example in [11], where they argue that using the MLE to estimate the BIC for Gaussian mixture models can fail due to singularities and degeneracies.
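The sketch below illustrates the kind of regularised (MAP) optimisation described above for a single SE kernel: the objective is the exact GP log marginal likelihood plus a log prior over the log hyperparameters, and the MAP estimate is then plugged into the BIC of Appendix A. It is only an illustration under simplified assumptions (exact GP rather than the variational lower bound, and placeholder prior scales), not the SKC/GPstuff implementation.

```python
import numpy as np
from scipy.optimize import minimize

def log_marginal_likelihood(theta, x, y):
    """Exact GP log marginal likelihood for an SE kernel plus noise; theta = log [l, sf2, sn2]."""
    l, sf2, sn2 = np.exp(theta)
    d = x[:, None] - x[None, :]
    K = sf2 * np.exp(-d**2 / (2 * l**2)) + sn2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # K^{-1} y
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(x) * np.log(2 * np.pi)

def log_prior(theta):
    """Independent Gaussian priors on the log hyperparameters (placeholder scales)."""
    return np.sum(-0.5 * theta**2)

def neg_energy(theta, x, y):
    return -(log_marginal_likelihood(theta, x, y) + log_prior(theta))

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)

theta0 = rng.standard_normal(3) * 0.5                        # random restart with small variance
theta_map = minimize(neg_energy, theta0, args=(x, y)).x
p = len(theta_map)
bic = log_marginal_likelihood(theta_map, x, y) - 0.5 * p * np.log(len(x))  # BIC with the MAP estimate
```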
I Hyperparameter initialisation and priors

Below, Z ∼ N(0, 1), and TN(µ, σ², I) denotes a Gaussian with mean µ and variance σ², truncated to the interval I and renormalised.

Signal noise:
σ_n² = 0.1 exp(Z/2)
p(log σ_n²) = N(0, 0.2)

LIN:
σ² = exp(V), where V ∼ TN(0, 1, [0, ∞)); l = exp(Z/2)
p(log σ²) = logunif, p(log l) = logunif

SE:
l = exp(Z/2), σ² = 0.1 exp(Z/2)
p(log l) = N(0, 0.1), p(log σ²) = logunif

PER:
p_min = log(10 (max(x) − min(x)) / N) (shortest possible period is 10 time steps)
p_max = log((max(x) − min(x)) / 5) (longest possible period is a fifth of the range of the data set)
l = exp(Z/2), p = exp(p_min + W) or exp(p_max + U) each w.p. 1/2, σ² = 0.1 exp(Z/2),
where W ∼ TN(0, 0.5, [0, ∞)) and U ∼ TN(0, 0.5, (−∞, 0])
p(log l) = t(µ = 0, σ² = 1, ν = 4)
p(log p) = LN(p_min + 0.5, 0.25) or LN(p_max − 2, 0.5) each w.p. 1/2
p(log σ²) = logunif

where LN(µ, σ²) is the log-Gaussian distribution and t(µ, σ², ν) is the Student's t-distribution.
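A small sketch of how such a random initialisation could be drawn for the PER kernel; the truncated-normal parameters and intervals follow the reconstruction above and should be read as assumptions.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def trunc_normal(mean, var, lower, upper):
    """Sample from TN(mean, var, [lower, upper])."""
    sd = np.sqrt(var)
    a, b = (lower - mean) / sd, (upper - mean) / sd
    return truncnorm.rvs(a, b, loc=mean, scale=sd, random_state=rng)

def init_per_hyperparams(x):
    """Random initialisation for the PER kernel on inputs x."""
    N, data_range = len(x), np.max(x) - np.min(x)
    p_min = np.log(10 * data_range / N)   # shortest period: 10 time steps
    p_max = np.log(data_range / 5)        # longest period: a fifth of the data range
    l = np.exp(rng.standard_normal() / 2)
    if rng.random() < 0.5:
        p = np.exp(p_min + trunc_normal(0.0, 0.5, 0.0, np.inf))
    else:
        p = np.exp(p_max + trunc_normal(0.0, 0.5, -np.inf, 0.0))
    sigma2 = 0.1 * np.exp(rng.standard_normal() / 2)
    return {"lengthscale": l, "period": p, "sigma2": sigma2}

x = np.linspace(1610, 2011, 402)
print(init_per_hyperparams(x))
```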

Figure 7: Log marginal likelihood and hyperparameter values after optimising the lower bound with the ARD kernel on a subset of the Power Plant data, for different values of m. This is compared against the GP values when optimising the true log marginal likelihood. Error bars show the mean ± 1 standard deviation over 10 random iterations.

Figure 8: GP predictions on the solar data set with the SE kernel for different priors and initialisations (y-axis: solar irradiance in W/m², x-axis: year; curves: data, strong prior with random init, weak prior with random init, strong prior with small init, weak prior with small init).

J Computation times

Look at Table 2. For all three data sets, the GP optimisation time is much greater than the sum of the Var GP optimisation time and the upper bound (NLD + NIP) evaluation time for m ≤ 80. Hence the savings in computation time for SKC are significant even for these small data sets. Note that we show the lower bound optimisation time against the upper bound evaluation time, instead of the evaluation times for both, since this is what happens in SKC: the lower bound has to be optimised for each kernel, whereas the upper bound only has to be evaluated once.

K Mauna and Solar plots and hyperparameter values found by SKC

Solar. The solar data has 26 cycles over 285 years, which gives a periodicity of around 10.9615 years. Using SKC with m = 40, we find the kernel SE × PER × SE × SE × PER. The value of the period hyperparameter in PER is 10.9569 years, hence SKC finds the periodicity to 3 s.f. with only 40 inducing points. The SE term converts the global periodicity to a local periodicity, with the extent of the locality governed by the length scale parameter of SE, equal to 45. This is fairly large, but smaller than the range of the domain (1610-2011), indicating that the periodicity spans a long time but isn't quite global. This is most likely due to the static solar irradiance between the years 1650-1700, adding a bit of noise to the periodicities.

Mauna. The annual periodicity in the data and the linear trend with positive slope are clear. Linear regression gives us a slope of 1.5149. SKC with m = 40 gives the kernel SE + PER + LIN. The period hyperparameter in PER takes the value 1, hence SKC successfully finds the right periodicity. The offset l and magnitude σ² parameters of LIN allow us to calculate the slope by the formula

σ² (x − l)^T [σ² (x − l)(x − l)^T + σ_n² I]^{-1} y

where σ_n² is the noise variance of the learned likelihood. This formula is obtained from the posterior mean of the GP, which is linear in the inputs for the linear kernel. This value amounts to 1.515, hence the slope found by SKC is accurate to 3 s.f.
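The slope computation from the learned LIN hyperparameters can be coded directly from the formula above; a minimal sketch with synthetic data and placeholder hyperparameter values:

```python
import numpy as np

def lin_posterior_slope(x, y, sigma2, l, noise_var):
    """Slope of the GP posterior mean for the LIN kernel:
    sigma^2 (x - l)^T [sigma^2 (x - l)(x - l)^T + sigma_n^2 I]^{-1} y."""
    xc = x - l
    A = sigma2 * np.outer(xc, xc) + noise_var * np.eye(len(x))
    return sigma2 * xc @ np.linalg.solve(A, y)

# Synthetic check: data with a known slope of about 1.5.
rng = np.random.default_rng(0)
x = np.linspace(1960, 2020, 120)
y = 1.5 * (x - 1960) + rng.standard_normal(len(x))
print(lin_posterior_slope(x, y, sigma2=1.0, l=1960.0, noise_var=1.0))  # ~1.5
```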

Table 2: Mean and standard deviation of computation times (in seconds) for full GP optimisation, Var GP optimisation (with and without learning inducing points), and the NLD and NIP (PCG using the PIC preconditioner) upper bounds, over 10 random iterations. m is the number of inducing points.

Method              m      Solar               Mauna                Concrete
GP                  -      29.195 ± 5.143      164.8828 ± 58.7865   43.8233 ± 127.364
Var GP              10     7.259 ± 4.3928      6.117 ± 3.8267       5.4358 ± .7298
                    20     8.3121 ± 5.4763     11.9245 ± 6.879      1.241 ± 2.519
                    40     1.2263 ± 4.125      17.1479 ± 1.7898     19.6678 ± 4.3924
                    80     9.6752 ± 6.5343     28.9876 ± 13.31      47.2225 ± 13.1955
                    160    25.633 ± 8.7934     91.46 ± 39.849       158.9199 ± 18.1276
                    320    76.3447 ± 2.3337    22.2369 ± 96.749     541.4835 ± 99.6145
NLD                 10     .19 ± .1            .33 ± .2             .113 ± .4
                    20     .26 ± .1            .46 ± .2             .166 ± .7
                    40     .43 ± .1            .79 ± .3             .286 ± .5
                    80     .84 ± .2            .154 ± .4            .554 ± .12
                    160    .188 ± .6           .338 ± .7            .1188 ± .3
                    320    .464 ± .32          .789 ± .36           .255 ± .74
NIP                 10     .474 ± .92          .12 ± .296           .2342 ± .26
                    20     .422 ± .13          .1274 ± .674         .1746 ± .45
                    40     .284 ± .75          .846 ± .43           .2345 ± .483
                    80     .199 ± .81          .553 ± .25           .2176 ± .376
                    160    .26 ± .53           .432 ± .19           .2136 ± .422
                    320    .25 ± .19           .676 ± .668          .2295 ± .433
Var GP (learn IP)   10     23.4 ± 14.6         42. ± 33.            11. ± 32.5
                    20     38.5 ± 17.5         62. ± 66.            7. ± 97.
                    40     124.7 ± 99.         32. ± 236.           37. ± 341.4
                    80     268.6 ± 196.6       1935. ± 113.         666. ± 41.
                    160    1483.6 ± 773.8      148. ± 5991.         4786. ± 46.9
                    320    2923.8 ± 1573.5     39789. ± 2387.       2596. ± 82.9

L Optimising the upper bound

If the upper bound is tighter and more robust with respect to the choice of inducing points, why don't we optimise the upper bound to find the hyperparameters? If this were possible, then we could maximise it to get an upper bound on the marginal likelihood with optimal hyperparameters. In fact this holds for any analytic upper bound whose value and gradients can be evaluated cheaply. Hence for any m, we can find an interval that contains the true optimised marginal likelihood. So if this interval is dominated by the interval of another kernel, we can discard the kernel, and there is no need to evaluate the bounds for bigger values of m. Now we wish to use values of m such that we can choose the right kernel (or kernels) at each depth of the search tree with minimal computation. This gives rise to an exploitation-exploration trade-off, whereby we want to balance between raising m for tight intervals that allow us to discard unsuitable kernels whose intervals fall strictly below those of other kernels, and quickly moving on to the next depth of the search tree to search for finer structure in the data. The search algorithm is highly parallelisable, and thus we may raise m simultaneously for all candidate kernels. At deeper levels of the search tree, there may be too many candidates for simultaneous computation, in which case we may select the ones with the highest upper bounds to get tighter intervals. Such attempts are listed below. Note the two inequalities for the NLD and NIP terms:

−(1/2) log det(K̂ + σ²I) − (1/(2σ²)) Tr(K − K̂) ≤ −(1/2) log det(K + σ²I) ≤ −(1/2) log det(K̂ + σ²I)    (5)

−(1/2) y^T (K̂ + σ²I)^{-1} y ≤ −(1/2) y^T (K + σ²I)^{-1} y ≤ −(1/2) y^T (K̂ + (σ² + Tr(K − K̂)) I)^{-1} y    (6)

where the first two inequalities come from [3], the third inequality is a direct consequence of K − K̂ being positive semi-definite, and the last inequality is from Michalis Titsias' lecture slides (http://www.aueb.gr/users/titsias/papers/titsiasnipsvar14.pdf).

Also from (4), we have that

−(1/2) log det(K̂ + σ²I) + (1/2) α^T (K + σ²I) α − α^T y

is an upper bound for all α ∈ R^N. Thus one idea for obtaining a cheap upper bound on the optimised marginal likelihood was to solve the following maximin optimisation problem:

max_θ min_{α ∈ R^N} −(1/2) log det(K̂ + σ²I) + (1/2) α^T (K + σ²I) α − α^T y

One way to solve this cheaply would be by coordinate descent, where one maximises with respect to θ with α fixed, then minimises with respect to α with θ fixed. However, σ tends to blow up in practice. This is because the expression is O(−log σ² + σ²) for fixed α, hence maximising with respect to σ pushes it towards infinity.

An alternative is to sum the two upper bounds above to get the upper bound

−(1/2) log det(K̂ + σ²I) − (1/2) y^T (K̂ + (σ² + Tr(K − K̂)) I)^{-1} y

However, we found that maximising this bound gives quite a loose upper bound unless m = O(N). Hence this upper bound is not very useful.
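The interval-based discarding rule described in Appendix L is straightforward to express in code. A minimal sketch (kernel names and bound values are made up for illustration): a kernel is discarded once its upper bound falls strictly below the best lower bound among the current candidates.

```python
def prune_kernels(intervals):
    """Keep only kernels whose [lower, upper] interval is not strictly dominated.

    intervals: dict mapping kernel name -> (lower_bound, upper_bound) on the
    optimised log marginal likelihood, e.g. from the variational lower bound and
    the NLD + NIP upper bound at the current number of inducing points m.
    """
    best_lower = max(lo for lo, _ in intervals.values())
    return {k: (lo, up) for k, (lo, up) in intervals.items() if up >= best_lower}

# Toy example: SE's interval lies strictly below PER's lower bound, so it is discarded;
# the surviving kernels would have their intervals tightened by raising m.
intervals = {"SE": (-120.0, -105.0), "PER": (-100.0, -90.0), "SE + PER": (-98.0, -85.0)}
print(prune_kernels(intervals))   # {'PER': ..., 'SE + PER': ...}
```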

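For reference, the interval endpoints in (5) and (6) can be mirrored in a few lines, given the exact Gram matrix K and a Nyström-type approximation K̂ = K_nm K_mm^{-1} K_mn. This is a small dense illustration only; in SKC the lower bound would come from the optimised variational objective and the NIP upper bound from PCG.

```python
import numpy as np

def nld_nip_bounds(K, Khat, y, sigma2):
    """Interval endpoints mirroring inequalities (5) and (6), given the exact Gram
    matrix K and a low-rank approximation Khat with K - Khat positive semi-definite."""
    N = len(y)
    t = np.trace(K - Khat)                                   # Tr(K - Khat)
    nld_up = -0.5 * np.linalg.slogdet(Khat + sigma2 * np.eye(N))[1]
    nld_lo = nld_up - t / (2 * sigma2)
    nip_lo = -0.5 * y @ np.linalg.solve(Khat + sigma2 * np.eye(N), y)
    nip_up = -0.5 * y @ np.linalg.solve(Khat + (sigma2 + t) * np.eye(N), y)
    return (nld_lo, nld_up), (nip_lo, nip_up)

# Toy example with an SE kernel and a Nystrom approximation from m = 10 inducing inputs.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + 0.1 * rng.standard_normal(60)
k = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / 2)
z = x[::6]
K, Kmm, Knm = k(x, x), k(z, z), k(x, z)
Khat = Knm @ np.linalg.solve(Kmm + 1e-8 * np.eye(len(z)), Knm.T)
print(nld_nip_bounds(K, Khat, y, sigma2=0.1))
```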
Figure 9: Plots of the small time series data: Solar and Mauna (panels: (a) Solar, y-axis solar irradiance in W/m², x-axis year; (b) Mauna, y-axis CO2 level, x-axis year).

M Random Fourier Features

Random Fourier Features (RFF), a.k.a. Random Kitchen Sinks, were introduced by [26] as a low-rank approximation to the kernel matrix. The method uses the following theorem:

Theorem 3 (Bochner's Theorem [29]). A stationary kernel k(d) is positive definite if and only if k(d) is the Fourier transform of a non-negative measure.

to give an unbiased low-rank approximation to the Gram matrix K = E[Φ^T Φ], with Φ ∈ R^{m×N}. A bigger m lowers the variance of the estimate. Using this approximation, one can compute determinants and inverses in O(Nm²) time. In the context of the kernel composition in Section 2, RFFs have the nice property that samples from the spectral density of a sum or product of kernels can easily be obtained as sums or mixtures of samples for the individual kernels (see Appendix N). We use this later to give a memory-efficient upper bound on the log marginal likelihood in Appendix P.

N Random Features for Sums and Products of Kernels

For RFF, the kernel can be approximated by the inner product of random features given by samples from its spectral density, in a Monte Carlo approximation, as follows:

k(x − y) = ∫_{R^D} e^{i v^T (x − y)} dP(v)
         = ∫_{R^D} p(v) e^{i v^T (x − y)} dv
         = E_{p(v)}[e^{i v^T x} (e^{i v^T y})*]
         = E_{p(v)}[Re(e^{i v^T x} (e^{i v^T y})*)]
         = E_{b,v}[φ(x)^T φ(y)]
         ≈ (1/m) Σ_{k=1}^m Re(e^{i v_k^T x} (e^{i v_k^T y})*)

where φ(x) = sqrt(2/m) (cos(v_1^T x + b_1), ..., cos(v_m^T x + b_m))^T, with spectral frequencies v_k drawn i.i.d. from p(v) and b_k drawn i.i.d. from U[0, 2π].
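A minimal sketch of these random cosine features for the SE kernel, whose spectral density is Gaussian; the helper name rff_features and the test values are illustrative only.

```python
import numpy as np

def rff_features(X, m, lengthscale=1.0, sigma2=1.0, rng=None):
    """Random Fourier features phi(x) = sqrt(2 sigma^2 / m) cos(V x + b) for the SE kernel.

    The SE kernel's spectral density is Gaussian with standard deviation 1/lengthscale,
    so the frequencies are standard normal draws scaled by 1/lengthscale.
    """
    rng = np.random.default_rng(rng)
    N, D = X.shape
    V = rng.standard_normal((m, D)) / lengthscale            # v_k ~ p(v)
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)                # b_k ~ U[0, 2*pi]
    return np.sqrt(2.0 * sigma2 / m) * np.cos(X @ V.T + b)   # shape (N, m)

# Check: Phi Phi^T approximates the SE Gram matrix, with error shrinking as m grows.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 1))
Phi = rff_features(X, m=20000, lengthscale=0.7, rng=1)
K_approx = Phi @ Phi.T
K_exact = np.exp(-(X - X.T)**2 / (2 * 0.7**2))
print(np.max(np.abs(K_approx - K_exact)))
```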

Let k_1, k_2 be two stationary kernels with respective spectral densities p_1, p_2, so that k_1(d) = a_1 p̂_1(d) and k_2(d) = a_2 p̂_2(d), where p̂(d) := ∫_{R^D} p(v) e^{i v^T d} dv. We use this convention for the Fourier transform. Note that a_i = k_i(0). Then

(k_1 + k_2)(d) = a_1 ∫ p_1(v) e^{i v^T d} dv + a_2 ∫ p_2(v) e^{i v^T d} dv = (a_1 + a_2) p̂_+(d)

where p_+(v) = (a_1/(a_1 + a_2)) p_1(v) + (a_2/(a_1 + a_2)) p_2(v), a mixture of p_1 and p_2. So to generate RFF for k_1 + k_2, generate v ∼ p_+ by drawing v ∼ p_1 w.p. a_1/(a_1 + a_2) and v ∼ p_2 w.p. a_2/(a_1 + a_2).

Now for the product, suppose

(k_1 k_2)(d) = a_1 a_2 p̂_1(d) p̂_2(d) = a_1 a_2 p̂_×(d)

Then p_× is the inverse Fourier transform of p̂_1 p̂_2, which is the convolution (p_1 ∗ p_2)(d) := ∫_{R^D} p_1(z) p_2(d − z) dz. So to generate RFF for k_1 k_2, generate v ∼ p_× by drawing v_1 ∼ p_1, v_2 ∼ p_2 and setting v = v_1 + v_2.

This is not applicable to non-stationary kernels, such as the linear kernel. We deal with this problem as follows. Suppose φ_1, φ_2 are random features such that k_1(x, x') = φ_1(x)^T φ_1(x') and k_2(x, x') = φ_2(x)^T φ_2(x'), with φ_i : R^D → R^m. It is straightforward to verify that

(k_1 + k_2)(x, x') = φ_+(x)^T φ_+(x')
(k_1 k_2)(x, x') = φ_×(x)^T φ_×(x')

where φ_+(·) = (φ_1(·)^T, φ_2(·)^T)^T and φ_×(·) = φ_1(·) ⊗ φ_2(·). However, we do not want the number of features to grow as we add or multiply kernels, since it would grow exponentially. We want to keep it at m features. So we subsample m entries from φ_+ (or φ_×) and scale by a factor of √2 (√m for φ_×), which still gives us unbiased estimates of the kernel, provided that each term of the inner product φ_+(x)^T φ_+(x') (or φ_×(x)^T φ_×(x')) is an unbiased estimate of (k_1 + k_2)(x, x') (or (k_1 k_2)(x, x')). This is how we generate random features for linear kernels combined with other stationary kernels, using the features φ(x) = (σ/√m)(x, ..., x) for the linear kernel.

O Spectral Density for PER

From [35], we have that the spectral density of the PER kernel is

Σ_{n=−∞}^{∞} (I_n(l^{-2}) / exp(l^{-2})) δ(v − 2πn/p)

where I_n is the modified Bessel function of the first kind.

P An upper bound to NLD using Random Fourier Features

Note that the function f(X) = −log det(X) is convex on the set of positive definite matrices [6]. Hence by Jensen's inequality we have, for Φ^T Φ an unbiased estimate of K:

−(1/2) log det(K + σ²I) = (1/2) f(K + σ²I) = (1/2) f(E[Φ^T Φ + σ²I]) ≤ (1/2) E[f(Φ^T Φ + σ²I)] = E[−(1/2) log det(Φ^T Φ + σ²I)]

Hence −(1/2) log det(Φ^T Φ + σ²I) is a stochastic upper bound on NLD that can be calculated in O(Nm²) time. An example of such an unbiased estimator Φ is given by RFF.
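The two sampling rules of Appendix N (mixture sampling for sums, adding frequencies for products) can be sketched as follows, assuming samplers for the component spectral densities and their weights a_i = k_i(0) are available:

```python
import numpy as np

def sample_sum_frequencies(m, sample_p1, sample_p2, a1, a2, rng):
    """Spectral frequencies for k1 + k2: draw from the mixture (a1 p1 + a2 p2) / (a1 + a2)."""
    from_p1 = rng.random(m) < a1 / (a1 + a2)
    return np.where(from_p1[:, None], sample_p1(m), sample_p2(m))

def sample_product_frequencies(m, sample_p1, sample_p2):
    """Spectral frequencies for k1 * k2: the density is a convolution, so add the samples."""
    return sample_p1(m) + sample_p2(m)

# Example in 1-D: two SE kernels with lengthscales 1.0 and 0.5 (Gaussian spectral densities).
rng = np.random.default_rng(0)
se1 = lambda m: rng.standard_normal((m, 1)) / 1.0
se2 = lambda m: rng.standard_normal((m, 1)) / 0.5
v_sum = sample_sum_frequencies(1000, se1, se2, a1=1.0, a2=1.0, rng=rng)
v_prod = sample_product_frequencies(1000, se1, se2)   # SE * SE is an SE with 1/l^2 = 1/1^2 + 1/0.5^2
```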

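Combining Appendix P with Lemma 2, the stochastic NLD upper bound can be evaluated without ever forming the N × N matrix; a sketch, with illustrative SE-kernel random features playing the role of Φ:

```python
import numpy as np

def nld_upper_bound(Phi, sigma2):
    """Stochastic upper bound on NLD = -0.5 log det(K + sigma^2 I), with K = E[Phi^T Phi].

    Uses det(Phi^T Phi + sigma^2 I_N) = (sigma^2)^(N - m) det(Phi Phi^T + sigma^2 I_m),
    so the cost is O(N m^2) rather than O(N^3).
    """
    m, N = Phi.shape
    small = Phi @ Phi.T + sigma2 * np.eye(m)
    logdet = (N - m) * np.log(sigma2) + np.linalg.slogdet(small)[1]
    return -0.5 * logdet

# Example: random cosine features for an SE kernel with lengthscale 0.7 (cf. Appendix M).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
m = 50
V = rng.standard_normal((m, 1)) / 0.7
b = rng.uniform(0, 2 * np.pi, size=(m, 1))
Phi = np.sqrt(2.0 / m) * np.cos(V @ X.T + b)   # shape (m, N), E[Phi^T Phi] = K
print(nld_upper_bound(Phi, sigma2=0.1))
```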
Q Further Plots

Figure 10: Same as Figures 1a and 1b but for the Mauna Loa data (panels: (a) Mauna: fix inducing points, (b) Mauna: learn inducing points; y-axes: negative log ML / energy; curves: full GP, Var LB, UB NLD, Nystrom NLD UB, RFF NLD UB; NIP panel: CG, PCG Nys, PCG FIC, PCG PIC).

Figure 11: Same as Figures 1a and 1b but for the Concrete data (panels: (a) Concrete: fix inducing points, (b) Concrete: learn inducing points).

Figure 12: Kernel search tree for SKC on the solar data up to depth 2 (nodes: SE, PER, LIN; depth 2: (SE)*SE, (SE)*PER, (SE)+SE, (PER)+SE, (PER)+PER, (PER)*PER, (PER)+LIN, (LIN)+LIN, (LIN)+SE, (PER)*LIN, (LIN)*LIN, (SE)*LIN). We show the upper and lower bounds for different numbers of inducing points.

Figure 13: Same as Figure 12 but for the Mauna data.

Figure 14: Same as Figure 12 but for the Concrete data, and up to depth 1.