Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes


Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes

Hyunjik Kim and Yee Whye Teh, University of Oxford, DeepMind

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

Abstract

Automating statistical modelling is a challenging problem in artificial intelligence. The Automatic Statistician takes a first step in this direction, by employing a kernel search algorithm with Gaussian Processes (GP) to provide interpretable statistical models for regression problems. However this does not scale due to its O(N^3) running time for the model selection. We propose Scalable Kernel Composition (SKC), a scalable kernel search algorithm that extends the Automatic Statistician to bigger data sets. In doing so, we derive a cheap upper bound on the GP marginal likelihood that sandwiches the marginal likelihood with the variational lower bound. We show that the upper bound is significantly tighter than the lower bound and thus useful for model selection.

1 Introduction

Automated statistical modelling is an area of research in its early stages, yet it is becoming an increasingly important problem [14]. As a growing number of disciplines use statistical analyses and models to help achieve their goals, the demand for statisticians, machine learning researchers and data scientists is at an all-time high. Automated systems for statistical modelling serve to assist such human resources, if not as a best alternative where there is a shortage. An example of a fruitful attempt at automated statistical modelling in nonparametric regression is Compositional Kernel Search (CKS) [10], an algorithm that fits a Gaussian Process (GP) to the data and automatically chooses a suitable parametric form of the kernel. This leads to high predictive performance that matches kernels hand-selected by GP experts [27]. There also exist other approaches that tackle this model selection problem by using a more flexible kernel [2, 24, 30, 41, 43]. However the distinctive feature of Duvenaud et al. [10] is that the resulting models are interpretable; the kernels are constructed in such a way that we can use them to describe patterns in the data, and thus they can be used for automated exploratory data analysis. Lloyd et al. [19] exploit this to generate natural language analyses from these kernels, a procedure that they name Automatic Bayesian Covariance Discovery (ABCD). The Automatic Statistician (1) implements this to output a 10-15 page report when given data input. However, a limitation of ABCD is that it does not scale; due to the O(N^3) time for inference in GPs, the analysis is constrained to small data sets, specialising in one-dimensional time series data. This is undesirable not only because the average size of data sets is growing fast, but also because there is potentially more information in bigger data, implying a greater need for more expressive models that can discover finer structure. This paper proposes Scalable Kernel Composition (SKC), a scalable extension of CKS, to push the boundaries of automated interpretable statistical modelling to bigger data. In summary, our work makes the following contributions: We propose the first scalable version of the Automatic Statistician that scales up to medium-sized data sets by reducing algorithmic complexity from O(N^3) to O(N^2) and enhancing parallelisability.
We derive a novel cheap upper bound on the GP marginal likelihood that is used in SKC together with the variational lower bound [37] to sandwich the GP marginal likelihood. We show that our upper bound is significantly tighter than the lower bound and plays an important role in model selection.

(1) See the Automatic Statistician website for example analyses.

2 ABCD and CKS

The Compositional Kernel Search (CKS) algorithm [10] builds on the idea that the sum and product of two positive definite kernels are also positive definite. Starting off with a set B of base kernels defined on R x R, the algorithm searches through the space of zero-mean GPs with kernels that can be expressed in terms of sums and products of these base kernels. B = {SE, LIN, PER} is used, which correspond to the squared exponential, linear and periodic kernel respectively (see Appendix C for the form of these base kernels). Thus candidate kernels form an open-ended space of GP models, allowing for an expressive model. Such approaches to structure discovery have also appeared in [12, 13]. A greedy search is employed to explore this space, with each kernel scored by the Bayesian Information Criterion (BIC) [31] (2) after optimising the kernel hyperparameters by type II maximum likelihood (ML-II). See Appendix B for the algorithm in detail. The resulting kernel can be simplified so that it is expressed as a sum of products of base kernels, which has the notable benefit of interpretability. In particular, note that f_1 ~ GP(0, k_1), f_2 ~ GP(0, k_2) with f_1 and f_2 independent implies f_1 + f_2 ~ GP(0, k_1 + k_2). So a GP whose kernel is a sum of products of base kernels can be interpreted as a sum of GPs, each with structure given by a product of base kernels. Now each base kernel in a product modifies the model in a consistent way. For example, multiplication by SE converts global structure into local structure since SE(x, x') decreases exponentially with |x - x'|, and multiplication by PER is equivalent to multiplication of the modelled function by a periodic function (see Lloyd et al. [19] for detailed interpretations of different combinations). This observation is used in Automatic Bayesian Covariance Discovery (ABCD) [19], giving a natural language description of the resulting function modelled by the composite kernel. In summary, ABCD consists of two algorithms: the compositional kernel search CKS, and the natural language translation of the kernel into a piece of exploratory data analysis.

(2) BIC = log marginal likelihood with a model complexity penalty. We use a definition where higher means better model fit. See Appendix A.

3 Scaling up ABCD

ABCD provides a framework for a natural extension to big data settings, in that we only need to be able to scale up CKS; then the natural language description of models can be directly applied. The difficulty of this extension lies in the O(N^3) time for evaluation of the GP marginal likelihood and its gradients with respect to the kernel hyperparameters. A naive approach is to subsample the data to reduce N, but then we may fail to capture global structure such as periodicities with long periods, or omit a set of points displaying a certain local structure. We show failure cases of random subsampling in Section 4.3. Regarding more strategic subsampling, the possibility of a generic subsampling algorithm for GPs that is able to capture the aforementioned properties of the data is a challenging research problem in itself. Alternatively, it is tempting to use either an approximate marginal likelihood or the marginal likelihood of an approximate model as a proxy for the likelihood [25, 32, 34, 37], especially with modern GP approximations scaling to large data sets with millions of data points [15, 42]. However, such scalable GP methods are limited in interpretability as they often behave very differently from the full GP, lacking guarantees that the chosen kernel faithfully reflects the actual structure in the data. In other words, the real challenge is to scale up the GP while preserving interpretability, and this is a difficult problem due to the tradeoff between scalability and accuracy of GP approximations.
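To make the quantities concrete, the following is a minimal Python sketch (not the authors' implementation) of the objects CKS works with: simple versions of the SE, LIN and PER base kernels (the exact parameterisations used in the paper are given in its Appendix C), a composite kernel built from sums and products, the exact log marginal likelihood of a zero-mean GP, and the BIC used to score kernels (higher is better). All hyperparameter values are illustrative.

import numpy as np

def SE(x, xp, ell=1.0, var=1.0):            # squared exponential, 1-D inputs
    return var * np.exp(-0.5 * (x - xp.T) ** 2 / ell ** 2)

def LIN(x, xp, var=1.0, loc=0.0):           # linear
    return var * (x - loc) @ (xp - loc).T

def PER(x, xp, ell=1.0, period=1.0, var=1.0):  # periodic
    return var * np.exp(-2.0 * np.sin(np.pi * np.abs(x - xp.T) / period) ** 2 / ell ** 2)

def log_marginal_likelihood(K, y, noise_var=0.1):
    # exact GP log marginal likelihood: needs a Cholesky of an N x N matrix, O(N^3)
    N = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)

def bic(log_ml, n_hyperparams, N):
    return log_ml - 0.5 * n_hyperparams * np.log(N)

x = np.linspace(0, 4, 300)[:, None]
y = np.sin(2 * np.pi * x[:, 0]) * np.exp(-0.1 * x[:, 0]) + 0.05 * np.random.randn(300)
K = SE(x, x, ell=2.0) * PER(x, x) + LIN(x, x, var=0.1)   # sums/products of PSD kernels are PSD
print(bic(log_marginal_likelihood(K, y), n_hyperparams=7, N=len(y)))

This cubic cost must be paid once per candidate kernel and per hyperparameter optimisation step, which is the bottleneck that SKC targets.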
Our work pushes the frontiers of interpretable GPs to medium-sized data (N = 10K-100K) by reducing the computational complexity of ABCD from O(N^3) to O(N^2). Extending this to large data sets (N = 100K-1M) is a difficult open problem. Our approach is as follows: we provide a cheap lower bound and upper bound that sandwich the marginal likelihood, and we use this interval for model selection. To do so, we first give a brief overview of the relevant work on low-rank kernel approximations used for scaling up GPs, and later outline how they can be applied to obtain cheap lower and upper bounds.

3.1 Nyström Methods and Sparse GPs

The Nyström method [9, 40] selects a set of m inducing points in the input space R^D that attempt to explain all the covariance in the Gram matrix of the kernel; the kernel is evaluated for each pair of inducing points and also between the inducing points and the data, giving matrices K_MM and K_NM = K_MN^T. These are used to create the Nyström approximation K_hat = K_NM (K_MM)^+ K_MN of K = K_NN, where ^+ denotes the pseudo-inverse. Applying a Cholesky decomposition to K_MM, we see that the approximation admits the low-rank form Phi Phi^T and so allows efficient computation of determinants and inverses in O(Nm^2) time (see Appendix D). We later use the Nyström approximation to give an upper bound on the log marginal likelihood.
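The following is a minimal sketch of this construction (assumptions: an SE kernel and inducing points taken as a random subset of the data, which is also what the experiments below end up recommending). It shows why the low-rank form Phi Phi^T gives determinants and inverses in O(Nm^2) time, via the matrix determinant lemma and the Woodbury identity.

import numpy as np

def SE(x, xp, ell=1.0, var=1.0):
    return var * np.exp(-0.5 * (x - xp.T) ** 2 / ell ** 2)

N, m, s2 = 2000, 50, 0.1                      # s2 is the noise variance sigma^2
x = np.sort(np.random.rand(N, 1), axis=0)
y = np.sin(6 * x[:, 0]) + np.sqrt(s2) * np.random.randn(N)
z = x[np.random.choice(N, m, replace=False)]  # inducing points

K_mm = SE(z, z) + 1e-8 * np.eye(m)            # jitter for numerical stability
K_Nm = SE(x, z)
L = np.linalg.cholesky(K_mm)
Phi = np.linalg.solve(L, K_Nm.T).T            # K_hat = Phi @ Phi.T, rank m

A = np.eye(m) + Phi.T @ Phi / s2              # only m x m systems from here on
# log det(K_hat + s2*I) via the matrix determinant lemma
logdet = np.linalg.slogdet(A)[1] + N * np.log(s2)
# (K_hat + s2*I)^{-1} y via the Woodbury identity
Kinv_y = y / s2 - Phi @ np.linalg.solve(A, Phi.T @ y) / s2 ** 2

Since m is much smaller than N, forming Phi and A dominates the cost, which is why all the bound computations below stay quadratic or better in N.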

The Nyström approximation arises naturally in the sparse GP literature, where certain distributions are approximated by simpler ones involving f_m, the GP evaluated at the m inducing points: the Deterministic Training Conditional (DTC) approximation [32] defines a model that gives the marginal likelihood q(y) = N(y | 0, K_hat + sigma^2 I) (y is the vector of observations), whereas the Fully Independent Conditional (FIC) approximation [34] gives q(y) = N(y | 0, K_hat + diag(K - K_hat) + sigma^2 I), correcting the Nyström approximation along the diagonal. The Partially Independent Conditional (PIC) approximation [25] further improves this by correcting the Nyström approximation on block diagonals, with blocks typically of size m. Note that the approximation is no longer low rank for FIC and PIC, but matrix inversion can still be computed in O(Nm^2) time by Woodbury's Lemma (see Appendix D). The variational inducing points method (VAR) [37] is rather different to DTC/FIC/PIC in that it gives the following variational lower bound on the log marginal likelihood:

log N(y | 0, K_hat + sigma^2 I) - (1/(2 sigma^2)) Tr(K - K_hat)    (1)

This lower bound is optimised with respect to the inducing points and the kernel hyperparameters, and is shown in the paper to successfully yield tight lower bounds in O(Nm^2) time for reasonable values of m. Another useful property of VAR is that the lower bound can only increase as the set of inducing points grows [21, 37]. It is also known that VAR always improves with extra computation, and that it successfully recovers the true posterior GP in most cases, contrary to other sparse GP methods [4]. Hence this is what we use in SKC to obtain a lower bound on the marginal likelihood and to optimise the hyperparameters. We use 10 random initialisations of the hyperparameters and choose the one with the highest lower bound after optimisation.

3.2 A cheap and tight upper bound on the log marginal likelihood

Fixing the hyperparameters to be those tuned by VAR, we seek a cheap upper bound on the marginal likelihood. Upper bounds and lower bounds are qualitatively different, and in general it is more difficult to obtain an upper bound than a lower bound, for the following reason: first note that the marginal likelihood is the integral of the likelihood with respect to the prior density of the parameters. Hence to obtain a lower bound it suffices to exhibit regions in the parameter space giving high likelihood. However, to obtain an upper bound one must demonstrate the absence or lack of likelihood mass outside a certain region, an arguably more difficult task. There has been some work on the subject [5, 17], but to the best of our knowledge there has not been any work on cheap upper bounds to the marginal likelihood in large N settings. So finding an upper bound from the perspective of the marginal likelihood can be difficult. Instead, we exploit the fact that the GP marginal likelihood has an analytic form, and treat it as a function of K. The GP log marginal likelihood is composed of two terms and a constant:

log p(y) = log N(y | 0, K + sigma^2 I) = -(1/2) log det(K + sigma^2 I) - (1/2) y^T (K + sigma^2 I)^{-1} y - (N/2) log(2 pi)    (2)

We give separate upper bounds on the negative log determinant (NLD) term and the negative inner product (NIP) term. For NLD, it has been proven that

-(1/2) log det(K + sigma^2 I) <= -(1/2) log det(K_hat + sigma^2 I)    (3)

a consequence of K - K_hat being a Schur complement of K_MM and hence positive semi-definite (e.g. [3]). So the Nyström approximation plugged into the NLD term serves as an upper bound that can be computed in O(Nm^2) time (see Appendix D).
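The two quantities introduced so far, the collapsed lower bound of eq. (1) and the NLD upper bound of eq. (3), can both be written in terms of the low-rank factor Phi from the Nyström sketch above. The following is a minimal sketch (not the paper's code); Kdiag is the diagonal of the exact kernel matrix (all ones for an SE kernel with unit variance), and the VAR optimisation over hyperparameters and inducing points is deliberately omitted here: the sketch only evaluates the bounds at fixed values.

import numpy as np

def var_lower_bound(Phi, y, s2, Kdiag):
    # eq. (1): log N(y | 0, K_hat + s2*I) - Tr(K - K_hat) / (2*s2)
    N, m = Phi.shape
    A = np.eye(m) + Phi.T @ Phi / s2                          # m x m
    logdet_Khat = np.linalg.slogdet(A)[1] + N * np.log(s2)    # log det(K_hat + s2*I)
    quad = y @ y / s2 - (Phi.T @ y) @ np.linalg.solve(A, Phi.T @ y) / s2 ** 2
    log_q = -0.5 * logdet_Khat - 0.5 * quad - 0.5 * N * np.log(2 * np.pi)
    trace_term = (Kdiag.sum() - (Phi * Phi).sum()) / (2 * s2)  # Tr(K - K_hat) >= 0
    return log_q - trace_term

def nld_upper_bound(Phi, s2):
    # eq. (3): -1/2 log det(K + s2*I) <= -1/2 log det(K_hat + s2*I)
    N, m = Phi.shape
    A = np.eye(m) + Phi.T @ Phi / s2
    return -0.5 * (np.linalg.slogdet(A)[1] + N * np.log(s2))

Both functions only solve m x m systems after the O(Nm^2) products involving Phi, which is what keeps the bound evaluations cheap.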
As for NIP, we point out that

lambda y^T (K + sigma^2 I)^{-1} y = min_{f in H} sum_{i=1}^N (y_i - f(x_i))^2 + lambda ||f||_H^2, with lambda = sigma^2,

the optimal value of the objective function in kernel ridge regression, where H is the Reproducing Kernel Hilbert Space associated with k (e.g. [23]). The dual problem, whose objective function has the same optimal value, is max_{alpha in R^N} -lambda [alpha^T (K + sigma^2 I) alpha - 2 alpha^T y]. So we have the following upper bound:

-(1/2) y^T (K + sigma^2 I)^{-1} y <= (1/2) alpha^T (K + sigma^2 I) alpha - alpha^T y    (4)

for all alpha in R^N. Note that this is also in the form of an objective for conjugate gradients (CG) [33], hence equality is obtained at the optimal value alpha_hat = (K + sigma^2 I)^{-1} y. We can approach the optimum for a tighter bound by using CG or preconditioned CG (PCG) for a fixed number of iterations to get a reasonable approximation to alpha_hat. Each iteration of CG and the computation of the upper bound takes O(N^2) time, but PCG is very fast even for large data sets, and using FIC/PIC as the preconditioner gives the fastest convergence in general [8]. Recall that although the lower bound takes O(Nm^2) to compute, we need m = O(N^beta) for accurate approximations, where beta depends on the data distribution and the kernel [28]. beta is usually close to 0.5, hence the lower bound is also effectively O(N^2). In practice, the upper bound evaluation seems to be a little more expensive than the lower bound evaluation, but we only need to compute the upper bound once, whereas we must evaluate the lower bound and its gradients multiple times for the hyperparameter optimisation.
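The following is a minimal sketch of the NIP upper bound of eq. (4). It uses plain CG from SciPy rather than the FIC/PIC-preconditioned CG reported to converge fastest in the paper; the key point is that any alpha gives a valid upper bound, and a few iterations already make it tight.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def nip_upper_bound(K, y, s2, iters=20):
    N = len(y)
    A = LinearOperator((N, N), matvec=lambda v: K @ v + s2 * v)  # O(N^2) per matvec
    alpha, _ = cg(A, y, maxiter=iters)            # approximate (K + s2*I)^{-1} y
    return 0.5 * alpha @ (K @ alpha + s2 * alpha) - alpha @ y    # eq. (4) right-hand side

# sanity check on a small random SE-type kernel matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
y = rng.standard_normal(200)
exact = -0.5 * y @ np.linalg.solve(K + 0.1 * np.eye(200), y)
assert nip_upper_bound(K, y, 0.1) >= exact - 1e-9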

Algorithm 1: Scalable Kernel Composition (SKC)
Input: data x_1, ..., x_n in R^D, y_1, ..., y_n in R, base kernel set B, depth d, number of inducing points m, kernel buffer size S.
Output: k*, the resulting kernel.
For each base kernel on each dimension, obtain lower and upper bounds on the BIC (the BIC interval), set k* to be the kernel with the highest upper bound, and add k* to the kernel buffer K. Set C to the empty set.
for depth = 1 : d do
    From C, add to K all kernels whose BIC intervals overlap with that of k* if there are fewer than S of them, else add the kernels with the top S upper bounds.
    for k in K do
        Add the following kernels to C and obtain their BIC intervals:
        (1) all kernels of the form k + B, where B is any base kernel on any dimension;
        (2) all kernels of the form k x B, where B is any base kernel on any dimension.
    if there exists k in C with a higher upper bound than k*, then set k* to k.

We later confirm in Section 4.1 that the upper bound is fast to compute relative to the lower bound optimisation. We also show empirically that the upper bound is tighter than the lower bound in Section 4.1, and give the following sufficient condition for this to be true:

Proposition 1. Suppose (lambda_hat_i)_{i=1}^N are the eigenvalues of K_hat + sigma^2 I in descending order. If (P)CG for the NIP term converges and lambda_hat_N >= 2 sigma^2, then the upper bound is tighter than the lower bound.

Notice that lambda_hat_N >= sigma^2 for any K_hat, so the assumption is feasible. The proof is in Appendix E. We later show in Section 4 that the upper bound is not only tighter than the lower bound, but also much less sensitive to the choice of inducing points. Hence we use the upper bound to choose between kernels whose BIC intervals overlap.

3.3 SKC: Scalable Kernel Composition using the lower and upper bounds

We base our algorithm on two intuitive claims. First, the lower and upper bounds converge to the true log marginal likelihood for fixed hyperparameters as the number of inducing points m increases. Second, the hyperparameters optimised by VAR converge to those obtained by optimising the log marginal likelihood as m increases. The former is confirmed in Figure 2 as well as in other works (e.g. [4] for the lower bound), and the latter is confirmed in Appendix F. The algorithm proceeds as follows: for each base kernel and a fixed value of m, we compute the lower and upper bounds to obtain an interval for the GP marginal likelihood and hence for the BIC of the kernel, with its hyperparameters optimised by VAR. We rank these kernels by their BIC intervals, using the upper bound as a tie-breaker for kernels with overlapping intervals. We then perform a semi-greedy kernel search, expanding the search tree on all (or some, controlled by the buffer size S) kernels whose intervals overlap with that of the top kernel at the current depth. We recurse to the next depth by computing intervals for these child kernels (parent kernel +/x base kernel), ranking them and further expanding the search tree. This is summarised in Algorithm 1, and Figure 12 in Appendix Q is a visualisation of the search tree for different values of m. Details on the optimisation and initialisation are given in Appendices H and I. The hyperparameters found by VAR may not be the global maximisers of the GP marginal likelihood, but as in ABCD we can optimise the marginal likelihood with multiple random seeds and choose the local optimum closest to the global optimum. One may still question whether the hyperparameter values found by VAR agree with the structure in the data, such as period values and slopes of linear trends.
We show in Sections 4.2 and 4.3 that a small m suffices for this to be the case.

Choice of m. We can guarantee that the lower bound increases with larger m, but cannot guarantee that the upper bound decreases, since we tend to get better hyperparameters for higher m, which boost the marginal likelihood and hence the upper bound. We verify this in Section 4.1. Hence throughout SKC we fix m to be the largest value that one can afford, so that the marginal likelihood with hyperparameters optimised by VAR is as close as possible to the marginal likelihood with optimal hyperparameters. It is natural to wonder whether an adaptive choice of m is possible, using higher m for more promising kernels to tighten their bounds. However, a fair comparison of different kernels via this lower and upper bound requires that they use the same value of m, since a higher m is guaranteed to give a higher lower bound.

Parallelisability. Note that SKC is extremely parallelisable across different random initialisations and different kernels at each depth, as is CKS. In fact the presence of BIC intervals in SKC, and hence of buffers of kernels, allows further parallelisation over kernels of different depths in certain cases (see Appendix G).
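To summarise the search itself, here is a schematic Python sketch of the loop in Algorithm 1. The helper bic_interval(k) is a hypothetical stand-in for the procedure described above (optimise hyperparameters with the VAR lower bound, then return the lower/upper BIC pair); kernels are represented as nested tuples purely for illustration, not as in the authors' implementation.

BASE = ["SE", "LIN", "PER"]            # one copy per input dimension in practice

def children(k):
    # expand a kernel by adding or multiplying a base kernel
    return [("+", k, b) for b in BASE] + [("*", k, b) for b in BASE]

def skc(bic_interval, depth, buffer_size):
    candidates = [(b, bic_interval(b)) for b in BASE]
    best, (best_lo, best_hi) = max(candidates, key=lambda c: c[1][1])
    for _ in range(depth):
        # buffer: kernels whose interval overlaps the incumbent's, capped at buffer_size,
        # preferring the highest upper bounds
        overlap = [c for c in candidates if c[1][1] >= best_lo]
        buffer = sorted(overlap, key=lambda c: -c[1][1])[:buffer_size]
        candidates = [(c, bic_interval(c)) for k, _ in buffer for c in children(k)]
        if not candidates:
            break
        cand, (lo, hi) = max(candidates, key=lambda c: c[1][1])
        if hi > best_hi:                 # the upper bound breaks ties between kernels
            best, (best_lo, best_hi) = cand, (lo, hi)
    return best

The overlap test keeps any kernel whose upper bound still reaches above the incumbent's lower bound, which is what makes the search semi-greedy rather than strictly greedy.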

Figure 1: (a) Solar, fixed inducing points. Left: log marginal likelihood (ML) for the full GP with optimised hyperparameters, the optimised VAR LB for each of 10 random initialisations per m, the log ML for the best hyperparameters out of the 10, and the corresponding UB. Middle: the NLD term and its Nyström and RFF UBs. Right: the NIP term and its UB after iterations of CG/PCG with Nyström, FIC and PIC preconditioners. (b) Solar, learned inducing points: same as (a), except that the inducing points are learned in the LB optimisation and used for the subsequent computations.

4 Experiments

4.1 Investigating the behaviour of the lower bound (LB) and upper bound (UB)

We present results for experiments showing the bounds we obtain for two small time series and a multidimensional regression data set, for which CKS is feasible. The first is the annual solar irradiance data from 1610 to 2011, with 402 observations [18]. The second is the Mauna Loa CO2 time series [36] with 689 observations. See Appendix K for plots of the time series. The multidimensional data set is the concrete compressive strength data set with 1030 observations and 8 covariates [44]. The functional forms of the kernels used for each of these data sets have been found by CKS (see Figure 3). All observations and covariates have been normalised to have mean 0 and variance 1. From the left of Figures 1a, 10a and 11a (the latter two can be found in Appendix Q), we see that VAR gives a LB on the optimal log marginal likelihood that improves with increasing m. The best LB is tight relative to the log marginal likelihoods at the hyperparameters optimised by VAR. We also see that the UB is even tighter than the LB, and increases with m as hypothesised. From the middle plots, we observe that the Nyström approximation gives a very tight UB on the NLD term. We also tried using RFF to get a stochastic UB (see Appendix P), but this is not as tight, especially for larger values of m. From the right plots, we can see that PCG with any of the three preconditioners (Nyström, FIC, PIC) gives a very tight UB on the NIP term, whereas CG may require more iterations to get tight, for example in Figures 10a and 10b in Appendix Q. Figure 2 shows further experimental evidence for a wider range of kernels, reinforcing the claim that the UB is tighter than the LB as well as much less sensitive to the choice of inducing points. This is why the UB is a much more reliable metric for the model selection than the LB. Hence we use the UB for a one-off comparison of kernels when their BIC intervals overlap. Comparing Figures 1a, 10a, 11a against 1b, 10b, 11b, learning inducing points does not lead to a vast improvement in the VAR LB. In fact the differences are not very significant, and sometimes learning inducing points can get the LB stuck in a bad local minimum, as indicated by the high variance of the LB in the latter three figures. Moreover the difference in computational time is significant, as we can see in Table 2 of Appendix J. Hence the computation-accuracy trade-off is best when fixing the inducing points to a random subset of the training data. Table 2 also compares times for the different computations after fixing the inducing points. The gains from using the variational LB instead of the full GP are clear, especially for the larger data sets, and we also confirm that it is indeed the optimisation of the LB that is the bottleneck in terms of computational cost.
We also see that the NIP UB computation times are similarly fast for all m, thus convergence of PCG with the PIC preconditioner is happening in only a few iterations.

4.2 SKC on small data sets

We compare the kernels chosen by CKS and by SKC for the three data sets. The results are summarised in Figure 3. For solar, we see that SKC successfully finds SE x PER, which is the second highest kernel for CKS, with BIC very close to the top kernel. For Mauna, SKC selects (SE + PER) x SE + LIN, which is third highest for CKS and has a BIC very close to the top kernel. Looking at the values of the hyperparameters in the kernels PER and LIN found by SKC, 4 inducing points are sufficient for it to successfully find the correct periods and slopes in the data, reinforcing the claim that we only need a small m to find good hyperparameters (see Appendix K for details).

Figure 2: UB and LB for kernels at depth 1 on each dimension of the concrete data (LIN, PER and SE on each of the 8 dimensions), while varying the inducing points with hyperparameters fixed to the optimal values for the full GP. Error bars show mean plus/minus 1 standard deviation over 10 random sets of inducing points.

Figure 3: CKS & SKC results up to depth 6 (panels: CKS Solar; SKC Solar, m = 8, S = 2; CKS Mauna; SKC Mauna, m = 8, S = 2; CKS Concrete; SKC Concrete, m = 8, S = 2). Left: BIC of the kernels chosen at each depth by CKS. Right: BIC intervals of the kernels that have been added to the buffer by SKC with m = 8. The arrow indicates the kernel chosen at the end.

For concrete, a more challenging eight-dimensional data set, we see that the kernels selected by SKC do not match those selected by CKS, but it still manages to find similar additive structure such as SE1 + SE8 and SE4. Of course, the BIC intervals for kernels found by SKC are for hyperparameters found by VAR with m = 8, hence do not necessarily contain the optimal BIC of the kernels in CKS. However, the above results show that our method is still capable of selecting appropriate kernels even for low values of m, without having to home in on the optimal BIC using high values of m. The savings in computation time are significant even for these small data sets, as shown in Table 2 of Appendix J.

4.3 SKC on medium-sized data sets & why the lower bound is not enough

The UB has marginal benefits over the LB for small data sets, where the LB is already a fairly good estimate of the true BIC. However, this gap becomes significant as N grows and as kernels become more complex; the UB is much tighter and much more stable with respect to the choice of inducing points (as shown in Figure 2), and plays a crucial role in the model selection.

SKC on Power Plant (m = 32): SE1, (SE1)+SE2, ((SE1)+SE2)+SE2, (((SE1)+SE2)+SE2)*SE3, ((((SE1)+SE2)+SE2)*SE3)*SE4.
SKC-LB on Power Plant (m = 32): SE1, (SE1)+SE2, ((SE1)+SE2)*PER3, (((SE1)+SE2)*PER3)+LIN4, ((((SE1)+SE2)*PER3)+LIN4)+SE2.
Figure 4: Comparison of SKC (top) and SKC-LB (bottom), with m = 32, S = 1, up to depth 5 on the Power Plant data. The format is the same as Figure 3, but with the crosses at the true BIC instead of the BIC bounds.

Power Plant. We first show this for SKC on the Power Plant data with 9568 observations and 4 covariates [38]. We see from Figure 4 that again the UB is much tighter than the LB, especially for more complex kernels further down the search tree. Comparing the two plots, we also see that the use of the UB in SKC has a significant positive impact on model selection: the kernel found by SKC at depth 5 has a considerably higher BIC than the corresponding kernel found by just using the LB (SKC-LB). The pathological behaviour of SKC-LB occurs at depth 3, where the (SE1+SE2)*PER3 kernel found by SKC-LB has a higher LB than the corresponding SE1+SE2+SE2 kernel found by SKC. SKC-LB chooses the former kernel since it has a higher LB, which is a failure mode since the latter kernel has a significantly higher BIC. Due to the tightness of the UB, SKC correctly chooses the latter kernel over the former, escaping the failure mode.

CKS vs SKC runtime. We run CKS and SKC for m = 16, 32, 64 on the Power Plant data on the same machine with the same hyperparameter settings. The runtimes up to depths 1/2 are: 28.7h/94.7h (CKS), 0.6h/2.6h (SKC, m = 16), 1.1h/4.1h (SKC, m = 32), 4.2h/15.0h (SKC, m = 64). Again, the reductions in computation time for SKC are substantial.

Time Series. We also apply SKC to medium-sized time series data to explore whether it can successfully find known structure at different scales. Such data sets occur frequently as hourly time series over several years or daily time series over centuries.

GEFCOM. First we use the energy load data set from the 2012 Global Energy Forecasting Competition [16], an hourly time series of energy load from 2004/1/1 to 2008/6/30, with 8 weeks of missing data, giving N = 38,070. Notice that this data size is far beyond the scope of full GP optimisation in CKS.

Figure 5: Top: plot of the full GEFCOM data (load (kW) against year). Bottom: zoom in on the first 7 days (load (kW) against day in Jan 2004).

From the plots, we can see that there is a noisy 6-month periodicity in the time series, as well as a clear daily periodicity with peaks in the morning and evening. Despite the periodicity, there are some noticeable irregularities in the daily pattern.

Table 1: Hyperparameters of the kernels found by SKC on the GEFCOM data and on the Tidal data after normalising y. Length scales, periods and locations are converted to the original scale; sigma^2 is left as found with normalised y.
GEFCOM: SE1 (sigma^2 = 0.44, l = 6.5 days); PER1 (sigma^2 = 1.1, l = 189 days, p = 1.3 days); SE2 (sigma^2 = 0.18, l = 331 days); PER2 (sigma^2 = 0.6, l = 17 days, p = 174 days); LIN (sigma^2 = 0.17, loc = year 2006.1).
Tidal: SE (sigma^2 = 2.32, l = 82.5 days); PER1 (sigma^2 = 5.17, l = 126 days, p = 0.538 days); LIN (sigma^2 = 0.18, loc = year 2015); PER2 (sigma^2 = 0.8, l = 5974 days, p = 0.5 days); PER3 (sigma^2 = 0.21, l = 338 days, p = 14.6 days).

The SE1 x PER1 + SE2 x (PER2 + LIN) kernel found by SKC with m = 16 and its hyperparameters are summarised under GEFCOM in Table 1. Note that with only 16 inducing points, SKC has successfully found the daily periodicity (PER1) and the 6-month periodicity (PER2).
The SE1 kernel in the first additive term suggests a local periodicity, exhibiting the irregular observation pattern that is repeated daily. Also note that the hyperparameter corresponding to the longer periodicity is close to but not exactly half a year, owing to the noisiness of the periodicity that is apparent in the data. Moreover the magnitude of the second additive term SE2 x PER2, which contains the longer periodicity, is 0.18 x 0.6, which is much smaller than the magnitude of the first additive term SE1 x PER1. This explicitly shows that the second term has less effect than the first term on the behaviour of the time series, hence the weaker long-term periodicity.

Running the kernel search just using the LB with the same hyperparameters as SKC, the algorithm is only able to detect the daily periodicity and misses the 6-month periodicity. This is consistent with the observation that the LB is loose compared to the UB, especially for bigger data sets, and hence fails to evaluate the kernels correctly. We also tried selecting a random subset of the data (sizes 32, 64, 128, 256) and running CKS. Even for a subset of size 256, the algorithm could not find the daily periodicity. None of the four subset sizes were able to make it past depth 4 in the kernel search, struggling to find sophisticated structure.

Tidal. We also run SKC on tidal data, namely the sea level measurements in Dover from 2013/1/3 to 2016/8/31 [1]. This is an hourly time series, with each observation taken to be the mean of four readings taken every 15 minutes, giving N = 31,957.

Figure 6: Top: plot of the full tidal data (sea level elevation against year). Bottom: zoom in on the first 4 weeks (sea level elevation against days since Jan 2013).

Looking at the bottom of Figure 6, we can find clear 12-hour periodicities and amplitudes that follow slightly noisy bi-weekly periodicities. The shorter periods, called semi-diurnal tides, are on average 12 hours 25 minutes, around 0.518 days long, and the bi-weekly periods arise due to gravitational effects of the moon [22]. The SE x PER1 x LIN x PER2 x PER3 kernel found by SKC with m = 64 and its hyperparameters are summarised under Tidal in Table 1. The PER1 x PER3 kernel precisely corresponds to the doubly periodic structure in the data, whereby we have semi-daily periodicities with bi-weekly periodic amplitudes. The SE kernel has a length scale of 82.5 days, giving a local periodicity. This represents the irregularity of the amplitudes, as can be seen on the bottom plot of Figure 6 between days 2 and 4. The LIN kernel has a large magnitude, but its effect is negligible when multiplied in, since the slope is calculated to be very small (see Appendix K for the slope calculation formula). It essentially has the role of raising the magnitude of the resulting kernel and hence representing noise in the data. The PER2 kernel is also negligible due to its high length scale and small magnitude, indicating that the amplitude of the periodicity due to this term is very small. Hence the term has minimal effect on the kernel. The kernel search using just the LB fails to proceed past depth 3, since it is unable to find a kernel with a higher LB on the BIC than at the previous depth (so the extra penalty incurred by increasing model complexity outweighs the increase in the LB of the log marginal likelihood). CKS on a random subset of the data similarly fails, the kernel search halting at depth 2 even for random subsets as large as size 256. In both cases, SKC is able to detect the structure of the data and provide accurate numerical estimates of its features, all with many fewer inducing points than N, whereas the kernel search using just the LB or a random subset of the data both fail.

5 Conclusion and Discussion

We have introduced SKC, a scalable kernel discovery algorithm that extends CKS, and hence ABCD, to bigger data sets. We have also derived a novel cheap upper bound on the GP marginal likelihood that sandwiches the marginal likelihood with the variational lower bound [37], and use this interval in SKC for selecting between different kernels.
The reasons for using an upper bound instead of just the lower bound for model selection are as follows: the upper bound allows for a semi-greedy approach, letting us explore a wider range of kernels and compensating for the suboptimality coming from local optima in the hyperparameter optimisation. Should we wish to restrict the range of kernels explored for computational efficiency, the upper bound is tighter and more stable than the lower bound, hence we may use the upper bound as a reliable tie-breaker for kernels with overlapping BIC intervals. Equipped with this upper bound, our method can pinpoint global/local periodicities and linear trends in time series with tens of thousands of data points, which are well beyond the reach of its predecessor CKS. For future work we wish to make the algorithm even more scalable: for large data sets where quadratic runtime is infeasible, we can apply stochastic variational inference for GPs [15] to optimise the lower bound, the bottleneck of SKC, using mini-batches of data. Finding an upper bound that is cheap and tight enough for model selection would pose a challenge. Also, one drawback of the upper bound that could perhaps be resolved is that hyperparameter tuning by optimising the lower bound is difficult (see Appendix L). One other minor scope for future work is using more accurate estimates of the model evidence than the BIC.

A related paper uses the Laplace approximation instead of the BIC for kernel evaluation in a similar kernel search context [20]. However, the Laplace approximation adds an expensive Hessian term, for which it is unclear how one can obtain lower and upper bounds.

Acknowledgements

HK and YWT's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement. We would also like to thank Michalis Titsias for helpful discussions.

References

[1] British Oceanographic Data Centre, UK Tide Gauge Network. data_systems/sea_level/uk_tide_gauge_network/processed/.
[2] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[3] Rémy Bardenet and Michalis Titsias. Inference for determinantal point processes without spectral knowledge. In NIPS, 2015.
[4] Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In NIPS, 2016.
[5] Matthew James Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University College London, 2003.
[6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approximation on learning accuracy. In AISTATS, 2010.
[8] Kurt Cutajar, Michael A. Osborne, John P. Cunningham, and Maurizio Filippone. Preconditioning kernel matrices. In ICML, 2016.
[9] Petros Drineas and Michael Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6, 2005.
[10] David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In ICML, 2013.
[11] Chris Fraley and Adrian E. Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24(2), 2007.
[12] Jacob Gardner, Chuan Guo, Kilian Weinberger, Roman Garnett, and Roger Grosse. Discovering and exploiting additive structure for Bayesian optimization. In AISTATS, 2017.
[13] Roger B. Grosse, Ruslan Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. Exploiting compositionality to explore a large space of model structures. In UAI, 2012.
[14] Isabelle Guyon, Imad Chaabane, Hugo Jair Escalante, Sergio Escalera, Damir Jajetic, James Robert Lloyd, Núria Macià, Bisakha Ray, Lukasz Romaszko, Michèle Sebag, Alexander Statnikov, Sébastien Treguer, and Evelyne Viegas. A brief review of the ChaLearn AutoML challenge: Any-time any-dataset learning without human intervention. In Proceedings of the Workshop on Automatic Machine Learning, volume 64, pages 21-30, 2016.
[15] James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. In UAI, 2013.
[16] Tao Hong, Pierre Pinson, and Shu Fan. Global energy forecasting competition 2012, 2014.
[17] Chunlin Ji, Haige Shen, and Mike West. Bounded approximations for marginal likelihoods. 2010.
[18] Judith Lean, Juerg Beer, and Raymond S. Bradley. Reconstruction of solar irradiance since 1610: Implications for climate change. Geophysical Research Letters, 22(23), 1995.
[19] James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In AAAI, 2014.
[20] Gustavo Malkomes, Charles Schaff, and Roman Garnett. Bayesian optimization for automated model selection. In NIPS, 2016.
[21] Alexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In AISTATS, 2016.
[22] John Morrissey, James L. Sumich, and Deanna R. Pinkard-Meier. Introduction to the Biology of Marine Life. Jones & Bartlett Learning.
[23] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[24] Junier B. Oliva, Avinava Dubey, Andrew G. Wilson, Barnabás Póczos, Jeff Schneider, and Eric P. Xing. Bayesian nonparametric kernel-learning. In AISTATS, 2016.
[25] Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6, 2005.
[26] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[27] Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[28] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In NIPS, 2015.
[29] Walter Rudin. Fourier Analysis on Groups. AMS.
[30] Yves-Laurent Kom Samo and Stephen Roberts. Generalized spectral kernels. arXiv preprint, 2015.
[31] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2), 1978.
[32] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In AISTATS, 2003.

[33] Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.
[34] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, 2005.
[35] Arno Solin and Simo Särkkä. Explicit link between periodic covariance functions and state space models. JMLR, 2014.
[36] Pieter Tans and Ralph Keeling. NOAA/ESRL, Scripps Institution of Oceanography. ftp://ftp.cmdl.noaa.gov/ccg/co2/trends/co2_mlo.txt.
[37] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.
[38] Pınar Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126-140, 2014. Data set: datasets/combined+cycle+power+plant.
[39] Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, and Aki Vehtari. GPstuff: Bayesian modeling with Gaussian processes. JMLR, 14(Apr), 2013.
[40] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[41] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. arXiv preprint, 2013.
[42] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint, 2015.
[43] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In AISTATS, 2016.
[44] I-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12), 1998. Data set: Concrete+Compressive+Strength.

Appendix

A Bayesian Information Criterion (BIC)

The BIC is a model selection criterion that is the marginal likelihood with a model complexity penalty:

BIC = log p(y | theta_hat) - (1/2) p log(N)

where theta_hat is the ML-II estimate of the p kernel hyperparameters and N is the number of observations.


More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

Correlated Bayesian Model Fusion: Efficient Performance Modeling of Large-Scale Tunable Analog/RF Integrated Circuits

Correlated Bayesian Model Fusion: Efficient Performance Modeling of Large-Scale Tunable Analog/RF Integrated Circuits Correlated Bayesian odel Fusion: Efficient Perforance odeling of Large-Scale unable Analog/RF Integrated Circuits Fa Wang and Xin Li ECE Departent, Carnegie ellon University, Pittsburgh, PA 53 {fwang,

More information

A method to determine relative stroke detection efficiencies from multiplicity distributions

A method to determine relative stroke detection efficiencies from multiplicity distributions A ethod to deterine relative stroke detection eiciencies ro ultiplicity distributions Schulz W. and Cuins K. 2. Austrian Lightning Detection and Inoration Syste (ALDIS), Kahlenberger Str.2A, 90 Vienna,

More information

paper prepared for the 1996 PTRC Conference, September 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL

paper prepared for the 1996 PTRC Conference, September 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL paper prepared for the 1996 PTRC Conference, Septeber 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL Nanne J. van der Zijpp 1 Transportation and Traffic Engineering Section Delft University

More information

HIGH RESOLUTION NEAR-FIELD MULTIPLE TARGET DETECTION AND LOCALIZATION USING SUPPORT VECTOR MACHINES

HIGH RESOLUTION NEAR-FIELD MULTIPLE TARGET DETECTION AND LOCALIZATION USING SUPPORT VECTOR MACHINES ICONIC 2007 St. Louis, O, USA June 27-29, 2007 HIGH RESOLUTION NEAR-FIELD ULTIPLE TARGET DETECTION AND LOCALIZATION USING SUPPORT VECTOR ACHINES A. Randazzo,. A. Abou-Khousa 2,.Pastorino, and R. Zoughi

More information

Multi-Scale/Multi-Resolution: Wavelet Transform

Multi-Scale/Multi-Resolution: Wavelet Transform Multi-Scale/Multi-Resolution: Wavelet Transfor Proble with Fourier Fourier analysis -- breaks down a signal into constituent sinusoids of different frequencies. A serious drawback in transforing to the

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

An improved self-adaptive harmony search algorithm for joint replenishment problems

An improved self-adaptive harmony search algorithm for joint replenishment problems An iproved self-adaptive harony search algorith for joint replenishent probles Lin Wang School of Manageent, Huazhong University of Science & Technology zhoulearner@gail.co Xiaojian Zhou School of Manageent,

More information

A remark on a success rate model for DPA and CPA

A remark on a success rate model for DPA and CPA A reark on a success rate odel for DPA and CPA A. Wieers, BSI Version 0.5 andreas.wieers@bsi.bund.de Septeber 5, 2018 Abstract The success rate is the ost coon evaluation etric for easuring the perforance

More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction TWO-STGE SMPLE DESIGN WITH SMLL CLUSTERS Robert G. Clark and David G. Steel School of Matheatics and pplied Statistics, University of Wollongong, NSW 5 ustralia. (robert.clark@abs.gov.au) Key Words: saple

More information

arxiv: v2 [cs.lg] 30 Mar 2017

arxiv: v2 [cs.lg] 30 Mar 2017 Batch Renoralization: Towards Reducing Minibatch Dependence in Batch-Noralized Models Sergey Ioffe Google Inc., sioffe@google.co arxiv:1702.03275v2 [cs.lg] 30 Mar 2017 Abstract Batch Noralization is quite

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

Research in Area of Longevity of Sylphon Scraies

Research in Area of Longevity of Sylphon Scraies IOP Conference Series: Earth and Environental Science PAPER OPEN ACCESS Research in Area of Longevity of Sylphon Scraies To cite this article: Natalia Y Golovina and Svetlana Y Krivosheeva 2018 IOP Conf.

More information

Predictive Vaccinology: Optimisation of Predictions Using Support Vector Machine Classifiers

Predictive Vaccinology: Optimisation of Predictions Using Support Vector Machine Classifiers Predictive Vaccinology: Optiisation of Predictions Using Support Vector Machine Classifiers Ivana Bozic,2, Guang Lan Zhang 2,3, and Vladiir Brusic 2,4 Faculty of Matheatics, University of Belgrade, Belgrade,

More information

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x) 7Applying Nelder Mead s Optiization Algorith APPLYING NELDER MEAD S OPTIMIZATION ALGORITHM FOR MULTIPLE GLOBAL MINIMA Abstract Ştefan ŞTEFĂNESCU * The iterative deterinistic optiization ethod could not

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Variational Bayes for generic topic models

Variational Bayes for generic topic models Variational Bayes for generic topic odels Gregor Heinrich 1 and Michael Goesele 2 1 Fraunhofer IGD and University of Leipzig 2 TU Darstadt Abstract. The article contributes a derivation of variational

More information

On Hyper-Parameter Estimation in Empirical Bayes: A Revisit of the MacKay Algorithm

On Hyper-Parameter Estimation in Empirical Bayes: A Revisit of the MacKay Algorithm On Hyper-Paraeter Estiation in Epirical Bayes: A Revisit of the MacKay Algorith Chune Li 1, Yongyi Mao, Richong Zhang 1 and Jinpeng Huai 1 1 School of Coputer Science and Engineering, Beihang University,

More information

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

Optimal nonlinear Bayesian experimental design: an application to amplitude versus offset experiments

Optimal nonlinear Bayesian experimental design: an application to amplitude versus offset experiments Geophys. J. Int. (23) 155, 411 421 Optial nonlinear Bayesian experiental design: an application to aplitude versus offset experients Jojanneke van den Berg, 1, Andrew Curtis 2,3 and Jeannot Trapert 1 1

More information

Bayesian Approach for Fatigue Life Prediction from Field Inspection

Bayesian Approach for Fatigue Life Prediction from Field Inspection Bayesian Approach for Fatigue Life Prediction fro Field Inspection Dawn An and Jooho Choi School of Aerospace & Mechanical Engineering, Korea Aerospace University, Goyang, Seoul, Korea Srira Pattabhiraan

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Efficient Filter Banks And Interpolators

Efficient Filter Banks And Interpolators Efficient Filter Banks And Interpolators A. G. DEMPSTER AND N. P. MURPHY Departent of Electronic Systes University of Westinster 115 New Cavendish St, London W1M 8JS United Kingdo Abstract: - Graphical

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS Jochen Till, Sebastian Engell, Sebastian Panek, and Olaf Stursberg Process Control Lab (CT-AST), University of Dortund,

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Lecture 9: Multi Kernel SVM

Lecture 9: Multi Kernel SVM Lecture 9: Multi Kernel SVM Stéphane Canu stephane.canu@litislab.eu Sao Paulo 204 April 6, 204 Roadap Tuning the kernel: MKL The ultiple kernel proble Sparse kernel achines for regression: SVR SipleMKL:

More information

Figure 1: Equivalent electric (RC) circuit of a neurons membrane

Figure 1: Equivalent electric (RC) circuit of a neurons membrane Exercise: Leaky integrate and fire odel of neural spike generation This exercise investigates a siplified odel of how neurons spike in response to current inputs, one of the ost fundaental properties of

More information

An Extension to the Tactical Planning Model for a Job Shop: Continuous-Time Control

An Extension to the Tactical Planning Model for a Job Shop: Continuous-Time Control An Extension to the Tactical Planning Model for a Job Shop: Continuous-Tie Control Chee Chong. Teo, Rohit Bhatnagar, and Stephen C. Graves Singapore-MIT Alliance, Nanyang Technological Univ., and Massachusetts

More information

LONG-TERM PREDICTIVE VALUE INTERVAL WITH THE FUZZY TIME SERIES

LONG-TERM PREDICTIVE VALUE INTERVAL WITH THE FUZZY TIME SERIES Journal of Marine Science and Technology, Vol 19, No 5, pp 509-513 (2011) 509 LONG-TERM PREDICTIVE VALUE INTERVAL WITH THE FUZZY TIME SERIES Ming-Tao Chou* Key words: fuzzy tie series, fuzzy forecasting,

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Geometrical intuition behind the dual problem

Geometrical intuition behind the dual problem Based on: Geoetrical intuition behind the dual proble KP Bennett, EJ Bredensteiner, Duality and Geoetry in SVM Classifiers, Proceedings of the International Conference on Machine Learning, 2000 1 Geoetrical

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information