Edinburgh Research Explorer


Supporting User-Defined Functions on Uncertain Data

Citation for published version:
Tran, TTL, Diao, Y, Sutton, CA & Liu, A 2013, 'Supporting User-Defined Functions on Uncertain Data', Proceedings of the VLDB Endowment (PVLDB), vol. 6, no. 6, pp.

Link: Link to publication record in Edinburgh Research Explorer

Document Version: Publisher's PDF, also known as Version of record

Published In: Proceedings of the VLDB Endowment (PVLDB)

General rights
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 2. Apr. 29

Supporting User-Defined Functions on Uncertain Data

Thanh T. L. Tran, Yanlei Diao, Charles Sutton, Anna Liu
University of Massachusetts, Amherst
University of Edinburgh

ABSTRACT

Uncertain data management has become crucial in many sensing and scientific applications. As user-defined functions (UDFs) become widely used in these applications, an important task is to capture result uncertainty for queries that evaluate UDFs on uncertain data. In this work, we provide a general framework for supporting UDFs on uncertain data. Specifically, we propose a learning approach based on Gaussian processes (GPs) to compute approximate output distributions of a UDF when evaluated on uncertain input, with guaranteed error bounds. We also devise an online algorithm to compute such output distributions, which employs a suite of optimizations to improve accuracy and performance. Our evaluation using both real-world and synthetic functions shows that our proposed GP approach can outperform the state-of-the-art sampling approach with up to two orders of magnitude improvement for a variety of UDFs.

1. INTRODUCTION

Uncertain data management has become crucial in many applications including sensor networks [9], object tracking and monitoring [22], severe weather monitoring [3], and digital sky surveys [2]. Data uncertainty arises due to a variety of reasons such as measurement errors, incomplete observations, and using inference to recover missing information [22]. When such data is processed via queries, its uncertainty propagates to processed results. The ability to capture the result uncertainty is important to the end user for interpreting the derived information appropriately. For example, knowing only the mean of a distribution for the result cannot distinguish between a sure event and a highly uncertain event, which may result in wrong decision making; knowing more about the distribution can help the user avoid such misunderstanding and ill-informed actions.

Recent work on uncertain data management has studied intensively relational query processing on uncertain data (e.g., [4, 7, 6, 9, 2, 23]).
Our work, however, is motivated by the observation that real-world applications, such as scientific computing and financial analysis, make intensive use of user-defined functions (UDFs) that process and analyze the data using complex, domain-specific algorithms. In practice, UDFs can be provided in any form of external code, e.g., C programs, and hence are treated mainly as black boxes in traditional databases. These UDFs are often expensive to compute due to the complexity of processing.

Unfortunately, the support for UDFs on uncertain data is largely lacking in today's data management systems. Consequently, in the tornado detection application [3], detection errors cannot be distinguished from true events due to the lack of associated confidence scores. In other applications such as computational astrophysics [2], the burden of characterizing UDF result uncertainty is imposed on the programmers: we observed that the programmers of the Sloan digital sky surveys manually code algorithms to keep track of uncertainty in a number of UDFs. These observations have motivated us to provide system support to automatically capture result uncertainty of UDFs, hence freeing users from the burden of doing so and returning valuable information for interpreting query results appropriately.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 6. Copyright 2013 VLDB Endowment.

More concretely, let us consider two examples of UDFs in the Sloan digital sky surveys (SDSS) [2].
In SDSS, nightly observations of stars and galaxies are inherently noisy as the objects can be too dim to be recognized in a single image. However, repeated observations allow the scientists to model the position, brightness, and color of objects using continuous distributions, which are commonly Gaussian distributions. Assume the processed data is represented as (objID, pos^p, redshift^p, ...) where pos and redshift are uncertain attributes. Then, queries can be issued to detect features or properties of the objects. We consider some example UDFs from an astrophysics package []. Query Q1 below computes the age of each galaxy given its redshift using the UDF GalAge. Since redshift is uncertain, the output GalAge(redshift) is also uncertain.

Q1: Select G.objID, GalAge(G.redshift)
    From Galaxy G

A more complex example of using UDFs is shown in query Q2, which computes the comoving volume of two galaxies whose distance is in some specific range. This query invokes two UDFs, ComoveVol and Distance, on uncertain attributes redshift and pos respectively, together with a selection predicate on the output of the UDF Distance.

Q2: Select G1.objID, G2.objID, Distance(G1.pos, G2.pos), ComoveVol(G1.redshift, G2.redshift, AREA)
    From Galaxy AS G1, Galaxy AS G2
    Where Distance(G1.pos, G2.pos) ∈ [l, u]

Problem Statement. In this work, we aim to provide a general framework to support UDFs on uncertain data, where the functions are given as black boxes. Specifically, given an input tuple modeled by a vector of random variables $X$, which is characterized by a joint distribution (either continuous or discrete), and a univariate, black-box UDF $f$, our objective is to characterize the distribution of $Y = f(X)$. In the example of Q2, after the join between G1 and G2, each tuple carries a random vector, $X$ = {G1.pos, G1.redshift, G2.pos, G2.redshift, ...}, and two UDFs produce $Y_1$ = Distance(G1.pos, G2.pos) and $Y_2$ = ComoveVol(G1.redshift, G2.redshift, AREA).

Given the nature of our UDFs, exact derivation of result distributions may not be feasible, and hence approximation techniques will be explored. A related requirement is that the proposed solution must be able to meet user-specified accuracy goals. In addition, the proposed solution must be able to perform efficiently in an online fashion, for example, to support online interactive analysis over a large data set or data processing on real-time streams (e.g., to detect tornados or anomalies in sky surveys).

Challenges. Supporting UDFs as stated above poses a number of challenges:

(1) UDFs are often computationally expensive. For such UDFs, any processing that incurs repeated function evaluation to compute the output will take a long time to complete.

(2) When an input tuple has uncertain values, computing a UDF on them will produce a result with uncertainty, which is characterized by a distribution. Computing the result distribution, even when the function is known, is a non-trivial problem. Existing work in statistical machine learning (surveyed in [5]) uses regression to estimate a function, but mostly focuses on deterministic input. For uncertain input, existing work [2] computes only the mean and variance of the result, instead of the full distribution, and hence is of limited use if this distribution is not Gaussian (which is often the case). Other work [5] computes approximate result distributions without bounding approximation errors, thus not addressing user accuracy requirements.

(3) Further, most of our target applications require using an online algorithm to characterize result uncertainty of a UDF, where "online" means that the algorithm does not need an offline training phase before processing data. Relevant machine learning techniques such as [2, 5] belong to offline algorithms. In addition, a desirable online algorithm should operate with high performance in order to support online interactive analysis or data stream processing.

Contributions.
In this paper, we present a complete framework for handling user-defined functions on uncertain data. Specifically, our main contributions include:

1. An approximate evaluation framework (§2): We propose a carefully-crafted approximation framework for computing UDFs on uncertain data, including approximation metrics and objectives. These metrics, namely discrepancy and KS measures, are a natural fit for range queries and intuitive to interpret. While many approximation metrics exist in the statistics literature, our choices of the metrics and objectives combined allow us to provide new theoretical results regarding the error bounds of output distributions.

2. Computing output distributions with error bounds (§3 and §4): We employ an approach of modeling black-box UDFs using a machine learning technique called Gaussian processes (GPs). We choose this technique due to its abilities to model functions and quantify the approximation in such function modeling. Given the GP model of a UDF and uncertain input, our contribution lies in computing output distributions with error bounds. In particular, we provide an algorithm that combines the GP model of a UDF and Monte Carlo (MC) sampling to compute output distributions. We perform an in-depth analysis of the algorithm and derive new theoretical results for quantifying the approximation of the output, including bounding the errors of both approximation of the UDF and sampling from input distributions. These error bounds can be used to tune our model to meet accuracy requirements. To the best of our knowledge, this work is the first to quantify output distributions of Gaussian processes.

3. An optimized online algorithm (§5): We further propose an online algorithm to compute approximate output distributions that satisfy user accuracy requirements. Our algorithm employs a suite of optimizations of the GP learning and inference modules to improve performance and accuracy.
Specifically, we propose local inference to increase inference speed while maintaining high accuracy, online tuning to refine function modeling and adapt to input data, and an online retraining strategy to minimize the training overhead. Existing work in machine learning [2, 5, 4, 7] does not provide a sufficient solution to such high-performance online training and inference while meeting user-specified accuracy requirements.

4. Evaluation (§6): We conduct a thorough evaluation of our proposed techniques using both synthetic functions with controlled properties, and real functions from the astrophysics domain. Results show that our GP techniques can adapt to various function complexities, data characteristics, and user accuracy goals. Compared to MC sampling, our approach starts to outperform when function evaluation takes more than ms for low-dimensional functions, e.g., up to 2 dimensions, or when function evaluation takes more than ms for high-dimensional ones, e.g., dimensions. This result applies to real-world expensive functions as we show using the real UDFs from astrophysics. For the UDFs tested, the GP approach can offer up to two orders of magnitude speedup over MC sampling.

2. AN APPROXIMATION FRAMEWORK

In this section, we first propose a general approximate evaluation framework, and then present a baseline approach based on Monte Carlo sampling to compute output distributions of UDFs.

2.1 Approximation Metrics and Objectives

Since UDFs are given as black boxes and have no explicit formula, computing the output of the UDFs can be done only through function evaluation. For uncertain input, computing the exact distribution requires function evaluation at all possible input values, which is impossible when the input is continuous. In this work, we seek approximation algorithms to compute the output distribution given uncertain input. We now present our approximation framework including accuracy metrics and objectives.

We adopt two distance metrics between random variables from the statistics literature []: the discrepancy and Kolmogorov-Smirnov (KS) measures.
We choose these metrics because they are a natural fit for range queries, hence allowing easy interpretation of the output.

Definition 1 (Discrepancy measure). The discrepancy measure, $D$, between two random variables $Y$ and $Y'$ is defined as:
$$D(Y, Y') = \sup_{a, b:\, a \le b} \left| \Pr[Y \in [a, b]] - \Pr[Y' \in [a, b]] \right|.$$

Definition 2 (KS measure). The KS measure (or distance) between two random variables $Y$ and $Y'$ is defined as:
$$KS(Y, Y') = \sup_{y} \left| \Pr[Y \le y] - \Pr[Y' \le y] \right|.$$

The values of both measures are in $[0, 1]$. It is straightforward to show that $D(Y, Y') \le 2\,KS(Y, Y')$. Both measures can be computed directly from the cumulative distribution functions (CDFs) of $Y$ and $Y'$ to capture their maximum difference. The KS distance considers all one-sided intervals, i.e., $[-\infty, c]$ or $[c, \infty]$, while the discrepancy measure considers all two-sided intervals $[a, b]$.

In practice, users may be interested only in intervals of length at least $\lambda$, an application-specific error level that is tolerable for the computed quantity. This suggests a relaxed variant of the discrepancy measure, as follows.

Definition 3 ($\lambda$-discrepancy). Given the minimum interval length $\lambda$, the discrepancy measure $D_\lambda$ between two random variables $Y$ and $Y'$ is:
$$D_\lambda(Y, Y') = \sup_{a, b:\, b - a \ge \lambda} \left| \Pr[Y \in [a, b]] - \Pr[Y' \in [a, b]] \right|.$$

This measure can be interpreted as: for all intervals of length at least $\lambda$, the probability of an interval under $Y$ does not differ from that under $Y'$ by more than $D_\lambda$. These distance metrics can be used to indicate how well one random variable $Y'$ approximates another random variable $Y$.

We next state our approximation objective, $(\epsilon, \delta)$-approximation, using the discrepancy metric; similar definitions hold for the $\lambda$-discrepancy and the KS metric.

Definition 4 ($(\epsilon, \delta)$-approximation). Let $Y$ and $Y'$ be two random variables. Then $Y'$ is an $(\epsilon, \delta)$-approximation of $Y$ iff with probability $(1 - \delta)$, $D(Y, Y') \le \epsilon$.

For query Q1, $(\epsilon, \delta)$-approximation requires that with probability $(1 - \delta)$, the approximate distribution of GalAge(G.redshift) does not differ from the true one by more than $\epsilon$ in discrepancy. For Q2, there is a selection predicate in the WHERE clause, which truncates the distribution of Distance(G1.pos, G2.pos) to the region $[l, u]$, and hence yields a tuple existence probability (TEP). Then, $(\epsilon, \delta)$-approximation requires that with probability $(1 - \delta)$, (i) the approximate distribution of Distance(G1.pos, G2.pos) differs from the true distribution by at most $\epsilon$ in discrepancy measure, and (ii) the result TEP differs from the true TEP by at most $\epsilon$.

2.2 A Baseline Approach

We now present a simple, standard technique to compute the query results based on Monte Carlo (MC) simulation. However, as we will see, this approach may require evaluating the UDF many times, which is inefficient for slow UDFs. This inefficiency is the motivation for our new approach presented in Sections 3-5.

A. Computing the Output Distribution. In recent work [23], we use Monte Carlo simulation to compute the output distribution of aggregates on uncertain input. This technique can also be used to compute any UDF $Y = f(X)$. The idea is simple: draw the samples from the input distribution, and perform function evaluation to get the output samples. The algorithm is as follows.

Algorithm 1 Monte Carlo simulation
1: Draw $m$ samples $x_1, \ldots, x_m \sim p(x)$.
2: Compute the output samples, $y_1 = f(x_1), \ldots, y_m = f(x_m)$.
3: Return the empirical CDF of the output samples, namely $Y'$, $\Pr(Y' \le y) = \frac{1}{m} \sum_{i \in [1..m]} \mathbb{1}_{[y_i, \infty)}(y)$, where $\mathbb{1}(\cdot)$ is the indicator function.
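The three steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `toy_udf` and `draw_gaussian` are hypothetical stand-ins for an expensive black-box UDF and for the input distribution $p(x)$, and the sample count is chosen arbitrarily.

```python
import bisect
import random


def monte_carlo_outputs(udf, draw_input, m, rng):
    """Steps 1-2 of Algorithm 1: draw m inputs and evaluate the UDF on each."""
    return sorted(udf(draw_input(rng)) for _ in range(m))


def empirical_cdf(sorted_outputs, y):
    """Step 3: Pr(Y' <= y) = (1/m) * #{i : y_i <= y}."""
    return bisect.bisect_right(sorted_outputs, y) / len(sorted_outputs)


# Hypothetical stand-ins (assumptions, not from the paper):
# a cheap "UDF" and a Gaussian input X ~ N(1, 0.5^2).
def toy_udf(x):
    return x * x


def draw_gaussian(rng):
    return rng.gauss(1.0, 0.5)


rng = random.Random(0)
outputs = monte_carlo_outputs(toy_udf, draw_gaussian, 2000, rng)
prob_in_interval = empirical_cdf(outputs, 2.0) - empirical_cdf(outputs, 0.5)
print(f"Pr[Y' <= 1.0] ~ {empirical_cdf(outputs, 1.0):.3f}")
print(f"Pr[0.5 < Y' <= 2.0] ~ {prob_in_interval:.3f}")
```

Keeping the output samples sorted makes every CDF lookup a binary search, so range probabilities of the kind used by the discrepancy measure cost two lookups each. The cost that matters in practice is the `m` calls to the UDF.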
It is shown in [23] that if $m = \ln(2\delta^{-1})/(2\epsilon^2)$, then the output $Y'$ is an $(\epsilon, \delta)$-approximation of $Y$ in terms of KS measure, and $(2\epsilon, \delta)$-approximate in terms of discrepancy measure. Thus, the number of samples required to reach the accuracy requirement $\epsilon$ is proportional to $1/\epsilon^2$, which is large for small $\epsilon$. For example, if we use the discrepancy measure and set $\epsilon = 0.02$, $\delta = 0.05$, then $m$ required is more than 18000.

B. Filtering with Selection Predicates. In many applications, users are interested in the event that the output is in certain intervals. This can be expressed with a selection predicate, e.g., $f(X) \in [a, b]$, as shown in query Q2. When the probability $\rho = \Pr[f(X) \in [a, b]]$ is smaller than a user-specified threshold $\theta$, this corresponds to an event of little interest and can be discarded. For high performance, we would like to quickly check whether $\rho < \theta$ for filtering, which in turn saves the cost of computing the full distribution of $f(X)$.

While drawing the samples as in Algorithm 1, we derive a confidence interval for $\rho$ to decide whether to filter. By definition we have $\rho = \int \mathbb{1}(a \le f(x) \le b)\, p(x)\, dx$. Let $h(x) = \mathbb{1}(a \le f(x) \le b)$ and $\tilde{m}$ be the number of samples drawn so far ($\tilde{m} \le m$). And let $\{h_i \mid i = 1, \ldots, \tilde{m}\}$ be the samples evaluated on $h(x)$. Then, the $h_i$ are iid Bernoulli samples, and $\rho$ can be estimated by $\tilde{\rho}$, computed from the samples, $\tilde{\rho} = \frac{1}{\tilde{m}} \sum_{i=1}^{\tilde{m}} h_i$. The following result, which can be derived from Hoeffding's inequality in statistics, gives a confidence interval for $\rho$.

Remark 2.1. With probability $(1 - \delta)$, $\rho \in [\tilde{\rho} - \epsilon, \tilde{\rho} + \epsilon]$, where $\epsilon = \sqrt{\frac{1}{2\tilde{m}} \ln \frac{2}{\delta}}$.

If the user specifies a threshold $\theta$ to filter low-probability events, and $\tilde{\rho} + \epsilon < \theta$, then we can drop this tuple from the output.

3. EMULATING UDFS WITH GAUSSIAN PROCESSES

In the next three sections, we present an approach that aims to be more efficient than MC sampling by requiring many fewer calls to the UDF. The main idea is that every time we call the UDF, we gain information about the function.
Once we have called the UDF enough times, we ought to be able to approximate it by interpolating between the known values to predict the UDF at unknown values. We call this predictor an emulator $\hat{f}$, which can be used in place of the original UDF $f$, and is much less expensive for many UDFs.

We briefly mention how to build the emulator using a statistical learning approach. The idea is that, if we have a set of function input-output pairs, we can use it as training data to estimate $f$. In principle, we could build the emulator using any regression procedure from statistics or machine learning, but picking a simple method like linear regression would work poorly on a UDF that did not meet the strong assumptions of that method. Instead, we build the emulator using a learning approach called Gaussian processes (GPs). GPs have two key advantages. First, GPs are flexible methods that can represent a wide range of functions and do not make strong assumptions about the form of $f$. Second, GPs produce not only a prediction $\hat{f}(x)$ for any point $x$ but also a probabilistic confidence that provides error bars on the prediction. This is vital because we can use this to adapt the training data to meet the user-specified error tolerance.

Building an emulator using a GP is a standard technique in the statistics literature; see [5] for an overview. In this section, we provide background on the basic approach to building emulators. In Section 4, we extend to uncertain inputs and aim to quantify the uncertainty of outputs of UDFs. We then propose an online algorithm to compute UDFs and various optimizations to address accuracy and performance requirements in Section 5.

3.1 Intuition for GPs

We give a quick introduction to the use of GPs as emulators, closely following the textbook [8]. A GP is a distribution over functions; whenever we sample from a GP, we get an entire function for $f$ whose output is the real line. Fig. 1(a) illustrates this in one dimension. It shows three samples from a GP, where each is a function $\mathbb{R} \to \mathbb{R}$. Specifically, if we pick any input $x$, then $f(x)$ is a scalar random variable.
This lets us get confidence estimates, because once we have a scalar random variable, we can get a confidence interval in the standard way, e.g., mean ± 2 × standard deviation. To use this idea for regression, notice that since $f$ is random, we can also define conditional distributions over $f$, in particular, the conditional distribution of $f$ given a set of training points. This new distribution over functions is called the posterior distribution, and it is this distribution that lets us predict new values.

[Figure 1: Example of GP regression. (a) Prior functions; (b) posterior functions conditioning on training data. Axes: input $x$, output $f(x)$.]

3.2 Definition of GPs

Just as the multivariate Gaussian is an analytically tractable distribution over vectors, the Gaussian process is an analytically tractable distribution over functions. Just as a multivariate Gaussian is defined by a mean and covariance matrix, a GP is defined by a mean function and a covariance function. The mean function $m(x)$ gives the average value $E[f(x)]$ for all inputs $x$, where the expectation is taken over the random function $f$. The covariance function $k(x, x')$ returns the covariance between the function values at two input points, i.e., $k(x, x') = \mathrm{Cov}(f(x), f(x'))$.

A GP is a distribution over functions with a special property: if we fix any vector of inputs $(x_1, \ldots, x_n)$, the output vector $\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_n))$ has a multivariate Gaussian distribution. Specifically, $\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$, where $\mathbf{m}$ is the vector $(m(x_1), \ldots, m(x_n))$ containing the mean function evaluated at all the inputs and $K$ is a matrix of covariances $K_{ij} = k(x_i, x_j)$ between all the input pairs.

The covariance function has a vital role. Recall that the idea was to approximate $f$ by interpolating between its values at nearby points. The covariance function helps determine which points are "nearby". If two points are far away, then their function values should be only weakly related, i.e., their covariance should be near 0. On the other hand, if two points are nearby, then their covariance should be large in magnitude. We accomplish this by using a covariance function that depends on the distance between the input points.

In this work, we use standard choices for the mean and covariance functions. We choose the mean function $m(x) = 0$, which is a standard choice when we have no prior information about the UDF. For the covariance function, we use the squared exponential one, which in its simplest form is
$$k(x, x') = \sigma_f^2 \exp\!\left(-\frac{1}{2l^2} \|x - x'\|^2\right),$$
where $\|\cdot\|$ is Euclidean distance, and $\sigma_f^2$ and $l$ are its parameters. The signal variance $\sigma_f^2$ primarily determines the variance of the function value at individual points, i.e., $x = x'$. More important is the lengthscale $l$, which determines how rapidly the covariance decays as $x$ and $x'$ move farther apart. If $l$ is small, the covariance decays rapidly, so sample functions from the resulting GP will have many small bumps; if $l$ is large, then these functions will tend to be smoother.
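The effect of the lengthscale on the squared exponential covariance can be seen by evaluating it on a few points. The sketch below is illustrative only; the function names and the sample points are assumptions introduced here, not part of the paper.

```python
import math


def sq_exp_kernel(x, x_prime, signal_var=1.0, lengthscale=1.0):
    """k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 l^2)) for vectors x, x'."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return signal_var * math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def covariance_matrix(xs_a, xs_b, **params):
    """Matrix of covariances K_ij = k(xs_a[i], xs_b[j])."""
    return [[sq_exp_kernel(a, b, **params) for b in xs_b] for a in xs_a]


# Hypothetical one-dimensional inputs: two nearby points and one far away.
points = [(0.0,), (0.5,), (3.0,)]
for lengthscale in (0.3, 3.0):
    K = covariance_matrix(points, points, signal_var=2.0, lengthscale=lengthscale)
    print(f"l = {lengthscale}")
    for row in K:
        print("  ", [round(v, 4) for v in row])
```

With the small lengthscale, even the two nearby points are only weakly correlated; with the large one, all three points covary strongly, which is what makes the sampled functions smoother.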
The key assumption made by GP modeling is that at any point $x$, the function value $f(x)$ can be accurately predicted using the function values at nearby points. GPs are flexible enough to model different types of functions by using an appropriate covariance function [8]. For instance, for smooth functions, squared-exponential covariance functions work well; for less smooth functions, Matern covariance functions work well (where smoothness is defined by "mean-squared differentiability"). In this paper, we focus on the common squared-exponential functions, which are shown experimentally to work well for the UDFs in our applications (see §6.4). In general, the user can choose a suitable covariance function based on the well-defined properties of UDFs, and plug it into our framework.

3.3 Inference for New Input Points

We next describe how to use a GP to predict the function outputs at new inputs. Denote the training data by $X^* = \{x_i^* \mid i = 1, \ldots, n\}$ for the inputs and $\mathbf{f}^* = \{f_i^* \mid i = 1, \ldots, n\}$ for the function values. In this section, we assume that we are told a fixed set of $m$ test inputs $X = (x_1, x_2, \ldots, x_m)$ at which we wish to predict the function values. Denote the unknown function values at the test points by $\mathbf{f} = (f_1, f_2, \ldots, f_m)$. The vector $(\mathbf{f}^*, \mathbf{f})$ is a random vector because each $f_i$, $i = 1, \ldots, m$, is random, and by the definition of a GP, this vector simply has a multivariate Gaussian distribution. This distribution is:
$$\begin{bmatrix} \mathbf{f}^* \\ \mathbf{f} \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0},\; \begin{bmatrix} K(X^*, X^*) & K(X^*, X) \\ K(X, X^*) & K(X, X) \end{bmatrix} \right), \quad (1)$$
where we have written the covariances as a matrix with four blocks. The block $K(X^*, X)$ is an $n \times m$ matrix of the covariances between all training and test points, i.e., $K(X^*, X)_{ij} = k(x_i^*, x_j)$. The blocks $K(X^*, X^*)$, $K(X, X)$, and $K(X, X^*)$ are defined analogously.

Now that we have a joint distribution, we can predict the unknown test outputs $\mathbf{f}$ by computing the conditional distribution of $\mathbf{f}$ given the training data and test inputs.
Applying the standard formula for the conditional of a multivariate Gaussian yields:
$$\mathbf{f} \mid X, X^*, \mathbf{f}^* \sim \mathcal{N}(\mathbf{m}, \Sigma), \text{ where} \quad (2)$$
$$\mathbf{m} = K(X, X^*)\, K(X^*, X^*)^{-1}\, \mathbf{f}^*$$
$$\Sigma = K(X, X) - K(X, X^*)\, K(X^*, X^*)^{-1}\, K(X^*, X)$$

To interpret $\mathbf{m}$ intuitively, imagine that $m = 1$, i.e., we wish to predict only one output. Then $K(X, X^*) K(X^*, X^*)^{-1}$ is an $n$-dimensional vector, and the mean $\mathbf{m}(x)$ is the dot product of this vector with the training values $\mathbf{f}^*$. So $\mathbf{m}(x)$ is simply a weighted average of the function values at the training points. A similar intuition holds when there is more than one test point, $m > 1$.

Fig. 1(b) illustrates the resulting GP after conditioning on training data. As observed, the posterior functions pass through the training points marked by the black dots. The sampled functions also show that the further a point is from the training points, the larger the variance is.

We now consider the complexity of this inference step. Note that once the training data is collected, the inverse covariance matrix $K(X^*, X^*)^{-1}$ can be computed once, with a cost of $O(n^3)$. Then given a test point $x$ (or $X$ has size 1), inference involves computing $K(X, X^*)$ and multiplying matrices, which has a cost of $O(n^2)$. The space complexity is also $O(n^2)$, for storing these matrices.

3.4 Learning the Hyperparameters

Typically, the covariance functions have some free parameters, called hyperparameters, such as the lengthscale $l$ of the squared-exponential function. The hyperparameters determine how quickly the confidence estimates expand as test points move further from the training data. For example, in Fig. 1(b), if the lengthscale decreases, the spread of the function will increase, meaning that there is less confidence in the predictions.

We can learn the hyperparameters using the training data (see Chapter 5, [8]). We adopt maximum likelihood estimation (MLE), a standard technique for this problem. Let $\theta$ be the vector of hyperparameters. The log likelihood function is $L(\theta) := \log p(\mathbf{f}^* \mid X^*, \theta) = \log \mathcal{N}(X^*; \mathbf{m}, \Sigma)$; here we use $\mathcal{N}$ to refer to the density of the Gaussian distribution, and $\mathbf{m}$ and $\Sigma$ are defined in Eq. (2).
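For concreteness, the conditional in Eq. (2) can be sketched for a single test point with a zero prior mean and fixed hyperparameters. This is a minimal standard-library illustration under stated assumptions, not the paper's implementation: the helper names, the Cholesky-based solver (used here in place of an explicit matrix inverse), the small `jitter` term added for numerical stability, and the sine training data are all choices made for this example.

```python
import math


def sq_exp(x, y, signal_var=1.0, lengthscale=1.0):
    return signal_var * math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_spd(L, b):
    """Solve (L L^T) z = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z


def gp_predict(train_x, train_f, x, jitter=1e-9, **kern):
    """Posterior mean and variance at one test input x, following Eq. (2)."""
    K = [[sq_exp(a, b, **kern) + (jitter if i == j else 0.0)
          for j, b in enumerate(train_x)] for i, a in enumerate(train_x)]
    L = cholesky(K)                                   # O(n^3), reusable across x
    k_star = [sq_exp(x, a, **kern) for a in train_x]  # K(x, X*)
    alpha = solve_spd(L, train_f)                     # K(X*, X*)^{-1} f*
    v = solve_spd(L, k_star)                          # K(X*, X*)^{-1} K(X*, x)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = sq_exp(x, x, **kern) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)


# Hypothetical training data: five evaluations of sin(x).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_f = [math.sin(v) for v in train_x]
mean_near, var_near = gp_predict(train_x, train_f, 2.0)
mean_far, var_far = gp_predict(train_x, train_f, 10.0)
print(f"at a training point: mean={mean_near:.4f}, var={var_near:.2e}")
print(f"far from the data:   mean={mean_far:.4f}, var={var_far:.4f}")
```

The two predictions mirror the behaviour described for Fig. 1(b): the posterior passes through the training point with almost no variance, while far from the data it falls back to the prior mean with the full signal variance.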
MLE solves for the value of $\theta$ that maximizes $L(\theta)$. We use gradient descent, a standard method for this task. Its complexity is $O(n^3)$ due to the cost of inverting the matrix $K(X^*, X^*)$. Gradient descent requires many steps to compute the optimal $\theta$; thus, retraining often has a high cost for large numbers of training points. Note that when the training data $X^*$ changes, the $\theta$ that maximizes the log likelihood $L(\theta)$ may also change. Thus, one would need to maximize the log likelihood to update the hyperparameters. In §5.3, we will discuss retraining strategies that aim to reduce this computation cost.

4. UNCERTAINTY IN QUERY RESULTS

So far in our discussions of GPs, we have assumed that all the input values are known in advance. However, our work aims to compute UDFs on uncertain input. In this section, we describe how to compute output distributions using a GP emulator given uncertain input. We then derive theoretical results to bound the errors of the output using our accuracy metrics.

Table 1: The main notation used in GP techniques.
  $f$, $\hat{f}$, $\tilde{f}$ : true function, mean function of the GP, and a sample function of the GP, respectively.
  $f_L$, $f_S$ : upper and lower envelope functions of $\tilde{f}$ (with high probability).
  $Y$, $\hat{Y}$, $\tilde{Y}$ : output corresponding to $f$, $\hat{f}$, $\tilde{f}$, respectively.
  $Y_L$, $Y_S$ : output corresponding to $f_L$, $f_S$, respectively.
  $\hat{Y}'$ : estimate of $\hat{Y}$ using MC sampling (similarly $Y'_L$ and $Y'_S$ for $Y_L$ and $Y_S$).
  $\tilde{\rho}$, $\hat{\rho}$ : probability of $\tilde{Y}$ and $\hat{Y}$ in a given interval $[a, b]$.
  $\rho_U$, $\rho_L$ : upper and lower bounds of $\tilde{\rho}$ (with high probability).
  $\tilde{\rho}'$, $\hat{\rho}'$, $\rho'_U$, $\rho'_L$ : MC estimates of $\tilde{\rho}$, $\hat{\rho}$, $\rho_U$, and $\rho_L$, respectively.
  $n$ : number of training points.
  $m$ : number of MC samples.

[Figure 2: GP inference for uncertain input. (a) Computation steps; (b) approximate function with bounding envelope $\hat{f}(x) \pm z_\alpha \sigma(x)$; (c) computing probability for interval $[a, b]$ from CDFs.]

4.1 Computing the Output Distribution

We first describe how to approximate the UDF output $Y = f(X)$ given uncertain input $X$. When we approximate $f$ by the GP emulator $\hat{f}$, we have a new approximate output $\hat{Y} = \hat{f}(X)$, having CDF $\Pr[\hat{Y} \le y] = \int \mathbb{1}(\hat{f}(x) \le y)\, p(x)\, dx$. This integral cannot be computed analytically. Instead, a simple, offline algorithm is to use Monte Carlo integration by repeatedly sampling input values from $p(x)$. This is very similar to Algorithm 1, except that we call the emulator $\hat{f}$ rather than the UDF $f$, which is a cheaper operation for long-running UDFs. The algorithm is detailed below.

Algorithm 2 Offline algorithm using Gaussian processes
1: Collect $n$ training data points, $\{(x_i^*, y_i^*),\ i = 1..n\}$, by evaluating $y_i^* = f(x_i^*)$.
2: Learn a GP via training using the $n$ training data points, to get $GP(\hat{f}(\cdot), k(\cdot, \cdot))$.
3: For uncertain input $X \sim p(x)$:
4:   Draw $m$ samples, $x_1, \ldots, x_m$, from the distribution $p(x)$.
5:   Predict function values at the samples via GP inference to get $\{(\hat{f}(x_i), \sigma^2(x_i)),\ i = 1..m\}$.
6:   Construct the empirical CDF of $\hat{Y}$ from the samples, namely $\hat{Y}'$, $\Pr(\hat{Y}' \le y) = \frac{1}{m} \sum_{i \in [1..m]} \mathbb{1}_{[\hat{f}_i, \infty)}(y)$, and return $\hat{Y}'$.

In addition to returning the CDF of $\hat{Y}'$, we also want to return a confidence of how close $\hat{Y}'$ is to the true answer $Y$. Ideally, we would do this by returning the discrepancy metric, $D(\hat{Y}', Y)$. But it is difficult to evaluate $D(\hat{Y}', Y)$ without many calls to the UDF $f$, which would defeat the purpose of using emulators. So instead we ask a different question, which is feasible to analyze. The GP defines a posterior distribution over functions, and we are using the posterior mean as the best emulator. The question we ask is how different the query output would be if we emulated the UDF using a random function from the GP, rather than the posterior mean. If this difference is small, this means the GP's posterior distribution is very concentrated. In other words, the uncertainty in the GP modeling is small, and we do not need more training data.

To make this precise, let $\tilde{f}$ be a sample from the GP posterior distribution over functions, and define $\tilde{Y} = \tilde{f}(X)$ (see Fig. 2a for an illustration of these variables). That is, $\tilde{Y}$ represents the query output if we select the emulator randomly from the GP posterior distribution. The confidence estimate that we will return will be an upper bound on $D(\hat{Y}', \tilde{Y})$.

4.2 Error Bounds Using Discrepancy Measure

We now derive a bound on the discrepancy $D(\hat{Y}', \tilde{Y})$. An important point to note is that there are two sources of error here. The first is the error due to Monte Carlo sampling of the input and the second is the error due to the GP modeling. In the analysis that follows, we bound each source of error individually and then combine them to get a single error bound. To the best of our knowledge, this is the first work to quantify the output distributions of GPs.

The main idea is that we will compute a high probability envelope over the GP prediction.
That is, we will find tw functins f L and f S such that f S f f L with prbability at least ( α), fr a given α. Once we have this envelpe n f, then we als have a high prbability envelpe f Ỹ, and can use this t bund the discrepancy. Fig. 2 (parts b & c) gives an illustratin f this intuitin. Bunding Errr fr One Interval. T start, assume that we have already cmputed a high prbability envelpe. Since the discrepancy invlves a supremum ver intervals, we start by presenting upper and lwer bunds n ρ := Pr[Ỹ [a, b] f] fr a single fixed interval [a, b]. Nw, ρ is randm because f is; fr every different functin f we get frm the GP psterir, we get a different ρ. Fr any envelpe (f S, f L), e.g., having the frm ˆf(x)±zσ(x) as shwn in Fig. 2, define Y S = f S(X) and Y L = f L(X). We bund ρ (with high prbability) using Y S and Y L. Fr any tw functins g and h, and any randm vectr X, it is always true that g h implies that Pr[g(X) a] Pr[h(X) a] fr all a. Putting this tgether with f S f f L, we have that ρ = Pr[ f(x) b] Pr[ f(x) a] Pr[f S (X) b] Pr[f L (X) a] In ther wrds, this gives the upper bund: ρ ρ U := Pr[Y S b] Pr[Y L a] (3) Similarly, we can derive the lwer bund: ρ ρ L := max(, Pr[Y L b] Pr[Y S a]) (4) This is summarized in the fllwing result. Prpsitin 4. Suppse that f S and f L are tw functins such that f S f f L with prbability ( α). Then ρ L ρ ρ U, with prbability ( α), where ρ U and ρ L are as in Eqs. 3 and 4. Bunding λ-discrepancy. Nw that we have the errr bund fr ne individual interval, we use this t bund the λ-discrepancy D λ (Ỹ, Ŷ ). Using the bunds f ρ, we can write this discrepancy as D λ (Ỹ, Ŷ ) = sup ρ ˆρ sup max{ ρ L ˆρ, ρ U ˆρ }, [a,b] [a,b] where the inequality applies the result frm Prpsitin 4.. This is prgress, but we cannt cmpute ρ L, ρ U, r ˆρ exactly because they 473

require integrating over the input $X$. So we will use Monte Carlo integration once again. We compute $Y'_L$ and $Y'_S$, as MC estimates of $Y_L$ and $Y_S$ respectively, from the samples in Algorithm 2. We also define (but do not compute) $\tilde{Y}'$, the random variable resulting from MC approximation of $\tilde{Y}$ with the same samples. An identical argument to that of Proposition 4.1 shows that

$$D_\lambda(\tilde{Y}', \hat{Y}') = \sup_{[a,b]} |\rho' - \hat{\rho}'| \le \sup_{[a,b]} \max\{|\rho'_L - \hat{\rho}'|,\ |\rho'_U - \hat{\rho}'|\} := \epsilon_{GP},$$

where adding a prime means to use Monte Carlo estimates. Now we present an algorithm to compute $\epsilon_{GP}$. The easiest way would be to simply enumerate all possible intervals. Because $\hat{Y}'$, $Y'_S$, and $Y'_L$ are empirical CDFs over $m$ samples, there are $O(m^2)$ possible values for $\rho'_U$, $\rho'_L$, and $\hat{\rho}'$. This can be inefficient for large numbers of samples $m$, as we observed empirically. Instead, we present a more efficient algorithm to compute this error bound, as shown in Algorithm 3.

Algorithm 3 Compute λ-discrepancy error bound
1: Construct the empirical CDFs, $\hat{Y}'$, $Y'_S$ and $Y'_L$, from the output samples. Let $V$ be the set of values of these variables.
2: Precompute $\max_{b' \ge b} (\Pr[\hat{Y}' \le b'] - \Pr[Y'_L \le b'])$ and $\max_{b' \ge b} (\Pr[Y'_S \le b'] - \Pr[\hat{Y}' \le b'])$ $\forall b \in V$.
3: Consider values for $a$, s.t. $[a, a + \lambda]$ lies in the support of $\hat{Y}'$. $a$ is in $V$, enumerated from small to large.
4: For a given $a$:
(a) Get $\Pr[\hat{Y}' \le a]$, $\Pr[Y'_S \le a]$, and $\Pr[Y'_L \le a]$.
(b) Get $\max_{b \ge a+\lambda} (\Pr[Y'_S \le b] - \Pr[\hat{Y}' \le b])$. Find the smallest $b_0$ s.t. $\Pr[Y'_L \le b_0] \ge \Pr[Y'_S \le a]$, and then get $\max_{b \ge b_0} (\Pr[\hat{Y}' \le b] - \Pr[Y'_L \le b])$. This is done by using the precomputed values in Step 2.
(c) Compute $\max(\rho'_U - \hat{\rho}',\ \hat{\rho}' - \rho'_L)$ from the quantities in (a) and (b). This is the error bound for intervals starting with $a$.
5: Increase $a$, repeat step 4, and update the maximum error.
6: Return the maximum error for all $a$, which is $\epsilon_{GP}$.

The main idea is to (i) precompute the maximum differences between the mean function and each envelope function considering decreasing values of $b$ (Step 2), then (ii) enumerate the values of $a$ increasingly and use the precomputed values to bound $\rho$ for intervals starting with $a$ (Steps 3-5).
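For comparison with Algorithm 3, the brute-force enumeration mentioned above can be sketched directly from Eqs. 3 and 4. This is only an illustrative sketch under stated assumptions: the envelope has the form $\hat{f}(x) \pm z\sigma(x)$, interval endpoints are restricted to the sample values in $V$, and the function names `ecdf` and `discrepancy_bound_bruteforce` are hypothetical, not taken from the paper. It costs $O(|V|^2)$ time and memory, which is exactly the cost Algorithm 3 avoids.

```python
import numpy as np

def ecdf(samples, points):
    """Empirical CDF of `samples`, evaluated at every value in `points`."""
    s = np.sort(samples)
    return np.searchsorted(s, points, side="right") / len(s)

def discrepancy_bound_bruteforce(mean, std, z, lam):
    """Brute-force estimate of eps_GP over intervals [a, b] with b - a >= lam.

    mean, std: GP predictive mean and standard deviation at the m input samples.
    z: envelope half-width multiplier (assumed envelope: mean +/- z * std).
    Endpoints a, b are restricted to the observed output values V (an assumption).
    """
    y_hat = mean
    y_s = mean - z * std          # lower envelope output, Y'_S
    y_l = mean + z * std          # upper envelope output, Y'_L
    v = np.unique(np.concatenate([y_hat, y_s, y_l]))
    F_hat, F_s, F_l = ecdf(y_hat, v), ecdf(y_s, v), ecdf(y_l, v)

    # Entry [i, j] corresponds to the interval [a, b] = [v[i], v[j]].
    rho_hat = F_hat[None, :] - F_hat[:, None]
    rho_u = F_s[None, :] - F_l[:, None]                    # Eq. 3
    rho_l = np.maximum(0.0, F_l[None, :] - F_s[:, None])   # Eq. 4
    err = np.maximum(rho_u - rho_hat, rho_hat - rho_l)

    valid = (v[None, :] - v[:, None]) >= lam
    return float(err[valid].max()) if valid.any() else 0.0

# Example with synthetic predictions (values are illustrative only).
rng = np.random.default_rng(0)
mu = rng.normal(size=200)
sd = np.full(200, 0.05)
eps_gp = discrepancy_bound_bruteforce(mu, sd, z=2.0, lam=0.1)
```

With $z = 0$ the envelope collapses onto the mean function and the bound is zero; a wider envelope can only loosen $\rho'_U$ and $\rho'_L$.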
Precomputing these maxima involves taking a pass through the $m$ points in the empirical CDF of $\hat{Y}'$. Then for a given value of $a$, we use binary search to find the smallest $b_0$ s.t. $\Pr[Y'_L \le b_0] \ge \Pr[Y'_S \le a]$. The complexity of this algorithm is $O(m \log m)$. More details are available in [24].

Combining Effects of Two Sources of Error. What we return to the users is the distribution of $\hat{Y}'$, from which $\hat{\rho}'$ can be computed for any interval. As noted, there are two sources of error in $\hat{\rho}'$: the GP modeling error and the MC sampling error. The latter arises from having $\hat{Y}'$, $Y'_L$, and $Y'_S$ to approximate $\hat{Y}$, $Y_L$, and $Y_S$ respectively. The GP error is from using the mean function to estimate $\rho$. We can combine these into a single error bound on the discrepancy:

$$D_\lambda(\hat{Y}', \tilde{Y}) \le D_\lambda(\hat{Y}', \tilde{Y}') + D_\lambda(\tilde{Y}', \tilde{Y}).$$

This follows from the triangle inequality that $D_\lambda$ satisfies because it is a metric. Above we just showed that $D_\lambda(\hat{Y}', \tilde{Y}') \le \epsilon_{GP}$. Furthermore, $D_\lambda(\tilde{Y}', \tilde{Y})$ is just the error due to a standard Monte Carlo approximation, which, as discussed in §2, can be bounded with high probability by, say, $\epsilon_{MC}$, depending on the number of samples. Also, the two sources of error are independent. This yields the main error bound of this paper, which we state as follows.

Theorem 4.1 If MC sampling is $(\epsilon_{MC}, \delta_{MC})$-approximate and GP prediction is $(\epsilon_{GP}, \delta_{GP})$-approximate, then the output has an error bound of $(\epsilon_{MC} + \epsilon_{GP})$ with probability $(1 - \delta_{MC})(1 - \delta_{GP})$.

Computing Simultaneous Confidence Bands. Now we describe how to choose a high probability envelope, i.e., a pair $(f_S, f_L)$ that contains $\tilde{f}$ with probability $1 - \alpha$. We will use a band of the form $f_S = \hat{f}(x) - z_\alpha\sigma(x)$ and $f_L = \hat{f}(x) + z_\alpha\sigma(x)$. The problem is to choose $z_\alpha$. An intuitive choice would be to choose $z_\alpha$ based on the quantiles of the univariate Gaussian, e.g., choose $z_\alpha = 2$ for a 95% confidence band. This would give us a point-wise confidence band, i.e., at any single point $x$, we would have $f_S(x) \le \tilde{f}(x) \le f_L(x)$ with probability at least $1 - \alpha$. But we need something stronger. Rather, we want $(f_S, f_L)$ such that the probability that $f_S(x) \le \tilde{f}(x) \le f_L(x)$ at all inputs $x$ simultaneously is at least $1 - \alpha$.
An envelope with this property is called a simultaneous confidence band. We will still use a band of the form $\hat{f}(x) \pm z_\alpha\sigma(x)$, but we will need to choose a $z_\alpha$ large enough to get a simultaneous confidence band. Say we set $z_\alpha$ to some value $z$. The confidence band is satisfied if $Z(x) := \frac{|\tilde{f}(x) - \hat{f}(x)|}{\sigma(x)} \le z$ for all $x$. Therefore, if the probability of $\sup_{x \in \mathcal{X}} Z(x) \ge z$ is small, the confidence band is unlikely to be violated. We adopt an approximation of this probability due to [3], i.e.,

$$\Pr[\sup_{x \in \mathcal{X}} Z(x) \ge z] \approx E[\varphi(A_z(\mathcal{X}))], \qquad (5)$$

where the set $A_z(\mathcal{X}) := \{x \in \mathcal{X} : Z(x) \ge z\}$ is the set of all inputs where the confidence band is violated, and $\varphi(A)$ is the Euler characteristic of the set $A$. Also, [3] provides a numerical method to approximate Eq. (5) that works well for small $\alpha$, i.e., high probability that the confidence band is correct, which is precisely the case of interest. The details are somewhat technical, and are omitted for space; see [3, 24]. Overall, the main computational expense is that the approximation requires computing second derivatives of the covariance function, but we have still found it to be feasible in practice. Once we have computed the approximation to Eq. (5), we compute the confidence band by setting $z_\alpha$ to be the solution of the equation

$$\Pr[\sup_{x \in \mathcal{X}} Z(x) \ge z_\alpha] \approx E[\varphi(A_{z_\alpha}(\mathcal{X}))] = \alpha.$$

4.3 Error Bounds for KS Measure

The above analysis can be applied in a similar way if the KS distance is used as the accuracy metric. The main result is as follows.

Proposition 4.2 Consider the mean function $\hat{f}(x)$ and the envelope $\hat{f}(x) \pm z\sigma(x)$. Let $\tilde{f}(x)$ be a function in the envelope. Given uncertain input $X$, let $\hat{Y} = \hat{f}(X)$ and $\tilde{Y} = \tilde{f}(X)$. Then $KS(\tilde{Y}, \hat{Y})$ is largest when $\tilde{f}(x)$ is at either boundary of the envelope.

Proof sketch. Recall that $KS(\tilde{Y}, \hat{Y}) = \sup_y |\Pr[\tilde{Y} \le y] - \Pr[\hat{Y} \le y]|$. Let $y_m$ correspond to the supremum in the formula of KS. Wlog, let $KS = \int (\mathbb{1}[\hat{f}(x) \le y_m] - \mathbb{1}[\tilde{f}(x) \le y_m])\, p(x)\, dx > 0$. That is, for some $x$, $\hat{f}(x) \le y_m < \tilde{f}(x)$. Now suppose there exists some $x'$ s.t. $\tilde{f}(x') < \hat{f}(x')$; the KS distance would increase if $\hat{f}(x') \le \tilde{f}(x')$.
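This means that KS becomes larger when $\tilde{f}(x) \ge \hat{f}(x)$ for all $x$; that is, $\tilde{f}(x)$ lies above $\hat{f}(x)$ for all $x$. Also, it is intuitive to see that among the functions that lie above $\hat{f}(x)$, $\hat{f}(x) + z\sigma(x)$ yields the largest KS error, since it maximizes $\mathbb{1}[\hat{f}(x) \le y] - \mathbb{1}[\tilde{f}(x) \le y]$, $\forall y$. (Similarly, we can show that if $KS = \int (\mathbb{1}[\tilde{f}(x) \le y_m] - \mathbb{1}[\hat{f}(x) \le y_m])\, p(x)\, dx > 0$, KS is maximized if $\tilde{f}(x)$ lies below $\hat{f}(x)$ for all $x$.)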
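Proposition 4.2 suggests a simple computation: since the KS distance is largest at a boundary of the envelope, it suffices to compare the mean output with the two boundary outputs. The sketch below illustrates this on Monte Carlo samples; it assumes the envelope $\hat{f}(x) \pm z\sigma(x)$, and the helper names `ks_distance` and `ks_error_bound` are hypothetical rather than part of the paper.

```python
import numpy as np

def ks_distance(u, w):
    """Two-sample KS distance: the largest gap between two empirical CDFs."""
    grid = np.sort(np.concatenate([u, w]))
    F_u = np.searchsorted(np.sort(u), grid, side="right") / len(u)
    F_w = np.searchsorted(np.sort(w), grid, side="right") / len(w)
    return float(np.max(np.abs(F_u - F_w)))

def ks_error_bound(mean, std, z):
    """KS distance from the mean output to the farther envelope boundary.

    mean, std: GP predictive mean and standard deviation at the input samples.
    z: envelope half-width multiplier (assumed envelope: mean +/- z * std).
    """
    lower = mean - z * std   # output samples of the lower boundary
    upper = mean + z * std   # output samples of the upper boundary
    return max(ks_distance(mean, lower), ks_distance(mean, upper))
```

Only the CDF values at the pooled sample points matter, because empirical CDFs are step functions that change only at those points.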

Figure 3: Choosing a subset of training points for local inference. (Legend: input sample, training point, local training point, $x_{near}$, $x_{far}$, bounding box of the input samples, bounding box of the local training points.)

As a result, let $Y_S$ and $Y_L$ be the outputs computed using the lower and upper boundaries, $\hat{f}(x) - z\sigma(x)$ and $\hat{f}(x) + z\sigma(x)$, respectively. Then, the KS error bound is

$$\max(KS(\hat{Y}, Y_S),\ KS(\hat{Y}, Y_L)).$$

We can obtain the empirical variables $\hat{Y}'$, $Y'_S$, and $Y'_L$ via Monte Carlo sampling as before. We also analyze the combined effect of the two sources of error, MC sampling and GP modeling, as for the discrepancy measure. We obtain a similar result: the total error bound is the sum of the two error bounds, $\epsilon_{MC}$ and $\epsilon_{GP}$. The proof is omitted due to space constraints but is available in [24].

5. AN OPTIMIZED ONLINE ALGORITHM

In Section 4.1, we presented a basic algorithm (Algorithm 2) to compute output distributions when Gaussian processes model our UDFs. However, this algorithm does not satisfy our design constraints, for the following reasons. It is an offline algorithm, since the training data is fixed and learning is performed before inference. Given an accuracy requirement, it is hard to know beforehand the number of training points, $n$, that is needed. If we use a larger $n$, the accuracy is higher, but the performance suffers due to both the training cost $O(n^3)$ and the inference cost $O(n^2)$. We now seek an online algorithm that is robust to UDFs and input distributions in meeting accuracy requirements. We further optimize it for high performance.

5.1 Local Inference

We first propose a technique to reduce the cost of inference while maintaining good accuracy. The key observation is that the covariance between two points $x_i$ and $x_j$ is small when the distance between them is large. For example, the squared-exponential covariance function decreases exponentially in the squared distance, $k(x_i, x_j) = \sigma_f^2 \exp\{-\|x_i - x_j\|^2 / l^2\}$. Therefore, the far training points have only small weights in the weighted average, and hence can be omitted. This suggests a technique that we call local inference, with the steps shown in Algorithm 4.
(We refer to the standard inference technique as global inference.)

Algorithm 4 Local inference
Input: Input distribution $p(x)$. Training data: $\{(x_i, y_i),\ i = 1 \ldots n\}$, stored in an R-tree.
1: Draw $m$ samples from the input distribution $p(x)$ and construct a bounding box for the samples.
2: Retrieve a set of training points, called $X_L$, that have distance to the bounding box less than a maximum distance specified by the local inference threshold $\Gamma$ (discussed more below).
3: Run inference using $X_L$ to get the function values at the samples. Return the CDF constructed from the inferred values.

Fig. 3 illustrates the execution of local inference to select a subset of training points given the input distribution. The darker rectangle is the bounding box of the input samples, and the lighter rectangle includes the training points selected for local inference.

Choosing the training points for local inference given a threshold. The threshold $\Gamma$ is chosen so that the approximation error in $\hat{f}(x_j)$, for all samples $x_j$, is small. That is, $\hat{f}(x_j)$ does not differ much whether it is computed using global or local inference. Revisit global inference as in Eq. 2. The vector $K(X^*, X^*)^{-1} y$, called $\alpha$, can be updated once the training data changes, and stored for later inference. Then, computing $\hat{f}(x_j) = K(x_j, X^*) K(X^*, X^*)^{-1} y = K(x_j, X^*)\alpha$ involves a vector dot product. Note that the cost of computing this mean is $O(n)$; the high cost of inference, $O(n^2)$, is due to computing the variance $\sigma^2(x_j)$ (see §3.3 for more detail). If we use a subset of training points, we approximate $\hat{f}(x_j)$ with $\hat{f}_L(x_j) = K(x_j, X_L)\alpha_L$. ($\alpha_L$ is the same as $\alpha$ except that the entries in $\alpha$ that do not correspond to a selected training point are set to 0.) Then the approximation error $\gamma_j$, for the sample $j$, is:

$$\gamma_j = |K(x_j, X^*)\alpha - K(x_j, X_L)\alpha_L| = |K(x_j, X_{\bar{L}})\alpha_{\bar{L}}| = \Big|\sum_{l \in \bar{L}} k(x_j, x_l)\alpha_l\Big|,$$

where $X_{\bar{L}}$ are the training points excluded from local inference. Ultimately, we want to compute $\gamma = \max_j \gamma_j$, which is the maximum error over all the samples.
The cost of computing $\gamma$ by considering every $j$ is $O(mn)$, as $j = 1 \ldots m$, which is high for large $m$. We next present a more efficient way to compute an upper bound for $\gamma$. We use a bounding box for all the samples $x_j$ as constructed during local inference. For any training point with index $l$, $x_l$, let $x_{near}$ be the closest point from the bounding box to $x_l$ and $x_{far}$ be the furthest point from the bounding box to $x_l$ (see Fig. 3 for an example of these points). For any sample $j$ we have:

$$k(x_{far}, x_l) \le k(x_j, x_l) \le k(x_{near}, x_l).$$

Next, by multiplying with $\alpha_l$, we have the upper and lower bounds for $k(x_j, x_l)\alpha_l$. With these inequalities, we can obtain an upper bound $\gamma_{upper}$ and a lower bound $\gamma_{lower}$ for $\gamma_j$, $\forall j$. Then,

$$\gamma = \max_j \gamma_j \le \max(|\gamma_{upper}|, |\gamma_{lower}|).$$

Computing this takes time proportional to the number of excluded training points, which is $O(n)$. For each of these points, we need to consider the sample bounding box, which incurs a constant cost when the dimension of the function is fixed. After computing $\gamma$, we compare it with the threshold $\Gamma$. If $\gamma > \Gamma$, we expand the bounding box for selected training points and recompute $\gamma$ until we have $\gamma \le \Gamma$. Note that $\Gamma$ should be set to be small compared with the domain of $Y$, i.e., the error incurred for every test point is small. In §6, we show how to set $\Gamma$ to obtain good performance.

We mention an implementation detail that makes the bound on $\gamma$ tighter, which can result in fewer selected training points and hence improved performance. We divide the sample bounding box into smaller non-overlapping boxes as shown in Fig. 3. Then for each box, we compute its $\gamma$, and then return the maximum over all these boxes.

Complexity for local inference. Let $l$ be the number of selected training points; the cost of inference is $O(l^3 + ml^2 + n)$. $O(l^3)$ is to compute the inverse matrix $K(X_L, X_L)^{-1}$ needed in the formula of the variance; $O(ml^2)$ is to compute the output variance; and $O(n)$ is to compute $\gamma$ while choosing the local training points. Among the costs, $O(ml^2)$ is usually dominant (esp. for high accuracy requirements).
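The bounding-box argument for $\gamma$ can be illustrated with a short sketch. It assumes an axis-aligned bounding box and a squared-exponential kernel of the form $\sigma_f^2 \exp\{-r^2 / l^2\}$ (the exact scaling of the length scale is an assumption here), and the names `sq_exp` and `gamma_upper_bound` are hypothetical. The cost is linear in the number of excluded training points and does not depend on the number of samples $m$.

```python
import numpy as np

def sq_exp(r2, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel as a function of squared distance r2."""
    return sigma_f**2 * np.exp(-r2 / ell**2)

def gamma_upper_bound(box_lo, box_hi, x_excl, alpha_excl, sigma_f=1.0, ell=1.0):
    """Upper bound on gamma = max_j |sum_l k(x_j, x_l) * alpha_l|.

    box_lo, box_hi: corners of the axis-aligned bounding box of the samples.
    x_excl: (n_excl, d) excluded training inputs; alpha_excl: their alpha entries.
    """
    # Closest point of the box to each excluded training point.
    near = np.clip(x_excl, box_lo, box_hi)
    # Farthest point: per dimension, the box face farther from the point.
    far = np.where(np.abs(x_excl - box_lo) > np.abs(x_excl - box_hi), box_lo, box_hi)
    k_near = sq_exp(np.sum((x_excl - near) ** 2, axis=1), sigma_f, ell)
    k_far = sq_exp(np.sum((x_excl - far) ** 2, axis=1), sigma_f, ell)
    # k_far <= k(x_j, x_l) <= k_near; the sign of alpha_l decides which end
    # of that range gives the upper or lower bound of each term.
    upper = np.sum(np.maximum(k_near * alpha_excl, k_far * alpha_excl))
    lower = np.sum(np.minimum(k_near * alpha_excl, k_far * alpha_excl))
    return max(abs(upper), abs(lower))
```

Splitting the box into smaller boxes and taking the maximum of the per-box bounds, as described above, tightens the result because each small box has a narrower range between `k_far` and `k_near`.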
Local inference is thus an improvement over global inference, which has a cost of $O(mn^2)$, because $l$ is usually smaller than $n$.

5.2 Online Tuning

Our objective is to seek an online algorithm for GPs: we start with no training points and collect them over time so that the function model gets more accurate. We can examine each input distribution on-the-fly to see whether more training points are needed given

an accuracy requirement. This contrasts with the offline approach, where the training data must be obtained before inference. To develop an online algorithm, we need to make two decisions. The first decision is how many training points to add. This task is related to the error bounds from §4: we add training points until the upper bound on the error is less than the user's tolerance level. The second decision is where the training points should be, specifically, what input location $x_{n+1}$ to use for the next training point. A standard method is to add new training points where the function evaluation is highly uncertain, i.e., $\sigma^2(x)$ is large. We adopt a simple heuristic for this: we cache the Monte Carlo samples throughout the algorithm, and when we need more training points, we choose the sample $x_j$ that has the largest predicted variance $\sigma^2(x_j)$, compute its true function value $f(x_j)$, and add it to the training data set. After that, we run inference, compute the error bound again, and repeat until the error bound is small enough. We have experimentally observed that this simple heuristic works well. A complication is that when we add a new training point, the inverse covariance matrix $K(X^*, X^*)^{-1}$ gets bigger, so it needs to be recomputed. Recomputing it from scratch would be expensive, i.e., $O(n^3)$. Fortunately, we can update it incrementally using the standard formula for inverting a block matrix (see [24] for details).

5.3 Online Retraining

In our work, the training data is obtained on the fly. Since different inputs correspond to different regions of the function, we may need to tune the GP model to best fit the up-to-date training data, i.e., to retrain. A key question is when we should perform retraining (as mentioned in §3.4). It is preferable that retraining is done infrequently due to its high cost of $O(n^3)$ in the number of training points and the multiple iterations required. The problem of retraining is less commonly addressed in existing work for GPs.
Since retraining involves maximizing the likelihood function $L(\theta)$, we will make this decision by examining the likelihood function. Recall also that the numerical optimizer, e.g., gradient descent, requires multiple iterations to find the optimum. A simple heuristic is to run training only if the optimizer is able to make a big step during its very first iteration. Given the current hyperparameters $\theta$, run the optimizer for one step to get a new setting $\theta'$, and continue with training only if $|\theta' - \theta|$ is larger than a pre-set threshold $\Delta_\theta$. In practice, we have found that gradient descent does not work well with this heuristic, because it does not move far enough during each iteration. Instead, we use a more sophisticated heuristic based on a numerical optimizer called Newton's method, which uses both the first and the second derivatives of $L(\theta)$. Mathematical derivation shows that the second derivatives of $L(\theta)$ are:

$$L''(\theta_j) = \frac{1}{2} \mathrm{tr}\Big[\Big(\frac{\partial K^{-1}}{\partial \theta_j} y y^T K^{-1} + K^{-1} y y^T \frac{\partial K^{-1}}{\partial \theta_j} - \frac{\partial K^{-1}}{\partial \theta_j}\Big) \frac{\partial K}{\partial \theta_j} + (K^{-1} y y^T K^{-1} - K^{-1}) \frac{\partial^2 K}{\partial \theta_j^2}\Big],$$

where $\mathrm{tr}[\cdot]$ is the trace of a matrix. $\partial K / \partial \theta_j$ and $\partial^2 K / \partial \theta_j^2$ can be updated incrementally. (The details are shown in [24].)

5.4 A Complete Online Algorithm

We now put together all of the above techniques to form a complete online algorithm to compute UDFs on uncertain data using GPs. The main idea is as follows: starting with no training data, given an input distribution, we use online tuning (§5.2) to obtain more training data, and run inference to compute the output distribution. Local inference (§5.1) is used for improved performance. When some training points are added, we use our retraining strategy to decide whether to relearn the GP model by updating its hyperparameters.

Algorithm 5 OLGAPRO: Compute output distribution using Gaussian process with optimizations
Input: Input tuple $X \sim p(x)$. Training data: $T = \{(x_i, y_i),\ i = 1..n\}$; hyperparameters of the GP: $\theta$. Accuracy requirement for the discrepancy measure: $(\epsilon, \delta)$.
1: Draw $m$ samples for $X$, $\{x_j,\ j = 1..m\}$, where $m$ depends on the sampling error bound $\epsilon_{MC} < \epsilon$.
2: Compute the bounding box for these samples.
Retrieve a subset of training points for local inference given the threshold $\Gamma$ (see §5.1). Denote this set of training points $T_\Gamma$.
3: repeat
4: Run local inference using $T_\Gamma$ to get the output samples $\{(\hat{f}(x_j), \sigma^2(x_j)),\ j = 1..m\}$.
5: Compute the discrepancy error bound $D_{upper}$ using these samples (see §4.2).
6: If $D_{upper} > \epsilon_{GP}$, add a new training point at the sample with the largest variance, i.e., $(x_{n+1}, f(x_{n+1}))$ (see §5.2), and insert this point into the training data index. Set $n := n + 1$.
7: until $D_{upper} \le \epsilon_{GP}$
8: if one or more training points are added then
9: Compute the log likelihood $L(\theta) = \log p(y \mid X, \theta)$ and its first and second derivatives, and estimate $\delta_\theta$ (see §5.3).
10: if $\delta_\theta \ge \Delta_\theta$ then
11: Retrain to get the new hyperparameters $\theta'$. Set $\theta := \theta'$.
12: Rerun inference.
13: end if
14: end if
15: Return the distribution of $Y$, computed from the samples $\{\hat{f}(x_j)\}$.

Our algorithm, which we name OLGAPRO, standing for ONline GAussian PROcess, is shown as Algorithm 5. The objective is to compute the output distribution that meets the user-specified accuracy requirement under the assumption of GP modeling. The main steps of the algorithm are: (a) Compute the output distribution by sampling the input and inferring with the Gaussian process (Steps 1-4). (b) Compute the error bound (Steps 5-7). If this error bound is larger than the allocated error bound, use online tuning to add a new training point. Repeat this until the error bound is acceptable. (c) If one or more training points have been added, decide whether retraining is needed and, if so, perform retraining (Steps 8-12).

Parameter setting. We further consider the parameters used in the algorithm. The choice of $\Gamma$ for local inference in Step 2 is discussed in §5.1. The allocation of the two sources of error, $\epsilon_{MC}$ and $\epsilon_{GP}$, is according to Theorem 4.1, $\epsilon = \epsilon_{MC} + \epsilon_{GP}$. Then our algorithm automatically chooses the number of samples $m$ to meet the accuracy requirement $\epsilon_{MC}$ (see §2 for the formula).
For retraining, setting the threshold $\Delta_\theta$, mentioned in §5.3, smaller will trigger retraining more often but potentially make the model more accurate, while setting it high can give inaccurate results. In §6, we experimentally show how to set these parameters efficiently.

Complexity. The complexity of local inference is $O(l^3 + ml^2 + n)$ as shown in §5.1. Computing the error bound takes $O(m \log m)$ (see §4.2). And retraining takes $O(n^3)$. The number of samples $m$ is $O(1/\epsilon_{MC}^2)$, while the number of training points $n$ depends on $\epsilon_{GP}$ and the UDF itself. The unit cost is basic math operations, in contrast to complex function evaluations as in standard MC simulation. This is because when the system converges, we seldom need to add more training points or to call function evaluation. Also, at convergence, the high cost of retraining can be avoided; the computation needed is for inference and computing error bounds.
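The incremental update of the inverse covariance matrix used when Step 6 of Algorithm 5 adds a training point (§5.2) can be sketched with the standard block-matrix inversion identity. This is a minimal illustration, not the paper's implementation: the function name `grow_inverse` and its arguments are hypothetical, and the covariance matrix is assumed symmetric and well conditioned. The update costs $O(n^2)$ instead of the $O(n^3)$ needed to invert from scratch.

```python
import numpy as np

def grow_inverse(K_inv, k_new, kappa):
    """Inverse of [[K, k_new], [k_new^T, kappa]] given K_inv = K^{-1}.

    K_inv: (n, n) inverse of the current covariance matrix.
    k_new: (n,) covariances between the new point and the existing points.
    kappa: covariance of the new point with itself (including any noise term).
    """
    n = K_inv.shape[0]
    u = K_inv @ k_new                 # K^{-1} k_new
    s = kappa - k_new @ u             # Schur complement (a scalar)
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = K_inv + np.outer(u, u) / s
    out[:n, n] = -u / s
    out[n, :n] = -u / s
    out[n, n] = 1.0 / s
    return out
```

A small Schur complement `s` signals that the new point is almost redundant with the existing training data, in which case the update can become numerically unstable.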

Hybrid solution. We now consider a hybrid solution that combines our two approaches: direct MC sampling, and GP modeling and inference. The need for a hybrid solution arises since functions can vary in their complexity and evaluation time. Therefore, when given a black-box UDF, we explore these properties on the fly and choose the better solution. We can measure the function evaluation time while obtaining training data. We then run GPs to convergence, measure the inference time, and then compare the running times of the two approaches. Due to space constraints, the details of this solution are deferred to [24]. In §6, we conduct experiments to determine the cases where each approach can be applied.

5.5 Online Filtering

In the presence of a selection predicate on the UDF output, similar to the filtering technique for Monte Carlo simulation (§2), we also consider online filtering when sampling with a Gaussian process. Again, we consider selection with the predicate $a \le f(x) \le b$. Let $(\hat{f}(x), \sigma^2(x))$ be the estimate at any input point $x$. With the GP approximation, the tuple existence probability $\rho$ is approximated with $\hat{\rho} = \Pr[\hat{f}(x) \in [a, b]]$. This is exactly the quantity that we bounded in §4.2, where we showed that $\rho \le \rho_U$. So in this case, we filter tuples whose estimate of $\rho_U$ is less than our threshold. Again, since $\rho_U$ is computed from the samples, we can check this online to make the filtering decision.

6. PERFORMANCE EVALUATION

In this section, we evaluate the performance of our proposed techniques using both synthetic functions and data with controlled properties, and real workloads from the astrophysics domain.

6.1 Experimental Setup

We first use synthetic functions with controlled properties to test the performance and sensitivity of our algorithms. We now describe the settings of these functions, the input data, and the parameters used.

A. Functions. We generate functions (UDFs) of different shapes in terms of bumpiness and spikiness.
A simple method is to use Gaussian mixtures [] to simulate various function shapes (which should not be confused with the input and output distributions of the UDF, and by no means favors our GP approach). We vary the number of Gaussian components, which dictates the number of peaks of a function. The means of the components determine the domain, and their covariance matrix determines the stretch and bumpiness of the function. We denote the function dimensionality $d$; this is the number of input variables of the function. We observe that in real applications, many functions have low dimensionality, e.g., 1 or 2 for astrophysics functions. For evaluation purposes, we vary $d$ in a wider range of [,]. Besides the shape, a function is characterized by the evaluation time, $T$, which we vary in the range µs to s.

B. Input Data. By default, we consider uncertain data following Gaussian distributions, i.e., the input vector has a distribution characterized by $N(\mu_I, \Sigma_I)$. $\mu_I$ is drawn from the given support of the function $[L, U]$. $\Sigma_I$ determines the spread of the input distributions. For simplicity, we assume the input variables of a function are independent, but supporting correlated input is not harder; we just need to sample from the joint distributions. We also consider other distributions including exponential and Gamma. We note that handling other types of distributions is similar for the same reason (the difference is the cost of sampling).

C. Accuracy Requirement. We use the discrepancy measure as the accuracy metric in our experiments. The user specifies the accuracy requirement $(\epsilon, \delta)$ and the minimum interval length $\lambda$. $\lambda$ is set to be a small percentage (e.g., %) of the range of the function. This requirement means that with probability $(1 - \delta)$, for any interval of length at least $\lambda$, the probabilities of the interval computed from the approximate and true output distributions do not differ from each other by more than $\epsilon$. For the GP approach, the error bound $\epsilon$ is allocated to two sources of error, the GP error bound $\epsilon_{GP}$ and the sampling error bound $\epsilon_{MC}$, where $\epsilon = \epsilon_{GP} + \epsilon_{MC}$.
We also distribute $\delta$ so that $1 - \delta = (1 - \delta_{GP})(1 - \delta_{MC})$. Our default setting is as follows. The domain of function $[L, U]$ = [, ], input standard deviation $\sigma_I$ = .5, function evaluation time $T$ = ms, accuracy requirement ($\epsilon$ = ., $\delta$ = .5). The reported results are averaged from 5 output distributions or when the algorithm converges, whichever is larger.

6.2 Evaluating our GP Techniques

We first evaluate the individual techniques employed in our Gaussian process algorithm, OLGAPRO. The objective is to understand and set various internal parameters of our algorithm.

Profile 1: Accuracy of function fitting. We first choose four two-dimensional functions of different shapes and bumpiness (see Fig. 4). These functions are the four combinations between (i) one or five components, (ii) large or small variance of Gaussian components, which we refer to as $F_1$, $F_2$, $F_3$, and $F_4$. First, we check the effectiveness of GP modeling. We vary the number of training points $n$ and run basic global inference at test points. Fig. 5(a) shows the relative errors for inference, i.e., $\frac{|\hat{f}(x) - f(x)|}{|f(x)|}$, evaluated at a large number of test points. The simplest function $F_1$, with one peak and being flat, needs a small number of training points, e.g., 3, to be well approximated. In contrast, the most bumpy and spiky function $F_4$ requires the largest number of points, $n$ > 3, to be accurate. The other two functions are in between. This confirms that the GP approach can model functions of different shapes well; however, the number of training points needed varies with the function. In the later experiments, we will show that OLGAPRO can robustly determine the number of training points needed online.

Profile 2: Behavior of error bound. We next test the behavior of our discrepancy error bound, which is described in §4.2 and computed using Algorithm 3. We compute the error bounds and measure the actual errors. Fig. 5(b) shows the result for the function $F_4$, which confirms that the error bounds are actual upper bounds and hence indicates the validity of GP modeling.
More interestingly, it shows how tight the bounds are (about 2 to 4 times of the actual errors). As $\lambda$ gets smaller, more intervals are considered for the discrepancy measure; thus, the errors and error bounds, the suprema for a larger set of intervals, get larger. We test the other functions and observe the same trends. In the following experiments, we use a stringent requirement: setting $\lambda$ to be % of the function range.

Profile 3: Allocation of two sources of error. We also examine the allocation of the user-specified error bound $\epsilon$ to the errors from GP modeling and MC sampling, $\epsilon_{GP}$ and $\epsilon_{MC}$, as in Theorem 4.1. The details are omitted due to space constraints, but are discussed in [24]. In general, we set $\epsilon_{MC}$ to be .7$\epsilon$ for good performance.

In the next three experiments, we evaluate three key techniques employed in our GP approach. The default function is $F_4$.

Expt 1: Local inference. We first consider our local inference technique as shown in §5.1. We compare the accuracy and running time of local inference with those of global inference. For now, we fix the number of training points to compare the performance of the two inference techniques. We vary the threshold $\Gamma$ of local inference from .% to 2% of the function range. Recall that setting $\Gamma$ small corresponds to using more training points and is hence similar to global inference. Our goal is to choose a setting of $\Gamma$ so that local inference has similar accuracy to global inference while being faster. Figs. 5(c) and 5(d) show the accuracy and running time,

respectively. We see that for most values of $\Gamma$ tested, local inference is as accurate as global inference while offering a speedup from 2 to 4 times. We repeat this experiment for other functions and observe that for less bumpy functions, the speedup for local inference is less pronounced, but the accuracy is always comparable. This is because for smooth functions, far training points still have a high weight in inference. In general, we set $\Gamma$ to about (.5x function range), which results in good accuracy and improved running time.

Figure 4: A family of functions of different smoothness and shape used in evaluation. Panels: (a) Funct1, (b) Funct2, (c) Funct3, (d) Funct4.

Expt 2: Online tuning. In §5.2, we proposed adding training points on-the-fly to meet the accuracy requirement. We now evaluate our heuristic of choosing the sample with the largest variance to add. We compare it with the following two heuristics. Given an input distribution, a simple one is to choose a sample of the input at random. Another heuristic is what we call optimal greedy, which considers all samples, simulates adding each of them to compute a new error bound, and then picks the sample giving the largest error bound reduction. This is only hypothetical since it is prohibitively expensive to simulate adding every sample. For this experiment only, we assume that each input has 4 samples for optimal greedy to be feasible. We start with just 25 training points and add more when necessary. Fig. 5(e) shows the accumulated number of training points added over time (for performance, we restrict that no more than points can be added for every input). As observed, our technique using the largest variance requires fewer training points, and hence runs faster, than randomly adding points. Also, it is close to our optimal greedy heuristic while being much faster to run online.

Expt 3: Retraining strategy. We now examine the performance of our retraining strategy (see §5.3). We vary our threshold $\Delta_\theta$ for retraining and compare this strategy with two other strategies: eager training when one or more training points are added, and no training.
Again, we start with a small number of training points and add more using online tuning. Figs. 5(f) and 5(g) show the accuracy and running time respectively. As expected, setting $\Delta_\theta$ smaller means retraining more often and is similar to eager retraining, while a larger $\Delta_\theta$ means less retraining. We see that setting $\Delta_\theta$ less than .5 gives the best performance, as fewer retraining calls are needed while the hyperparameters are still good estimates. We repeat this experiment with other functions and see that conservatively setting $\Delta_\theta$ = .5 gives good performance for this set of functions. In practice, $\Delta_\theta$ can be chosen in reference to the hyperparameter values.

6.3 GP versus Monte Carlo Simulation

We next examine the performance of our complete online algorithm, OLGAPRO (Algorithm 5). The internal parameters are set as above. We also compare this algorithm with the MC approach.

Expt 4: Varying user-specified $\epsilon$. We run the GP algorithm for all four functions $F_1$ to $F_4$. We vary $\epsilon$ in the range of [.2,.2]. Fig. 5(h) shows the running time for the four functions. (We verify that the accuracy requirement $\epsilon$ is always satisfied, and omit the plot due to space constraints.) As $\epsilon$ gets smaller, the running time increases. This is due to the fact that the number of samples is proportional to $1/\epsilon_{MC}^2$. Besides, a small $\epsilon_{GP}$ requires more training points, hence a higher cost for inference. This experiment also verifies the effect of the function complexity on the performance. A flat function like $F_1$ needs far fewer training points than a bumpy, spiky function like $F_4$, thus the running time differs by about two orders of magnitude. We also repeat this experiment for other input distributions including Gamma and exponential distributions, and observe very similar results, which is due to our general approach of working with input samples. Overall, our algorithm can robustly adapt to the function complexity and the accuracy requirement.

Expt 5: Varying evaluation time $T$. The tradeoff between the GP and MC approaches mainly lies in the function evaluation time $T$. In this experiment, we fix $\epsilon$ = .
and vary $T$ from µs to s. Fig. 5(i) shows the running time of the two approaches for all four functions. Note that the running time for MC sampling is similar for all functions, hence we just show one line. As observed, the GP approach starts to outperform the sampling approach when function evaluation takes longer than .ms for simple functions like $F_1$, and up to ms for complex functions like $F_4$. Also we note that our GP approach is almost insensitive to function evaluation time, which is only incurred during the early phase. After convergence, function evaluation occurs only infrequently. This demonstrates the applicability of the GP approach for long running functions. This result also argues for the use of a hybrid solution as described in §5.4. Since the function complexity is unknown beforehand, so is the number of training points. The hybrid solution can be used to automatically pick the better approach based on the function's complexity and evaluation time, e.g., the GP method is used for simple functions with evaluation time of .ms or above, and for complex functions only with longer time.

Expt 6: Optimization for selection predicates. We examine the performance of online filtering when there is a selection predicate. As shown in §2 and §5.5, this can be used for both direct MC sampling and sampling with a GP. We vary the selection predicate, which in turn affects the rate at which the output is filtered. We decide to filter output whose tuple existence probability is less than .. Fig. 5(j) shows the running time. As seen, when the filtering rate is high, online filtering helps reduce the running time, by a factor of 5 and 3 times for MC and GP respectively. We observe that the GP approach has a higher speedup because besides processing fewer samples, it results in a GP model with fewer training points, or smaller inference cost. Fig. 5(k) shows the false positive rates, i.e., tuples that should be filtered but are not during the sampling process. We observe that this rate is low, always less than %. The false negative rates are zero or negligible (less than .5%).
Expt 7: Varying function dimensionality $d$. We consider different functions with dimension $d$ varying from to. Fig. 5(l) shows the running time of these functions for both approaches. Since the running time using GP is insensitive to function evaluation time, we show only one line for $T$ = s for clarity. We observe that with GPs, high-dimensional functions incur a high cost, because more training points are needed to capture a larger region. Even with a high dimension of , the GP approach still outperforms MC when the function evaluation time reaches .s.