Temporally-Biased Sampling for Online Model Management


Brian Hentschel (Harvard University), Peter J. Haas* (University of Massachusetts), Yuanyuan Tian (IBM Research Almaden)

ABSTRACT

To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while, unlike in a sliding-window approach, still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both complete control over the decay rate and a guaranteed upper bound on the sample size, while maximizing both expected sample size and sample-size stability. The latter scheme rests on the notion of a fractional sample and, unlike T-TBS, allows for data arrival rates that are unknown and time varying. R-TBS and T-TBS are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.

1 INTRODUCTION

A key challenge for machine learning (ML) is to keep ML models from becoming stale in the presence of evolving data. In the context of the emerging Internet of Things (IoT), for example, the data comprises dynamically changing sensor streams [26], and a failure to adapt to changing data can lead to a loss of predictive power.
One way to deal with this problem is to re-engineer existing static supervised learning algorithms to become adaptive. Some parametric algorithms such as SVM can indeed be re-engineered so that the parameters are time-varying, but for non-parametric algorithms such as kNN-based classification, it is not at all clear how re-engineering can be accomplished. We therefore consider alternative approaches in which we periodically retrain ML models, allowing static ML algorithms to be used in dynamic settings essentially as-is. There are several possible retraining approaches.

* Work performed at IBM Research Almaden. © 2018 Copyright held by the owner/author(s). Published in Proceedings of the 21st International Conference on Extending Database Technology (EDBT), March 26-29, 2018, ISBN on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

Retraining on cumulative data: Periodically retraining a model on all of the data that has arrived so far is clearly infeasible because of the huge volume of data involved. Moreover, recent data is swamped by the massive amount of past data, so the retrained model is not sufficiently adaptive.

Sliding windows: A simple sliding-window approach would be to, e.g., periodically retrain on the data from the last two hours. If the data arrival rate is high and there is no bound on memory, then one must deal with long retraining times caused by large amounts of data in the window. The simplest way to bound the window size is to retain the last n items. Alternatively, one could try to subsample within the time-based window [14]. The fundamental problem with all of these bounding approaches is that old data is completely forgotten; the problem is especially severe when the data arrival rate is high. This can undermine the robustness of an ML model in situations where old patterns can reassert themselves. For example, a singular event such as a holiday, stock market drop, or terrorist attack can temporarily disrupt normal data patterns, which will reestablish themselves once the effect of the event dies down.
Periodic data patterns can lead to the same phenomenon. Another example, from [27], concerns influencers on Twitter: a prolific tweeter might temporarily stop tweeting due to travel, illness, or some other reason, and hence be completely forgotten in a sliding-window approach. Indeed, in real-world Twitter data, almost a quarter of top influencers were of this type, and were missed by a sliding-window approach.

Temporally biased sampling: An appealing alternative is a temporally biased sampling-based approach, i.e., maintaining a sample that heavily emphasizes recent data but also contains a small amount of older data, and periodically retraining a model on the sample. By using a time-biased sample, the retraining costs can be held to an acceptable level while not sacrificing robustness in the presence of recurrent patterns. This approach was proposed in [27] in the setting of graph analysis algorithms, and has recently been adopted in the MacroBase system [3]. The orthogonal problem of choosing when to retrain a model is also an important question, and is related to, e.g., the literature on concept drift [13]; in this paper we focus on the problem of how to efficiently maintain a time-biased sample.

In more detail, our time-biased sampling algorithms ensure that the appearance probability for a given data item, i.e., the probability that the item appears in the current sample, decays over time at a controlled exponential rate. Specifically, we assume that items arrive in batches (see the next section for more details), and our goal is to ensure that (i) our sample is representative in that all items in a given batch are equally likely to be in the sample, and (ii) if items i and j belong to batches that have arrived at (wall clock) times t′ and t″ with t′ ≤ t″, then for any time t ≥ t″ our sample S_t is such that

Pr[i ∈ S_t] / Pr[j ∈ S_t] = e^{−λ(t″ − t′)}.   (1)

Thus items with a given timestamp are sampled uniformly, and items with different timestamps are handled in a carefully controlled manner. The criterion in (1) is natural and appealing in applications and, importantly, is interpretable and understandable to users.
As discussed in [27], the value of the decay rate λ can be chosen to meet application-specific criteria. For example, by setting λ = 0.058, around 10% of the data items from 40 batches ago are included in the current analysis. As another example, suppose that, k = 150 batches ago, an entity such as a person or city was represented by n0 = 1000 data items and we want to ensure that, with probability q = 0.01, at least one of these data items remains in the current sample. Then we would set λ = −k^{−1} ln(1 − (1 − q)^{1/n0}) ≈ 0.077. If training data is available, λ can also be chosen to maximize accuracy via cross validation.

The exponential form of the decay function has been adopted by the majority of time-biased-sampling applications in practice because otherwise one would typically need to track the arrival time of every data item, both in and outside of the sample, and decay each item individually at an update, which would make the sampling operation intolerably slow. (A "forward decay" approach that avoids this difficulty, but has its own costs, has been proposed in [9]; we plan to investigate forward decay in future work.) Exponential decay functions make update operations fast and simple.

For the case in which the item-arrival rate is high, the main issue is to keep the sample size from becoming too large. On the other hand, when the incoming batches become very small or widely spaced, the sample sizes for all of the time-biased algorithms that we discuss (as well as for sliding-window schemes based on wall-clock time) can become small. This is a natural consequence of treating recent items as more important, and is characteristic of any sampling scheme that satisfies (1). We emphasize that, as shown in our experiments, a smaller but carefully time-biased sample typically yields greater prediction accuracy than a sample that is larger due to overloading with too much recent data or too much old data. I.e., more sample data is not always better. Indeed, with respect to model management, this decay property can be viewed as a feature: if the data stream dries up and the sample decays to a very small size, then this is a signal that there is not enough new data to reliably retrain the model, and that the current version should be kept for now.

It is surprisingly hard to both enforce (1) and to bound the sample size.
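Returning to the decay-rate examples above: both calculations are easy to reproduce. Here is a quick numeric check in Python (the function names are ours, chosen for illustration):

```python
import math

# Criterion 1: choose lambda so that a fraction f of the items from
# k batches ago are still included, i.e., e^{-lambda * k} = f.
def decay_rate_for_fraction(f, k):
    return -math.log(f) / k

# Criterion 2: choose lambda so that, with probability q, at least one
# of n0 items from k batches ago survives in the sample, i.e.,
# 1 - (1 - e^{-lambda * k})^{n0} = q.
def decay_rate_for_survival(q, n0, k):
    return -math.log(1.0 - (1.0 - q) ** (1.0 / n0)) / k

print(round(decay_rate_for_fraction(0.10, 40), 3))        # ~0.058
print(round(decay_rate_for_survival(0.01, 1000, 150), 3))  # ~0.077
```

Both values match the examples in the text, confirming that the two criteria pin down λ uniquely once the application-level targets are fixed.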
As discussed in detail in Section 7, prior algorithms that bound the sample size either cannot consistently enforce (1) or cannot handle wall-clock time. Examples of the former include algorithms based on the A-Res scheme of Efraimidis and Spirakis [12], and Chao's algorithm [5]. A-Res enforces conditions on the acceptance probabilities of items; this leads to appearance probabilities which, unlike (1), are both hard to compute and not intuitive. A similar example is provided by Chao's algorithm [5]. In Appendix D of [16] we demonstrate how the algorithm can be specialized to the case of exponential decay and modified to handle batch arrivals. We then show that the resulting algorithm fails to enforce (1) either when initially filling up an empty sample or in the presence of data that arrives slowly relative to the decay rate, and hence fails if the data rate fluctuates too much. The second type of algorithm, due to Aggarwal [1], can only control appearance probabilities based on the indices of the data items. For example, after n items arrive, one could require that, with 95% probability, the (n − k)th item should still be in the sample for some specified k < n. If the data arrival rate is constant, then this might correspond to a constraint of the form "with 95% probability, a data item that arrived 10 hours ago is still in the sample", which is often more natural in applications. For varying arrival rates, however, it is impossible to enforce the latter type of constraint, and a large batch of arriving data can prematurely flush out older data. Thus our new sampling schemes are interesting in their own right, significantly expanding the set of unequal-probability sampling techniques.

T-TBS: We first provide and analyze Targeted-Size Time-Biased Sampling (T-TBS), a simple algorithm that generalizes the sampling scheme in [27]. T-TBS allows complete control over the decay rate (expressed in wall-clock time) and probabilistically maintains a target sample size.
That is, the expected and average sample sizes converge to the target, and the probability of large deviations from the target decreases exponentially or faster in both the target size and the deviation size. T-TBS is simple and highly scalable when applicable, but only works under the strong restriction that the mean data arrival rate is known and constant. There are scenarios where T-TBS might be a good choice (see Section 3), but many applications have non-constant, unknown mean arrival rates or cannot tolerate sample overflows.

R-TBS: We then provide a novel algorithm, Reservoir-Based Time-Biased Sampling (R-TBS), that is the first to simultaneously enforce (1) at all times, provide a guaranteed upper bound on the sample size, and allow unknown, varying data arrival rates. Guaranteed bounds are desirable because they avoid memory management issues associated with sample overflows, especially when large numbers of samples are being maintained, so that the probability of some sample overflowing is high, or when sampling is being performed in a limited-memory setting such as at the edge of the IoT. Also, bounded samples reduce variability in retraining times and do not impose upper limits on the incoming data flow. The idea behind R-TBS is to adapt the classic reservoir sampling algorithm, which bounds the sample size but does not allow time biasing. Our approach rests on the notion of a fractional sample whose nonnegative size is real-valued in an appropriate sense. We show that, over all sampling algorithms having exponential decay, R-TBS maximizes the expected sample size whenever the data arrival rate is low and also minimizes the sample-size variability.

Distributed implementation: Both T-TBS and R-TBS can be parallelized. Whereas T-TBS is relatively straightforward to implement, an efficient distributed implementation of R-TBS is nontrivial. We exploit various implementation strategies to reduce I/O relative to other approaches, avoid unnecessary concurrency control, and make decentralized decisions about which items to insert into, or delete from, the reservoir.

Organization: The rest of the paper is organized as follows.
In Section 2 we formally describe our batch-arrival problem setting and discuss two prior simple sampling schemes: a simple Bernoulli scheme as in [27] and the classical reservoir sampling scheme, modified for batch arrivals. These methods either bound the sample size but do not control the decay rate, or control the decay rate but not the sample size. We next present and analyze the T-TBS and R-TBS algorithms in Section 3 and Section 4. We describe the distributed implementation in Section 5, and Section 6 contains experimental results. We review the related literature in Section 7 and conclude in Section 8.

2 SETTING AND PRIOR SCHEMES

After introducing our problem setting, we discuss two prior sampling schemes that provide context for our current work: simple Bernoulli time-biased sampling (B-TBS) with no sample-size control, and the classical reservoir sampling algorithm (with no time biasing), modified for batch arrivals (B-RS).

Setting: Items arrive in batches B_1, B_2, ..., at time points t = 1, 2, ..., where each batch contains 0 or more items. This simple integer batch sequence often arises from the discretization of time [24, 28]. Specifically, the continuous time domain is partitioned into intervals of length Δ, and the items are observed only at times {kΔ : k = 0, 1, 2, ...}. All items that arrive in an interval [kΔ, (k + 1)Δ) are treated as if they arrived at time kΔ, i.e., at the start of the interval, so that all items in batch B_i have time stamp iΔ, or simply time stamp i if time is measured in units of length Δ. As discussed below, our results can straightforwardly be extended to arbitrary real-valued batch-arrival times.

Our goal is to generate a sequence {S_t}_{t ≥ 0}, where S_t is a sample of the items that have arrived at or prior to time t, i.e., a sample of the items in U_t = S_0 ∪ (∪_{i=1}^{t} B_i). Here we allow the initial sample S_0 to start out nonempty. These samples should be biased towards recent items so as to enforce (1) for i ∈ B_{t′} and j ∈ B_{t″} while keeping the sample size as close as possible to (and preferably never exceeding) a specified target n.

Our assumption that batches arrive at integer time points can easily be dropped. In all of our algorithms, inclusion probabilities and, as discussed later, closely related item weights are updated at a batch arrival time t with respect to their values at the previous time t′ = t − 1 via multiplication by e^{−λ}. To extend our algorithms to handle arbitrary successive batch arrival times t′ and t, we simply multiply instead by e^{−λ(t−t′)}. Thus our results can be applied to arbitrary sequences of real-valued batch arrival times, and hence to an arbitrary sequence of item arrivals (since batches can comprise single items).

Bernoulli Time-Biased Sampling (B-TBS): In the simplest sampling scheme, at each time t, we accept each incoming item x ∈ B_t into the sample with probability 1. At each subsequent time t″ > t, we flip a coin independently for each item currently in the sample: an item is retained in the sample with probability p = e^{−λ} and removed with probability 1 − p. It is straightforward to adapt the algorithm to batch arrivals; see Appendix A of [16], where we show that Pr[x ∈ S_t] = e^{−λ(t−t′)} for x ∈ B_{t′}, implying (1).
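As a sanity check on this appearance probability, here is a small simulation of B-TBS (a sketch under our own naming; the per-item coin flips are exactly as described above):

```python
import math
import random

def b_tbs(batches, lam):
    """Bernoulli time-biased sampling: accept every arriving item, then
    retain each sample item with probability e^{-lam} at each later step."""
    p = math.exp(-lam)
    sample = set()
    for batch in batches:
        sample = {x for x in sample if random.random() < p}  # decay step
        sample |= set(batch)          # accept all new items w.p. 1
        yield set(sample)

# Empirically, an item arriving at time t' is present at time t with
# probability about e^{-lam * (t - t')}.
lam, runs, hits = 0.2, 5000, 0
batches = [[("item", t)] for t in range(1, 11)]   # one item per batch
for _ in range(runs):
    final = list(b_tbs(batches, lam))[-1]         # sample at t = 10
    hits += ("item", 5) in final                  # arrived at t' = 5
print(hits / runs, math.exp(-lam * 5))            # both near 0.37
```

The item arriving at t′ = 5 survives five independent decay steps, so its empirical inclusion frequency tracks e^{−λ·5}, and the ratio property (1) follows for any pair of arrival times.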
This is essentially the algorithm used, e.g., in [27] to implement time-biased edge sampling in dynamic graphs. The user, however, cannot independently control the expected sample size, which is completely determined by λ and the sizes of the incoming batches. In particular, if the batch sizes systematically grow over time, then the sample size will grow without bound. Arguments in [27] show that if sup_t |B_t| < ∞, then the sample size can be bounded, but only probabilistically. See Remark 1 below for extensions and refinements of these results.

Batched Reservoir Sampling (B-RS): The classic reservoir sampling algorithm can be modified to handle batch arrivals; see Appendix B of [16]. Although B-RS guarantees an upper bound on the sample size, it does not support time biasing. The R-TBS algorithm (Section 4) maintains a bounded reservoir as in B-RS while simultaneously allowing time-biased sampling.

3 TARGETED-SIZE TBS

As a first step towards time-biased sampling with a controlled sample size, we describe the simple T-TBS scheme, which improves upon the simple Bernoulli sampling scheme B-TBS by ensuring the inclusion property in (1) while providing probabilistic guarantees on the sample size. We require that the mean batch size equals a constant b that is both known in advance and large enough in that b ≥ n(1 − e^{−λ}), where n is the target sample size and λ is the decay rate as before. The requirement on b ensures that, at the target sample size, items arrive on average at least as fast as they decay.

Algorithm 1: Targeted-size TBS (T-TBS)
1   λ: decay factor (≥ 0)
2   n: target sample size
3   b: assumed mean batch size such that b ≥ n(1 − e^{−λ})
4   Initialize: S ← S_0; p ← e^{−λ}; q ← n(1 − e^{−λ})/b
5   for t ← 1, 2, ... do
6       m ← Binomial(|S|, p)        // simulate |S| trials
7       S ← Sample(S, m)            // retain m random elements
8       k ← Binomial(|B_t|, q)
9       B_t ← Sample(B_t, k)        // down-sample new batch
10      S ← S ∪ B_t
11      output S

The pseudocode is given as Algorithm 1. T-TBS is similar to B-TBS in that we downsample by performing a coin flip for each item with retention probability p.
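For concreteness, here is a minimal Python sketch of Algorithm 1 (the naive binomial helper stands in for the fast generators of [17]; all names are our own):

```python
import math
import random

def binomial(n, p):
    # Naive stand-in for a proper binomial sampler: the number of
    # successes in n independent trials with success probability p.
    return sum(random.random() < p for _ in range(n))

def t_tbs(batches, lam, n, b):
    """Sketch of Algorithm 1 (T-TBS); b is the assumed mean batch size,
    which must satisfy b >= n * (1 - e^{-lam})."""
    p = math.exp(-lam)
    q = n * (1.0 - p) / b            # down-sampling rate for new batches
    sample = []
    for batch in batches:
        m = binomial(len(sample), p)
        sample = random.sample(sample, m)        # decay old items
        k = binomial(len(batch), q)
        sample += random.sample(list(batch), k)  # admit k new items
        yield len(sample)

# With constant batches of size b = 100, the sample size drifts to the
# target n = 1000 and then fluctuates around it.
sizes = list(t_tbs([range(100)] * 300, lam=0.1, n=1000, b=100))
print(sum(sizes[100:]) / 200)        # close to 1000
```

The equilibrium behavior, where the expected number of decayed items per step matches the expected number of admitted items, is analyzed precisely below.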
Unlike B-TBS, we downsample the incoming batches at rate q = n(1 − e^{−λ})/b, which ensures that n becomes the equilibrium sample size. Specifically, when the sample size equals n, the expected number n(1 − e^{−λ}) of current items deleted at an update equals the expected number qb of inserted new items, which causes the sample size to drift towards n. Arguing similarly to Appendix A of [16], we have for t ≥ t′ ≥ 1 and x ∈ B_{t′} that Pr[x ∈ S_t] = q e^{−λ(t−t′)}, so that the key relative appearance property in (1) holds.

For efficiency, the algorithm exploits the fact that, for k independent trials, each having success probability r, the total number of successes has a binomial distribution with parameters k and r. Thus, in lines 6 and 8, the algorithm simulates the coin tosses by directly generating the number of successes m or k, which can be done using standard algorithms [17], and then retaining m or k randomly chosen items. So the function Binomial(j, r) returns a random sample from the binomial distribution with j independent trials and success probability r per trial, and the function Sample(A, m) returns a uniform random sample, without replacement, containing min(m, |A|) elements of the set A; note that the function call Sample(A, 0) returns an empty sample for any empty or nonempty A.

Theorem 3.1 below precisely describes the behavior of the sample size; the proof, along with the proofs of most other results in the paper, is given in Appendix C of [16]. Denote by B_t = |B_t| the (possibly random) size of B_t for t ≥ 1 and by C_t = |S_t| the sample size at time t for t ≥ 0; assume that C_0 is a finite deterministic constant. Define the upper-support ratio for a random batch size B as r = b*/b ≥ 1, where b = E[B] and b* is the smallest positive number such that P[B ≤ b*] = 1; set r = ∞ if B can be arbitrarily large. For r ∈ [1, ∞), set

ν⁺_{ϵ,r} = (1 + ϵ) ln((1 + ϵ)/r) − (1 + ϵ − r) for ϵ > 0, and
ν⁻_{ϵ,r} = (1 − ϵ) ln((1 − ϵ)/r) − (1 − ϵ − r) for ϵ ∈ (0, 1).

Note that ν⁺_{ϵ,r} > 0 and is strictly increasing in ϵ for ϵ > r − 1, and that ν⁻_{ϵ,r} increases from r − 1 − ln r to r as ϵ increases from 0 to 1. Write "i.o."
to denote that an event occurs infinitely often, i.e., for infinitely many values of t, and write "w.p.1" for "with probability 1".

Theorem 3.1. Suppose that the batch sizes {B_t}_{t ≥ 1} are i.i.d. with common mean b ≥ n(1 − e^{−λ}), finite variance, and upper-support ratio r. Then, for any p = e^{−λ} < 1,
(i) for all m ≥ 0, we have Pr[C_t = m i.o.] = 1;
(ii) E[C_t] = n + p^t(C_0 − n) for t > 0;

[Figure 1: Targeted TBS: sample size behavior; λ = decay rate and ϕ = batch size multiplier. Panels: (a) growing batch size, (b) stable batch size (deterministic), (c) stable batch size (random), (d) decaying batch size.]

(iii) lim_{t→∞} (1/t) Σ_{i=0}^{t} C_i = n w.p.1;
(iv) if C_0 = n and r < ∞, then
  (a) Pr[C_t ≥ (1 + ϵ)n] ≤ e^{−n ν⁺_{ϵ,r}} (1 + O(nϵp^t)) and
  (b) Pr[C_t ≤ (1 − ϵ)n] ≤ e^{−n ν⁻_{ϵ,r}} (1 + O(n(1 − ϵ)p^t))
for (a) ϵ, t > 0 and (b) ϵ ∈ (0, 1) and t ≥ ln ϵ / ln p.

In Appendix C of [16], we actually prove a stronger version of the theorem in which the assumption in (iv) that r < ∞ is dropped. Thus, from (ii), lim_{t→∞} E[C_t] = n, so that the expected sample size converges to the target size n as t becomes large; indeed, if C_0 = n then the expected sample size equals n for all t > 0. By (iii), an even stronger property holds in that, w.p.1, the average sample size, averaged over the first t batch-arrival times, converges to n as t becomes large. For typical batch-size distributions, the assertions in (iv) imply that, at any given time t, the probability that the sample size deviates from n by more than 100ϵ% decreases exponentially with n and, in the case of a positive deviation as in (iv)(a), super-exponentially in ϵ. However, the assertion in (i) implies that any sample size m, no matter how large, will be exceeded infinitely often w.p.1; indeed, it follows from the proof that the mean times between successive exceedances are not only finite, but are uniformly bounded over time. In summary, the sample size is generally stable and close to n on average, but is subject to infrequent, but unboundedly large, spikes in the sample size, so that sample-size control is incomplete.

Indeed, when batch sizes fluctuate in a non-predictable way, as often happens in practice, T-TBS can break down; see Figure 1, in which we plot sample sizes for T-TBS and, for comparison, R-TBS. The problem is that the value of the mean batch size b must be specified in advance, so that the algorithm cannot handle dynamic changes in b without losing control of either the decay rate or the sample size.
In Figure 1(a), for example, the (deterministic) batch size is initially fixed and the algorithm is tuned to a target sample size of 1,000, with a decay rate of λ = 0.5. At t = 200, the batch size starts to increase (with B_{t+1} = ϕB_t, where ϕ = 1.2), leading to an overflowing sample, whereas R-TBS maintains a constant sample size. Even in a stable batch-size regime with constant batch sizes (or, more generally, small variations in batch size), R-TBS can maintain a constant sample size, whereas the sample size under T-TBS fluctuates in accordance with Theorem 3.1; see Figure 1(b) for the case of a constant batch size B_t ≡ 100 with λ = 0.1. Large variations in the batch size lead to large fluctuations in the sample size for T-TBS; in this case the sample size for R-TBS is bounded above by design, but large drops in the batch size can cause drops in the sample size for both algorithms; see Figure 1(c) for the case of λ = 0.1 and i.i.d. batch sizes uniformly distributed on [0, 200], so that E[B_t] = 100. Similarly, as shown in Figure 1(d), systematically decreasing batch sizes will cause the sample size to shrink for both T-TBS and R-TBS. Here, λ = 0.1 and, as with Figure 1(a), the batch size is initially fixed and then starts to change at time t = 200, with ϕ = 0.8 in this case. This experiment, and others not reported here with varying values of λ and ϕ, indicate that R-TBS is more robust to sample underflows than T-TBS.

Overall, however, T-TBS is of interest because, when the mean batch size is known and constant over time, and when some sample overflows are tolerable, T-TBS is simple to implement and parallelize, and is very fast (see Section 6). For example, if the data comes from periodic polling of a set of robust sensors, the data arrival rate will be known a priori and will be relatively constant, except for the occasional sensor failure, and hence T-TBS might be appropriate. On the other hand, if data is coming from, e.g., a social network, then batch sizes may be hard to predict.

Remark 1. When q = 1, Theorem 3.1 provides a description of sample-size behavior for B-TBS.
Under the conditions of the theorem, the expected sample size converges to n = b/(1 − e^{−λ}), which illustrates that the sample size and decay rate cannot be controlled independently. The actual sample size fluctuates around this value, with large deviations above or below being exponentially or super-exponentially rare. Thus Theorem 3.1 both complements and refines the analysis in [27].

4 RESERVOIR-BASED TBS

Targeted time-biased sampling (T-TBS) controls the decay rate but only partially controls the sample size, whereas batched reservoir sampling (B-RS) bounds the sample size but does not allow time biasing. Our new reservoir-based time-biased sampling algorithm (R-TBS) combines the best features of both, controlling the decay rate while ensuring that the sample never overflows and has optimal sample-size and stability properties. Importantly, unlike T-TBS, the R-TBS algorithm can handle any sequence of batch sizes.

4.1 The R-TBS Algorithm

To maintain a bounded sample, R-TBS combines the use of a reservoir with the notion of item weights. In R-TBS, the weight of an item initially equals 1 but then decays at rate λ, i.e., the weight of an item i ∈ B_{t′} at time t ≥ t′ is w_t(i) = e^{−λ(t−t′)}. All items arriving at the same time have the same weight, so that the total weight of all items seen up through time t is W_t = Σ_{j=1}^{t} B_j e^{−λ(t−j)}, where, as before, B_j = |B_j| is the size of the jth batch.

R-TBS generates a sequence of latent fractional samples {L_t}_{t ≥ 0} such that (i) the size of each L_t equals the sample weight C_t, defined as C_t = min(n, W_t), and (ii) L_t contains ⌊C_t⌋ full items and at most one partial item. For example, a latent sample of size C_t = 3.6 contains three full items that belong to the actual sample S_t with probability 1 and one partial item that belongs to S_t with probability 0.6. Thus S_t is obtained by including each full item and then including the partial item according to its associated probability, so that C_t represents the expected size of S_t. E.g., in our example, the sample S_t will contain either three or four items with respective probabilities 0.4 and 0.6, so that the expected sample size is 3.6; see Figure 2. Note that if C_t = k for some k ∈ {0, 1, ..., n}, then with probability 1 the sample contains precisely k items, and C_t is the actual size of S_t, rather than just the expected size. Since each C_t by definition never exceeds n, no sample S_t ever contains more than n items.

[Figure 2: Latent sample L_t (sample weight C_t = 3.6) and possible realized samples.]

Algorithm 2: Reservoir-based TBS (R-TBS)
1   λ: decay factor (≥ 0)
2   n: maximum sample size
3   Initialize: A ← A_0; W ← C ← |A_0|; π ← ∅   // |A_0| ≤ n
4   for t ← 1, 2, ... do
5       if W < n then                       // sample has been unsaturated
6           W ← e^{−λ}W                     // decay current items
7           if W > 0 then
8               (A, π, C) ← Dsample((A, π, C), W)
9           A ← A ∪ B_t                     // accept all items in B_t
10          W ← W + |B_t|                   // update total weight
11          if W > n then                   // sample is now saturated
                // adjust for overshoot
12              (A, π, C) ← Dsample((A, π, W), n)
13      else                                // sample has been saturated
14          W ← e^{−λ}W + |B_t|             // new total weight
15          if W ≥ n then                   // still saturated
16              m ← StochRound(|B_t| · n/W)
                // replace m A-items with m B_t-items
17              A ← (A \ Sample(A, m)) ∪ Sample(B_t, m)
18          else                            // now unsaturated
                // adjust for undershoot
19              (A, π, C) ← Dsample((A, π, n), W − |B_t|)
20              A ← A ∪ B_t                 // all batch items are full
21      S ← getSample(A, π, C)
22      output S

More precisely, given a set U of items, a latent sample of U with sample weight C is a triple L = (A, π, C), where A ⊆ U is a set of ⌊C⌋ full items and π ⊆ U is a (possibly empty) set containing at most one partial item. At each time t, we randomly generate S_t from L_t = (A_t, π_t, C_t) by sampling such that

S_t = A_t ∪ π_t with probability frac(C_t), and S_t = A_t with probability 1 − frac(C_t),   (2)

where frac(x) = x − ⌊x⌋. That is, each full item is included with probability 1 and the partial item is included with probability frac(C_t).
Thus

E[|S_t|] = ⌈C_t⌉ frac(C_t) + ⌊C_t⌋ (1 − frac(C_t)) = (⌈C_t⌉ − ⌊C_t⌋) frac(C_t) + ⌊C_t⌋ = frac(C_t) + ⌊C_t⌋ = C_t,   (3)

as previously asserted. By allowing at most one partial item, we minimize the latent sample's footprint: |A_t| + |π_t| ≤ ⌊C_t⌋ + 1.

The key goal of R-TBS is to maintain the invariant

Pr[i ∈ S_t] = (C_t/W_t) w_t(i)   (4)

for each t and each item i ∈ U_t, where, as before, U_t denotes the set of all items that arrive up through time t, so that the appearance probability for an item i at time t is proportional to its weight w_t(i). This immediately implies the desired relative-inclusion property (1). Since w_t(i) = 1 for an arriving item i ∈ B_t, the equality in (4) implies that the initial acceptance probability for this item is

Pr[i ∈ S_t] = C_t/W_t.   (5)

The pseudocode for R-TBS is given as Algorithm 2. Suppose the sample is unsaturated at time t − 1 in that W_{t−1} < n and hence C_{t−1} = W_{t−1} (line 5). The decay process first reduces the total weight (and hence the sample weight) to W′_{t−1} = C′_{t−1} = e^{−λ}W_{t−1} (line 6). R-TBS then downsamples L_{t−1} (line 8) to reflect this decay and maintain a minimal sample footprint; the downsampling method, described in Section 4.2, is designed to maintain the invariant in (4). If the weight of the arriving batch does not cause the sample to overflow, i.e., C′_{t−1} + |B_t| ≤ n, then C_t = C′_{t−1} + |B_t| = W′_{t−1} + |B_t| = W_t. The relation in (5) then implies that all newly arrived items are accepted into the sample with probability 1 (line 9); see Figure 3(a) for an example of this scenario.

The situation is more complicated if the weight of the arriving batch would cause the sample to overflow. It turns out that the simplest way to deal with this scenario is to initially accept all incoming items as in line 9, and then run an additional round of downsampling to reduce the sample weight to n (line 12), so that the sample is now saturated; see Figure 3(b). Note that these two steps can be executed without ever causing the sample footprint to exceed n.

Now suppose that the sample is saturated at time t − 1, so that W_{t−1} ≥ n and hence C_{t−1} = |S_{t−1}| = n.
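To make the latent-sample machinery concrete, here is a minimal Python sketch (the names are ours, for illustration) of drawing S_t from a latent sample per (2)-(3), together with the stochastic rounding used in line 16 of Algorithm 2:

```python
import math
import random

def stoch_round(x):
    # StochRound: return ceil(x) with probability frac(x), else floor(x),
    # so that the expected value is exactly x.
    return math.floor(x) + (random.random() < x - math.floor(x))

def realize_sample(full_items, partial_item, c):
    # Draw an actual sample S from a latent sample (A, pi, C) per (2):
    # every full item is included; the partial item is included
    # with probability frac(C).
    s = list(full_items)
    if partial_item is not None and random.random() < c - math.floor(c):
        s.append(partial_item)
    return s

# For C_t = 3.6: the realized sample has 3 or 4 items, expected size 3.6.
sizes = [len(realize_sample(["a", "b", "c"], "d", 3.6)) for _ in range(10000)]
print(sum(sizes) / len(sizes))   # close to 3.6
```

Both helpers share the same rounding idea: the realized quantity is integral, but its expectation equals the real-valued target, which is what makes C_t the expected sample size in (3) and E[M] = m in the saturated case below.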
The new total weight is W_t = e^{−λ}W_{t−1} + |B_t| as before (line 14). If W_t ≥ n, then the weight of the arriving batch exceeds the weight loss due to decay, and the sample remains saturated. Then (5) implies that each item in B_t is accepted into the sample with probability p = n/W_t. Letting I_j = 1 if item j ∈ B_t is accepted and I_j = 0 otherwise, we see that the expected number of accepted items is

m = E[Σ_{j∈B_t} I_j] = Σ_{j∈B_t} E[I_j] = Σ_{j∈B_t} Pr[I_j = 1] = |B_t| n/W_t.

There are a number of possible ways to carry out this acceptance operation, e.g., via independent coin flips. To minimize the variability of the sample size (and hence the likelihood of severely small samples), R-TBS uses stochastic rounding in line 16 and accepts a random number of items M such that M = ⌈m⌉ with probability frac(m) and M = ⌊m⌋ with probability 1 − frac(m), so that E[M] = m by an argument essentially the same as in (3). To maintain the bound on the sample size, the M accepted items replace M randomly selected "victims" in the current sample (line 17).

If W_t < n, then the sample weight decays to e^{−λ}W_{t−1} and the weight of the arriving batch is not enough to fill the sample back up. Moreover, (5) implies that all arriving items are accepted with probability 1. Thus we downsample to the decayed weight e^{−λ}W_{t−1} = W_t − |B_t| in line 19 and then insert the arriving items in line 20.

[Figure 3: R-TBS scenarios for n = 4 and e^{−λ} = 0.5: (a) unsaturated to unsaturated, (b) unsaturated to saturated, (c) saturated to unsaturated, (d) saturated to saturated. For simplicity, we take W_{t−1} = C_{t−1}. DS denotes downsampling.]

4.2 Downsampling

Before describing Algorithm 3, the downsampling algorithm, we intuitively motivate a key property that any such procedure must have. For any item i ∈ L, the relation in (4) implies that we must have Pr[i ∈ S] = (C/W)w_i and Pr[i ∈ S′] = (C′/W′)w′_i, where W and w_i represent the total and item weight before decay and downsampling, and W′ and w′_i represent the weights afterwards. Since decay affects all items equally, we have w_i/W = w′_i/W′, and it follows that

Pr[i ∈ S′] = (C′/C) Pr[i ∈ S].   (6)

That is, the inclusion probabilities for all items must be scaled down by the same fraction, namely C′/C. Theorem 4.1 (later in this section) asserts that Algorithm 3 satisfies this property.

Algorithm 3: Downsampling
1   L = (A, π, C): input latent sample
2   C′: input target weight with 0 < C′ < C
3   L′ = (A′, π′, C′): output latent sample
4   U ← Uniform()
5   if ⌊C′⌋ = 0 then                  // no full items retained
6       if U > frac(C)/C then
7           (A, π) ← Swap1(A, π)
8       A′ ← ∅
9   else if 0 < ⌊C′⌋ = ⌊C⌋ then       // no items deleted
10      if U > (1 − (C′/C) frac(C)) / (1 − frac(C′)) then
11          (A′, π′) ← Swap1(A, π)
12  else                              // items deleted: 0 < ⌊C′⌋ < ⌊C⌋
13      if U ≤ (C′/C) frac(C) then
14          A′ ← Sample(A, ⌊C′⌋)
15          (A′, π′) ← Swap1(A′, π)
16      else
17          A′ ← Sample(A, ⌊C′⌋ + 1)
18          (A′, π′) ← Move1(A′, π)
19  if ⌊C′⌋ = C′ then                 // no fractional item
20      π′ ← ∅

In the pseudocode for Algorithm 3, the function Uniform() generates a random number uniformly distributed on [0, 1]. The subroutine Swap1(A, π) moves a randomly selected item from A to π and moves the current item in π (if any) to A. Similarly, Move1(A, π) moves a randomly selected item from A to π, replacing the current item in π (if any). More precisely, Swap1(A, π) executes the operations I ← Sample(A, 1), A ← (A \ I) ∪ π, and π ← I, and Move1(A, π) executes the operations I ← Sample(A, 1), A ← A \ I, and π ← I.

To gain some intuition for why the algorithm works, consider a simple special case, where the goal is to form a fractional sample L′ = (A′, π′, C′) from a fractional sample L = (A, π, C) of integral size C > C′; that is, L comprises exactly C full items. Assume that C′ is non-integral, so that L′ contains a partial item.
In this case, we simply select an item at random (from A) to be the partial item in L′ and then select ⌊C′⌋ of the remaining C − 1 items at random to be the full items in L′; see Figure 4(a). By symmetry, each item i ∈ L is equally likely to be included in S′, so that the inclusion probabilities for the items in L are all scaled down by the same fraction, as required by (6). For example, taking t = 0 in Figure 4(a), item a appears in S_t with probability 1 since it is a full item. In S_t′, where the weights have been reduced by 50%, item a (either as a full or partial item, depending on the random outcome) appears with probability 2·(1/6) + 2·(1/6)·0.5 = 0.5, as expected. This scenario corresponds to lines 17 and 18 of the algorithm, where we carry out the above selections by randomly sampling ⌊C′⌋ + 1 items from A to form A′ and then choosing a random item in A′ as the partial item by moving it to π′.

In the case where L contains a partial item i that appears in S with probability frc(C), it follows from (6) that i should appear in S′ with probability p = (C′/C) Pr[i ∈ S] = (C′/C) frc(C). Thus, with probability p, lines 13–15 retain i and convert it to a full item so that it appears in S′. Otherwise, in lines 17 and 18, i is removed from the sample when it is overwritten by a random item from A′; see Figure 4(b). Again, a new partial item is chosen from A in a random manner so as to uniformly scale down the inclusion probabilities. For instance, in Figure 4(b), item d appears in S_t with probability 0.2 (because it is a partial item) and in S_t′ appears with probability 3·(0.1/3) = 0.1. Similarly, item a appears in S_t with probability 1 and in S_t′ with probability (1.8)/6 + 0.6·(1.8/6) + 0.6·(0.1/3) = 0.5.

The if-statement in line 5 corresponds to the corner case in which L′ does not contain a full item. The partial item i ∈ L either becomes full or is swapped into A′ and then immediately ejected; see Figure 4(c). The if-statement in line 9 corresponds to the case in which no items are deleted from the latent sample, e.g., when C = 4.7 and C′ = 4.2. In this case, i either becomes full by being swapped into A′ or remains as the partial item for L′.
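To make the two randomized primitives concrete, here is a minimal Python sketch (ours, not the authors' implementation) of stochastic rounding and of Algorithm 3 with its Swap1/Move1 subroutines; a latent sample is represented as a list A of full items plus a partial item pi that is realized with probability frc(C):

```python
import math
import random

def frc(x):
    return x - math.floor(x)

def stochastic_round(m, rng=random):
    """Return floor(m) + 1 with probability frc(m), else floor(m), so the mean is m."""
    base = math.floor(m)
    return base + (1 if rng.random() < m - base else 0)

def swap1(A, pi, rng):
    """Move a random item of A to the partial slot; the old partial item (if any) joins A."""
    item = A.pop(rng.randrange(len(A)))
    if pi is not None:
        A.append(pi)
    return A, item

def move1(A, pi, rng):
    """Move a random item of A to the partial slot, replacing the old partial item."""
    item = A.pop(rng.randrange(len(A)))
    return A, item

def downsample(A, pi, C, C_new, rng=random):
    """Algorithm 3 sketch: shrink latent sample (A, pi, C) to target weight C_new,
    where 0 < C_new < C. A is the list of full items, pi the partial item (or None)."""
    A2, pi2 = list(A), pi
    U = rng.random()
    if math.floor(C_new) == 0:                     # no full items retained
        if U > frc(C) / C:
            A2, pi2 = swap1(A2, pi2, rng)
        A2 = []
    elif math.floor(C_new) == math.floor(C):       # no items deleted
        if U > (1 - (C_new / C) * frc(C)) / (1 - frc(C_new)):
            A2, pi2 = swap1(A2, pi2, rng)
    else:                                          # items deleted
        if U <= (C_new / C) * frc(C):
            A2 = rng.sample(A2, math.floor(C_new))
            A2, pi2 = swap1(A2, pi2, rng)
        else:
            A2 = rng.sample(A2, math.floor(C_new) + 1)
            A2, pi2 = move1(A2, pi2, rng)
    if frc(C_new) == 0:                            # no fractional item
        pi2 = None
    return A2, pi2

def realize(A, pi, C, rng=random):
    """Draw the actual sample S from a latent sample: pi is included w.p. frc(C)."""
    S = set(A)
    if pi is not None and rng.random() < frc(C):
        S.add(pi)
    return S
```

Simulating this sketch on the Figure 4(b) scenario (C = 3.2 down to C′ = 1.6) reproduces the inclusion probabilities derived above: item d survives with probability about 0.1 and a full item such as a with probability about 0.5.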
Denoting by ρ the probability of not swapping, we have P[i ∈ S′] = ρ·frc(C′) + (1 − ρ)·1. On the other hand, (6) implies that P[i ∈ S′] = (C′/C) frc(C). Equating these expressions shows that ρ must equal the expression on the right side of the inequality on line 10; see Figure 4(d). Formally, we have the following result.

Theorem 4.1. For 0 < C′ < C, let L′ = (A′, π′, C′) be the latent sample produced from a latent sample L = (A, π, C) via Algorithm 3, and let S and S′ be samples produced from L and L′ via (2). Then Pr[i ∈ S′] = (C′/C) Pr[i ∈ S] for all i ∈ L.

4.3 Properties of R-TBS

Theorem 4.2 below asserts that R-TBS satisfies (4) and hence (1), thereby maintaining the correct inclusion probabilities; see Appendix C of [16] for the proof. Theorems 4.3 and 4.4 assert that, among all sampling algorithms with exponential time biasing, R-TBS both maximizes the expected sample size in unsaturated scenarios and minimizes sample-size variability. Thus R-TBS tends to yield more accurate results (from more training data) and greater stability in both result quality and retraining costs.

Theorem 4.2. The relation Pr[i ∈ S_t] = (C_t/W_t)·w_t(i) holds for all t ≥ 1 and i ∈ U_t.

Theorem 4.3. Let H be any sampling algorithm that satisfies (1) and denote by S_t and S_t^H the samples produced at time t by R-TBS and H. If the total weight at some time t ≥ 1 satisfies W_t < n, then E[|S_t^H|] ≤ E[|S_t|].

Proof. Since H satisfies (1), it follows that, for each time j ≤ t and i ∈ B_j, the inclusion probability Pr[i ∈ S_t^H] must be of the form r_t·e^{-λ(t-j)} for some function r_t independent of j. Taking j = t, we see that r_t ≤ 1. For R-TBS in an unsaturated state, (4) implies that r_t = C_t/W_t = 1, so that Pr[i ∈ S_t^H] ≤ Pr[i ∈ S_t], and the desired result follows directly.

Theorem 4.4. Let H be any sampling algorithm that satisfies (1) and has maximal expected sample size C_t, and denote by S_t and S_t^H the samples produced at time t by R-TBS and H. Then Var[|S_t^H|] ≥ Var[|S_t|] for any time t ≥ 1.

Proof. Considering all possible distributions over the sample size having mean value equal to C_t, it is straightforward to show that variance is minimized by concentrating all of the probability mass onto ⌈C_t⌉ and ⌊C_t⌋. There is precisely one such distribution, namely the stochastic-rounding distribution, and this is precisely the sample-size distribution attained by R-TBS.

5 DISTRIBUTED TBS ALGORITHMS

In this section, we describe how to implement distributed versions of T-TBS and R-TBS to handle large volumes of data.

5.1 Overview of Distributed Algorithms

The distributed T-TBS and R-TBS algorithms, denoted as D-T-TBS and D-R-TBS respectively, need to distribute large data sets across the cluster and parallelize the computation on them.

Overview of D-T-TBS: The implementation of the D-T-TBS algorithm is very similar to the simple distributed Bernoulli time-biased sampling algorithm in [27]. It is embarrassingly parallel, requiring no coordination. At each time point t, each worker in the cluster subsamples its partition of the sample with probability p, subsamples its partition of B_t with probability q, and then takes a union of the resulting data sets.

Overview of D-R-TBS: This algorithm, unlike D-T-TBS, maintains a bounded sample, and hence cannot be embarrassingly parallel. D-R-TBS first needs to aggregate the local batch sizes to compute the incoming batch size |B_t| and to maintain the total weight W. Then, based on |B_t| and the previous total weight W, D-R-TBS determines whether the reservoir was previously saturated and whether it will be saturated after processing B_t.
For each possible situation, D-R-TBS chooses the items in the reservoir to delete through downsampling and the items in B_t to insert into the reservoir. This process requires the master to coordinate among the workers. In Section 5.3, we introduce two alternative approaches to determining the deleted and inserted items. Finally, the algorithm applies the deletes and inserts to form the new reservoir, and computes the new total weight W. Both D-T-TBS and D-R-TBS periodically checkpoint the sample as well as other system state variables to ensure fault tolerance. The implementation details for D-T-TBS are mostly subsumed by those for D-R-TBS, so we focus on the latter.

Figure 4: Downsampling examples (t = 0). (a) From C_t = 3 to C_t′ = 1.5. (b) From C_t = 3.2 to C_t′ = 1.6. (c) From C_t = 2.4 to C_t′ = 0.4. (d) From C_t = 2.4 to C_t′ = 2.1.

Figure 5: Design choices for implementing the reservoir: (a) key-value store; (b) co-partitioned reservoir.

Figure 6: Retrieving insert items: (a) centralized decisions; (b) distributed decisions.

5.2 Distributed Data Structures

There are two important data structures in the D-R-TBS algorithm: the incoming batch and the reservoir. Conceptually, we view an incoming batch B_t as an array of slots numbered from 1 through |B_t|, and the reservoir as an array of slots numbered from 1 through ⌊C⌋ containing the full items, plus a special slot for the partial item. For both data structures, data items need to be distributed into partitions due to the large data volumes. Therefore, the slot number of an item maps to a specific partition ID and a position inside the partition.
The incoming batch usually comes from a distributed streaming system, such as Spark Streaming; the actual data structure is specific to the streaming system (e.g., an incoming batch is stored as an RDD in Spark Streaming). As a result, the partitioning strategy of the incoming batch is opaque to the D-R-TBS algorithm. Unlike the incoming batch, which is read-only and discarded at the end of each time period, the reservoir data structure must be continually updated. An effective strategy for storing and operating on the reservoir is thus crucial for good performance. We now explore alternative approaches to implementing the reservoir.

Distributed in-memory key-value store: One quite natural approach implements the reservoir using an off-the-shelf distributed in-memory key-value store, such as Redis [25] or Memcached [23]. In this scheme, each item in the reservoir is stored as a key-value pair, with the slot number as the key and the item as the value. Inserts and deletes to the reservoir naturally translate into put and delete operations on the key-value store.

There are two major limitations to this approach. Firstly, the hash-based or range-based data-partitioning scheme used by a distributed key-value store yields reservoir partitions that do not correlate with the partitions of the incoming batch. As illustrated in Figure 5(a), when items from a given partition of an incoming batch are inserted into the reservoir, the inserts touch many (if not all) partitions of the reservoir, incurring heavy network I/O. Secondly, key-value stores incur needless concurrency-control overhead. For each batch, D-R-TBS already carefully coordinates the deletes and inserts so that no two delete or insert operations access the same slots in the reservoir, and there is no danger of write-write or read-write conflicts.

Co-partitioned reservoir: In the alternative approach, we implement a distributed in-memory data structure for the reservoir so as to ensure that the reservoir partitions coincide with the partitions of the incoming batches, as shown in Figure 5(b). This can be achieved in spite of the unknown partitioning scheme of the streaming system. Specifically, the reservoir is initially empty, and all items in the reservoir come from the incoming batches. Therefore, if an item from a given partition of an incoming batch is always inserted into the corresponding local reservoir partition, and deletes are also handled locally, then the co-partitioning and co-location of the reservoir and incoming batch partitions is automatic. For our experiments, we implemented the co-partitioned reservoir in Spark using the in-place updating technique for RDDs in [27]; see Appendix E of [16]. Note that, at any point in time, a given slot number in the reservoir maps to a specific partition ID and position inside the partition. Thus the slot number for a given full item may change over time due to reservoir insertions and deletions. This does not cause any statistical issues, because the functioning of the set-based algorithm is oblivious to specific slot numbers.
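The locality argument can be illustrated with a small sketch (hypothetical class and method names of ours; the paper's Spark implementation instead updates RDDs in place): each reservoir partition applies its inserts and deletes purely locally, so no item ever crosses the network.

```python
import random

class ReservoirPartition:
    """Sketch of one co-partitioned reservoir partition: all inserts and deletes
    touch only this partition's local item list."""
    def __init__(self):
        self.items = []

    def insert_local(self, batch_partition, positions):
        # Append the chosen items from the co-located incoming-batch partition.
        for r in positions:
            self.items.append(batch_partition[r])

    def delete_local(self, num_victims, rng=random):
        # Remove randomly chosen local victims; swap-and-pop keeps slots contiguous.
        for _ in range(num_victims):
            i = rng.randrange(len(self.items))
            self.items[i] = self.items[-1]
            self.items.pop()
```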
5.3 Choosing Items to Delete and Insert

In order to bound the reservoir size, D-R-TBS requires careful coordination when choosing the set of items to delete from, and insert into, the reservoir. At the same time, D-R-TBS must ensure the statistical correctness of the random number generation and random permutation operations in the distributed environment. We consider two possible approaches.

Centralized decisions: In the most straightforward approach, the master makes centralized decisions about which items to delete and insert. For deletes, the driver generates the slot numbers of the items in the reservoir to be deleted, which are then mapped to the actual data locations in a manner that depends on the representation of the reservoir (key-value store or co-partitioned reservoir). For inserts, the driver generates the slot numbers of the incoming items in B_t that need to be inserted into the reservoir. Suppose that B_t comprises k ≥ 1 partitions. Each generated slot number i ∈ {1, 2, ..., |B_t|} is mapped to a partition p_i of B_t (where 0 ≤ p_i ≤ k − 1) and a position r_i inside partition p_i. Denote by Q the set of item locations, i.e., the set of (p_i, r_i) pairs. In order to perform the inserts, we need to first retrieve the actual items based on the item locations. This can be achieved with a join-like operation between Q and B_t, with the (p_i, r_i) pair matching the actual location of an item inside B_t. To optimize this operation, we make Q a distributed data structure and use a customized partitioner to ensure that all pairs (p_i, r_i) with p_i = j are co-located with partition j of B_t for j = 0, 1, ..., k − 1. Then a co-partitioned and co-located join can be carried out between Q and B_t, as illustrated in Figure 6(a) for k = 3. The resulting set of retrieved insert items, denoted by S, is also co-partitioned with B_t as a by-product. After that, the actual deletes and inserts are carried out in a manner that depends on how the reservoir is stored, as discussed below. When the reservoir is implemented as a key-value store, the deletes can be directly applied based on the slot numbers.
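The mapping from a generated global slot number to a (p_i, r_i) pair can be sketched as follows, assuming the master knows the k per-partition sizes of B_t (a hypothetical helper of ours; the paper does not spell out this computation):

```python
import bisect
import itertools

def slot_to_location(slot, partition_sizes):
    """Map a 1-based global slot number in {1, ..., |B_t|} to a (partition_id,
    position) pair, given the sizes of the k partitions of the incoming batch."""
    ends = list(itertools.accumulate(partition_sizes))  # cumulative slot-range ends
    p = bisect.bisect_left(ends, slot)                  # first partition covering slot
    start = ends[p - 1] if p > 0 else 0
    return p, slot - start                              # 1-based position in partition p
```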
For inserts, the master takes each generated slot number of an item in B_t and chooses a companion destination slot number in the reservoir into which the B_t item will be inserted. This destination reservoir slot might currently be empty due to an earlier deletion, or might contain an item that will now be replaced by the newly inserted batch item. After the actual items to insert are retrieved as described previously, the destination slot numbers are used to put the items into the right locations in the key-value store. When the co-partitioned reservoir is used, the delete slot numbers in the reservoir are mapped to (p_i, r_i) pairs of reservoir partitions and positions inside the partitions. As with the inserts, we again use a customized partitioner for the set of pairs R such that deletes are co-located with the corresponding reservoir partitions. Then a join-like operation on R and the reservoir performs the actual delete operations on the reservoir. For inserts, we simply use another join-like operation on the set of retrieved insert items S and the reservoir to add the corresponding insert items to the co-located partitions of the reservoir. In this approach, we don't need the master to generate destination reservoir slot numbers for the insert items, because we view the reservoir as a set when using the co-partitioned reservoir data structure.

Distributed decisions: The above approach requires generating a large number of slot numbers inside the master, so we now explore an alternative approach that offloads the slot-number generation to the workers while still ensuring the statistical correctness of the computation. In this approach, the master chooses only the number of deletes and inserts per worker according to appropriate multivariate hypergeometric distributions. For deletes, each worker chooses random victims from its local partition of the reservoir based on the number of deletes given by the master. For inserts, each worker randomly and uniformly selects items from its local partition of the incoming batch B_t given the number of inserts.
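The per-partition counts are distributed as the partition memberships of a uniform without-replacement sample over the whole data structure. The sketch below (illustrative only; it materializes a pool of size equal to the total item count, so a real implementation would use a sequential hypergeometric sampler instead) makes that definition executable:

```python
import random
from collections import Counter

def multivariate_hypergeometric(partition_sizes, n, rng=random):
    """Draw per-partition counts distributed as the number of items landing in each
    partition when n items are sampled without replacement from the union of all
    partitions. Exact but O(total items) -- for illustration only."""
    pool = [p for p, size in enumerate(partition_sizes) for _ in range(size)]
    drawn = Counter(rng.sample(pool, n))
    return [drawn.get(p, 0) for p in range(len(partition_sizes))]
```

The master would draw one such vector for deletes (over reservoir partition sizes) and one for inserts (over batch partition sizes), then send each worker only its own count.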
Figure 6(b) depicts how the insert items are retrieved under this decentralized approach. We use the technique in [15] for parallel pseudo-random number generation. Note that this distributed decision-making approach works only when the co-partitioned reservoir data structure is used. This is because the key-value store representation of the reservoir requires a target reservoir slot number for each insert item from the incoming batch, and the target slot numbers have to be generated in such a way as to ensure that, after the deletes and inserts, all of the slot numbers are still unique and contiguous in the new reservoir. This requires a lot of coordination among the workers, which inhibits truly distributed decision making.

6 EXPERIMENTS

In this section, we study the empirical performance of D-R-TBS and D-T-TBS, and demonstrate the potential benefit of using them for model retraining in online model management. We implemented D-R-TBS and D-T-TBS on Spark (refer to Appendix E of [16] for implementation details).

Experimental Setup: All performance experiments were conducted on a cluster of 13 IBM System x iDataPlex dx340 servers. Each has two quad-core Intel Xeon processors and 32GB of RAM. Servers are interconnected using a 1Gbit Ethernet, and each server runs Ubuntu Linux, Java 1.7, and Spark.

One server is dedicated to running the Spark coordinator, and each of the remaining 12 servers runs Spark workers. There is one worker per processor on each machine, and each worker is given all 4 cores to use, along with 8 GB of dedicated memory. All other Spark parameters are set to their default values. We used Memcached as the key-value store in our experiments. For all experiments, data was streamed in from HDFS using Spark Streaming as microbatches. We report the run time per round as the average over 10 rounds, discarding the first round from this average because of Spark startup costs. Unless otherwise stated, each batch contains 1 million items, the target reservoir size is 2 million elements, and the decay parameter λ is held fixed.

Figure 7: Per-batch distributed runtime comparison (D-R-TBS with Cent/Dist decisions, KV/CP reservoir, RJ/CJ join; D-T-TBS (Dist,CP))

Figure 8: Scale-out of D-R-TBS

Figure 9: Scale-up of D-R-TBS

6.1 Runtime Performance

Comparison of TBS Implementations: Figure 7 shows the average runtime per batch for five different implementations of distributed TBS algorithms. The first four (colored black) are D-R-TBS implementations with different design choices: whether to use centralized or distributed decisions (abbreviated as "Cent" and "Dist", respectively) for choosing the items to delete and insert, and whether to use a key-value store ("KV") or a co-partitioned reservoir ("CP") for storing the reservoir. The first two implementations both use the key-value store representation for the reservoir together with the centralized decision strategy for determining inserts and deletes. They differ only in how the insert items are actually retrieved when subsampling the incoming batch. The first uses the standard repartition join ("RJ"), whereas the second uses the customized partitioner and co-located join ("CJ") described in Section 5.3 and depicted in Figure 6(a).
This optimization effectively cuts the network cost in half, but the KV representation of the reservoir still requires the insert items to be written across the network to their corresponding reservoir locations. The third implementation employs the co-partitioned reservoir instead, resulting in a significant speedup of over 2.6x. The fourth implementation additionally employs distributed decisions for choosing the items to delete and insert. This yields a further 1.6x speedup. We use this D-R-TBS implementation in the remaining experiments.

The fifth implementation (colored grey) in Figure 7 is D-T-TBS using the co-partitioned reservoir and the distributed strategy for choosing the delete and insert items. Since D-T-TBS is embarrassingly parallelizable, it is much faster than the best D-R-TBS implementation. But, as we discussed in Section 3, T-TBS works only under a very strong restriction on the data arrival rate, and can suffer from occasional memory overflows; see Figure 1. In contrast, D-R-TBS is much more robust and works in realistic scenarios where it is hard to predict the data arrival rate.

Scalability of D-R-TBS: Figure 8 shows how D-R-TBS scales with the number of workers. We increased the batch size to 10 million items for this experiment. Initially, D-R-TBS scales out very nicely with the increasing number of workers. However, beyond 10 workers, the marginal benefit from additional workers is small, because the coordination and communication overheads, as well as the inherent Spark overhead, become prominent. For the same reasons, in the scale-up experiment in Figure 9, the runtime stays roughly constant until the batch size reaches 10 million items and then increases sharply. This is because processing the streaming input and maintaining the sample start to dominate the coordination and communication overhead. With 10 workers, D-R-TBS can handle a data flow comprising 10 million items arriving approximately every 14 seconds.
6.2 Application: Classification using kNN

We now demonstrate the potential benefits of the R-TBS sampling scheme for periodically retraining representative ML models in the presence of evolving data. For each model and data set, we compare the quality of models retrained on the samples generated by R-TBS, a simple sliding window (SW), and uniform reservoir sampling (UNIF). Due to limited space, we do not give quality results for T-TBS; we found that whenever it applies, i.e., when the mean batch size is known and constant, the quality is very similar to that of R-TBS, since they both use time-biased sampling.

Our first model is a kNN classifier, where a class is predicted for each item in an incoming batch by taking a majority vote of the classes of the k nearest neighbors in the current sample, based on Euclidean distance; the sample is then updated using the batch. To generate training data, we first generate 100 class centroids uniformly in a [0, 80] × [0, 80] rectangle. Each data item is then generated from a Gaussian mixture model and falls into one of the 100 classes. Over time, the data generation process operates in one of two "modes". In the "normal" mode, the frequency of items from any of the first 50 classes is five times higher than that of items from any of the second 50 classes. In the "abnormal" mode, the frequencies are five times lower. Thus the frequent and infrequent classes switch roles at a mode change. We generate each data point by randomly choosing a ground-truth class c_i with centroid (x_i, y_i) according to relative frequencies that depend upon the current mode, and then generating the data point's (x, y) coordinates independently as samples from N(x_i, 1) and N(y_i, 1). Here N(µ, σ) denotes the normal distribution with mean µ and standard deviation σ. In this experiment, the batch sizes are deterministic with b = 100 items, and we use k = 7 neighbors for the kNN classifier. The reservoir size for both R-TBS and UNIF is 1,000, and SW contains the last 1,000 items; thus all methods use the same amount of data for retraining. (We choose this value because it achieves near-maximal classification accuracies for all techniques.
In general, we choose sampling and ML parameters to achieve good learning performance while ensuring fair comparisons.) In each run, the sample is warmed up by processing 100 normal-mode batches before the classification task begins. Our experiments focus on two types of temporal patterns in the data, as described below.

Single change: Here we model the occurrence of a singular event. The data is generated in normal mode up to t = 10 (time is measured here in units after warm-up), then switches to abnormal mode
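The two-mode Gaussian mixture generator described above can be sketched as follows; this is our own rendering under the stated parameters (100 classes, a [0, 80] × [0, 80] rectangle, a 5:1 frequency ratio), not the authors' code:

```python
import random

def make_centroids(num_classes=100, rng=random):
    """Class centroids drawn uniformly in the [0, 80] x [0, 80] rectangle."""
    return [(rng.uniform(0, 80), rng.uniform(0, 80)) for _ in range(num_classes)]

def gen_point(centroids, mode, rng=random):
    """Draw one labeled point: pick a ground-truth class c_i by the mode-dependent
    relative frequencies, then sample coordinates from N(x_i, 1) and N(y_i, 1)."""
    half = len(centroids) // 2
    if mode == "normal":   # first half of the classes is five times more frequent
        weights = [5] * half + [1] * (len(centroids) - half)
    else:                  # "abnormal": the frequent and infrequent classes swap
        weights = [1] * half + [5] * (len(centroids) - half)
    c = rng.choices(range(len(centroids)), weights=weights)[0]
    x_i, y_i = centroids[c]
    return (rng.gauss(x_i, 1.0), rng.gauss(y_i, 1.0)), c
```

In normal mode a fraction of about 5/6 of the generated points carries a label from the first 50 classes, which is what makes those classes dominate the training sample until a mode change occurs.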


Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 17 CS 70 Discrete Mthemtics nd Proility Theory Summer 2014 Jmes Cook Note 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion, y tking

More information

Credibility Hypothesis Testing of Fuzzy Triangular Distributions

Credibility Hypothesis Testing of Fuzzy Triangular Distributions 666663 Journl of Uncertin Systems Vol.9, No., pp.6-74, 5 Online t: www.jus.org.uk Credibility Hypothesis Testing of Fuzzy Tringulr Distributions S. Smpth, B. Rmy Received April 3; Revised 4 April 4 Abstrct

More information

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by.

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by. NUMERICAL INTEGRATION 1 Introduction The inverse process to differentition in clculus is integrtion. Mthemticlly, integrtion is represented by f(x) dx which stnds for the integrl of the function f(x) with

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

Numerical integration

Numerical integration 2 Numericl integrtion This is pge i Printer: Opque this 2. Introduction Numericl integrtion is problem tht is prt of mny problems in the economics nd econometrics literture. The orgniztion of this chpter

More information

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a).

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a). The Fundmentl Theorems of Clculus Mth 4, Section 0, Spring 009 We now know enough bout definite integrls to give precise formultions of the Fundmentl Theorems of Clculus. We will lso look t some bsic emples

More information

How do we solve these things, especially when they get complicated? How do we know when a system has a solution, and when is it unique?

How do we solve these things, especially when they get complicated? How do we know when a system has a solution, and when is it unique? XII. LINEAR ALGEBRA: SOLVING SYSTEMS OF EQUATIONS Tody we re going to tlk bout solving systems of liner equtions. These re problems tht give couple of equtions with couple of unknowns, like: 6 2 3 7 4

More information

Lecture 3 Gaussian Probability Distribution

Lecture 3 Gaussian Probability Distribution Introduction Lecture 3 Gussin Probbility Distribution Gussin probbility distribution is perhps the most used distribution in ll of science. lso clled bell shped curve or norml distribution Unlike the binomil

More information

Theoretical foundations of Gaussian quadrature

Theoretical foundations of Gaussian quadrature Theoreticl foundtions of Gussin qudrture 1 Inner product vector spce Definition 1. A vector spce (or liner spce) is set V = {u, v, w,...} in which the following two opertions re defined: (A) Addition of

More information

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

Sufficient condition on noise correlations for scalable quantum computing

Sufficient condition on noise correlations for scalable quantum computing Sufficient condition on noise correltions for sclble quntum computing John Presill, 2 Februry 202 Is quntum computing sclble? The ccurcy threshold theorem for quntum computtion estblishes tht sclbility

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

Lecture 13 - Linking E, ϕ, and ρ

Lecture 13 - Linking E, ϕ, and ρ Lecture 13 - Linking E, ϕ, nd ρ A Puzzle... Inner-Surfce Chrge Density A positive point chrge q is locted off-center inside neutrl conducting sphericl shell. We know from Guss s lw tht the totl chrge on

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below. Dulity #. Second itertion for HW problem Recll our LP emple problem we hve been working on, in equlity form, is given below.,,,, 8 m F which, when written in slightly different form, is 8 F Recll tht we

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information

Monte Carlo method in solving numerical integration and differential equation

Monte Carlo method in solving numerical integration and differential equation Monte Crlo method in solving numericl integrtion nd differentil eqution Ye Jin Chemistry Deprtment Duke University yj66@duke.edu Abstrct: Monte Crlo method is commonly used in rel physics problem. The

More information

Student Activity 3: Single Factor ANOVA

Student Activity 3: Single Factor ANOVA MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether

More information

Lecture 1: Introduction to integration theory and bounded variation

Lecture 1: Introduction to integration theory and bounded variation Lecture 1: Introduction to integrtion theory nd bounded vrition Wht is this course bout? Integrtion theory. The first question you might hve is why there is nything you need to lern bout integrtion. You

More information

Numerical Analysis: Trapezoidal and Simpson s Rule

Numerical Analysis: Trapezoidal and Simpson s Rule nd Simpson s Mthemticl question we re interested in numericlly nswering How to we evlute I = f (x) dx? Clculus tells us tht if F(x) is the ntiderivtive of function f (x) on the intervl [, b], then I =

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =

More information

We partition C into n small arcs by forming a partition of [a, b] by picking s i as follows: a = s 0 < s 1 < < s n = b.

We partition C into n small arcs by forming a partition of [a, b] by picking s i as follows: a = s 0 < s 1 < < s n = b. Mth 255 - Vector lculus II Notes 4.2 Pth nd Line Integrls We begin with discussion of pth integrls (the book clls them sclr line integrls). We will do this for function of two vribles, but these ides cn

More information

f(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral

f(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral Improper Integrls Every time tht we hve evluted definite integrl such s f(x) dx, we hve mde two implicit ssumptions bout the integrl:. The intervl [, b] is finite, nd. f(x) is continuous on [, b]. If one

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificil Intelligence Spring 2007 Lecture 3: Queue-Bsed Serch 1/23/2007 Srini Nrynn UC Berkeley Mny slides over the course dpted from Dn Klein, Sturt Russell or Andrew Moore Announcements Assignment

More information

CS 188 Introduction to Artificial Intelligence Fall 2018 Note 7

CS 188 Introduction to Artificial Intelligence Fall 2018 Note 7 CS 188 Introduction to Artificil Intelligence Fll 2018 Note 7 These lecture notes re hevily bsed on notes originlly written by Nikhil Shrm. Decision Networks In the third note, we lerned bout gme trees

More information

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007 A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus

More information

Deteriorating Inventory Model for Waiting. Time Partial Backlogging

Deteriorating Inventory Model for Waiting. Time Partial Backlogging Applied Mthemticl Sciences, Vol. 3, 2009, no. 9, 42-428 Deteriorting Inventory Model for Witing Time Prtil Bcklogging Nit H. Shh nd 2 Kunl T. Shukl Deprtment of Mthemtics, Gujrt university, Ahmedbd. 2

More information

Improper Integrals, and Differential Equations

Improper Integrals, and Differential Equations Improper Integrls, nd Differentil Equtions October 22, 204 5.3 Improper Integrls Previously, we discussed how integrls correspond to res. More specificlly, we sid tht for function f(x), the region creted

More information

Parse trees, ambiguity, and Chomsky normal form

Parse trees, ambiguity, and Chomsky normal form Prse trees, miguity, nd Chomsky norml form In this lecture we will discuss few importnt notions connected with contextfree grmmrs, including prse trees, miguity, nd specil form for context-free grmmrs

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information

Numerical Integration. 1 Introduction. 2 Midpoint Rule, Trapezoid Rule, Simpson Rule. AMSC/CMSC 460/466 T. von Petersdorff 1

Numerical Integration. 1 Introduction. 2 Midpoint Rule, Trapezoid Rule, Simpson Rule. AMSC/CMSC 460/466 T. von Petersdorff 1 AMSC/CMSC 46/466 T. von Petersdorff 1 umericl Integrtion 1 Introduction We wnt to pproximte the integrl I := f xdx where we re given, b nd the function f s subroutine. We evlute f t points x 1,...,x n

More information

Lecture 19: Continuous Least Squares Approximation

Lecture 19: Continuous Least Squares Approximation Lecture 19: Continuous Lest Squres Approximtion 33 Continuous lest squres pproximtion We begn 31 with the problem of pproximting some f C[, b] with polynomil p P n t the discrete points x, x 1,, x m for

More information

Administrivia CSE 190: Reinforcement Learning: An Introduction

Administrivia CSE 190: Reinforcement Learning: An Introduction Administrivi CSE 190: Reinforcement Lerning: An Introduction Any emil sent to me bout the course should hve CSE 190 in the subject line! Chpter 4: Dynmic Progrmming Acknowledgment: A good number of these

More information

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph

More information

The steps of the hypothesis test

The steps of the hypothesis test ttisticl Methods I (EXT 7005) Pge 78 Mosquito species Time of dy A B C Mid morning 0.0088 5.4900 5.5000 Mid Afternoon.3400 0.0300 0.8700 Dusk 0.600 5.400 3.000 The Chi squre test sttistic is the sum of

More information

Calculus I-II Review Sheet

Calculus I-II Review Sheet Clculus I-II Review Sheet 1 Definitions 1.1 Functions A function is f is incresing on n intervl if x y implies f(x) f(y), nd decresing if x y implies f(x) f(y). It is clled monotonic if it is either incresing

More information

Math 360: A primitive integral and elementary functions

Math 360: A primitive integral and elementary functions Mth 360: A primitive integrl nd elementry functions D. DeTurck University of Pennsylvni October 16, 2017 D. DeTurck Mth 360 001 2017C: Integrl/functions 1 / 32 Setup for the integrl prtitions Definition:

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

Entropy and Ergodic Theory Notes 10: Large Deviations I

Entropy and Ergodic Theory Notes 10: Large Deviations I Entropy nd Ergodic Theory Notes 10: Lrge Devitions I 1 A chnge of convention This is our first lecture on pplictions of entropy in probbility theory. In probbility theory, the convention is tht ll logrithms

More information

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties; Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

and that at t = 0 the object is at position 5. Find the position of the object at t = 2.

and that at t = 0 the object is at position 5. Find the position of the object at t = 2. 7.2 The Fundmentl Theorem of Clculus 49 re mny, mny problems tht pper much different on the surfce but tht turn out to be the sme s these problems, in the sense tht when we try to pproimte solutions we

More information

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction Lesson : Logrithmic Functions s Inverses Prerequisite Skills This lesson requires the use of the following skills: determining the dependent nd independent vribles in n exponentil function bsed on dt from

More information

Math 113 Fall Final Exam Review. 2. Applications of Integration Chapter 6 including sections and section 6.8

Math 113 Fall Final Exam Review. 2. Applications of Integration Chapter 6 including sections and section 6.8 Mth 3 Fll 0 The scope of the finl exm will include: Finl Exm Review. Integrls Chpter 5 including sections 5. 5.7, 5.0. Applictions of Integrtion Chpter 6 including sections 6. 6.5 nd section 6.8 3. Infinite

More information

P 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0)

P 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0) 1 Tylor polynomils In Section 3.5, we discussed how to pproximte function f(x) round point in terms of its first derivtive f (x) evluted t, tht is using the liner pproximtion f() + f ()(x ). We clled this

More information

Line and Surface Integrals: An Intuitive Understanding

Line and Surface Integrals: An Intuitive Understanding Line nd Surfce Integrls: An Intuitive Understnding Joseph Breen Introduction Multivrible clculus is ll bout bstrcting the ides of differentition nd integrtion from the fmilir single vrible cse to tht of

More information

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior Reversls of Signl-Posterior Monotonicity for Any Bounded Prior Christopher P. Chmbers Pul J. Hely Abstrct Pul Milgrom (The Bell Journl of Economics, 12(2): 380 391) showed tht if the strict monotone likelihood

More information

Non-Linear & Logistic Regression

Non-Linear & Logistic Regression Non-Liner & Logistic Regression If the sttistics re boring, then you've got the wrong numbers. Edwrd R. Tufte (Sttistics Professor, Yle University) Regression Anlyses When do we use these? PART 1: find

More information

2D1431 Machine Learning Lab 3: Reinforcement Learning

2D1431 Machine Learning Lab 3: Reinforcement Learning 2D1431 Mchine Lerning Lb 3: Reinforcement Lerning Frnk Hoffmnn modified by Örjn Ekeberg December 7, 2004 1 Introduction In this lb you will lern bout dynmic progrmming nd reinforcement lerning. It is ssumed

More information

DISCRETE MATHEMATICS HOMEWORK 3 SOLUTIONS

DISCRETE MATHEMATICS HOMEWORK 3 SOLUTIONS DISCRETE MATHEMATICS 21228 HOMEWORK 3 SOLUTIONS JC Due in clss Wednesdy September 17. You my collborte but must write up your solutions by yourself. Lte homework will not be ccepted. Homework must either

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information

Fig. 1. Open-Loop and Closed-Loop Systems with Plant Variations

Fig. 1. Open-Loop and Closed-Loop Systems with Plant Variations ME 3600 Control ystems Chrcteristics of Open-Loop nd Closed-Loop ystems Importnt Control ystem Chrcteristics o ensitivity of system response to prmetric vritions cn be reduced o rnsient nd stedy-stte responses

More information

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary Outline Genetic Progrmming Evolutionry strtegies Genetic progrmming Summry Bsed on the mteril provided y Professor Michel Negnevitsky Evolutionry Strtegies An pproch simulting nturl evolution ws proposed

More information

MORE FUNCTION GRAPHING; OPTIMIZATION. (Last edited October 28, 2013 at 11:09pm.)

MORE FUNCTION GRAPHING; OPTIMIZATION. (Last edited October 28, 2013 at 11:09pm.) MORE FUNCTION GRAPHING; OPTIMIZATION FRI, OCT 25, 203 (Lst edited October 28, 203 t :09pm.) Exercise. Let n be n rbitrry positive integer. Give n exmple of function with exctly n verticl symptotes. Give

More information

Bernoulli Numbers Jeff Morton

Bernoulli Numbers Jeff Morton Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f

More information

Stuff You Need to Know From Calculus

Stuff You Need to Know From Calculus Stuff You Need to Know From Clculus For the first time in the semester, the stuff we re doing is finlly going to look like clculus (with vector slnt, of course). This mens tht in order to succeed, you

More information

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson Convergence of Fourier Series nd Fejer s Theorem Lee Ricketson My, 006 Abstrct This pper will ddress the Fourier Series of functions with rbitrry period. We will derive forms of the Dirichlet nd Fejer

More information

ARITHMETIC OPERATIONS. The real numbers have the following properties: a b c ab ac

ARITHMETIC OPERATIONS. The real numbers have the following properties: a b c ab ac REVIEW OF ALGEBRA Here we review the bsic rules nd procedures of lgebr tht you need to know in order to be successful in clculus. ARITHMETIC OPERATIONS The rel numbers hve the following properties: b b

More information

4. GREEDY ALGORITHMS I

4. GREEDY ALGORITHMS I 4. GREEDY ALGORITHMS I coin chnging intervl scheduling scheduling to minimize lteness optiml cching Lecture slides by Kevin Wyne Copyright 2005 Person-Addison Wesley http://www.cs.princeton.edu/~wyne/kleinberg-trdos

More information

Probabilistic Investigation of Sensitivities of Advanced Test- Analysis Model Correlation Methods

Probabilistic Investigation of Sensitivities of Advanced Test- Analysis Model Correlation Methods Probbilistic Investigtion of Sensitivities of Advnced Test- Anlysis Model Correltion Methods Liz Bergmn, Mtthew S. Allen, nd Dniel C. Kmmer Dept. of Engineering Physics University of Wisconsin-Mdison Rndll

More information

Polynomial Approximations for the Natural Logarithm and Arctangent Functions. Math 230

Polynomial Approximations for the Natural Logarithm and Arctangent Functions. Math 230 Polynomil Approimtions for the Nturl Logrithm nd Arctngent Functions Mth 23 You recll from first semester clculus how one cn use the derivtive to find n eqution for the tngent line to function t given

More information

2.4 Linear Inequalities and Interval Notation

2.4 Linear Inequalities and Interval Notation .4 Liner Inequlities nd Intervl Nottion We wnt to solve equtions tht hve n inequlity symol insted of n equl sign. There re four inequlity symols tht we will look t: Less thn , Less thn or

More information