Optimal Workload-based Weighted Wavelet Synopses


Yossi Matias
School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel

Daniel Urieli
School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel

Abstract

In recent years wavelets were shown to be effective data synopses. We are concerned with the problem of efficiently finding wavelet synopses for massive data sets, in situations where information about the query workload is available. We present linear-time, I/O-optimal algorithms for building optimal workload-based wavelet synopses for point queries. The synopses are based on a novel construction of weighted inner products, and use weighted wavelets that are adapted to those products. The synopses are optimal in the sense that the subset of retained coefficients is the best possible for the bases in use, with respect to either the mean-squared absolute or relative error. For the latter, this is the first optimal wavelet synopsis even for the regular, non-workload-based case. Experimental results demonstrate the advantage obtained by the new optimal wavelet synopses, as well as the robustness of the synopses to deviations in the actual query workload.

1 Introduction

In recent years there has been increasing attention to the development and study of data synopses, as effective means for addressing performance issues in massive data sets. Data synopses are concise representations of data sets, meant to effectively support approximate queries to the represented data sets [10]. A primary constraint of a data synopsis is its size. The effectiveness of a data synopsis is measured by the accuracy of the answers it provides, as well as by its response time and its construction time. Several different synopses were introduced and studied, including random samples, sketches, and different types of histograms.
Recently, wavelet-based synopses were introduced and shown to be a powerful tool for building effective data synopses for various applications, including selectivity estimation for query optimization in DBMS, approximate query processing in OLAP applications, and more (see [16, 20, 21, 2, 6, 9, 8] and references therein). The general idea of wavelet-based approximations is to transform a given data vector of size N into a representation with respect to a wavelet basis (this is called a wavelet transform), and approximate it using only M wavelet basis vectors, by retaining only M coefficients from the linear combination that spans the data vector (coefficient thresholding). The linear combination

(Research partly supported by a grant from the Israel Science Foundation. Contact author.)

that uses only M coefficients (and assumes that all other coefficients are zero) defines a new vector that approximates the original vector, using less space. This is called an M-term approximation, which defines a wavelet synopsis of size M.

Wavelet synopses. Wavelets were traditionally used to compress some data set, where the purpose is to reconstruct, at a later time, an approximation of the whole data using the set of retained coefficients. The situation is a little different when using wavelets for building synopses in database systems [16]: in this case only portions of the data are reconstructed each time, in response to user queries, rather than the whole data at once. As a result, portions of the data that are used for answering frequent queries are reconstructed more frequently than portions of the data that correspond to rare queries. Therefore, the approximation error is measured over the multi-set of actual queries, rather than over the data itself. Another aspect of the use of wavelets in database systems is that due to the large data sizes in databases (giga-, tera- and peta-bytes), the efficiency of building wavelet synopses is of primary importance. Disk I/Os should be minimized as much as possible, and non-linear-time algorithms may be unacceptable.

Optimal wavelet synopses. The main advantage of transforming the data into a representation with respect to a wavelet basis is that for data vectors containing similar values, many wavelet coefficients tend to have very small values. Thus, eliminating such small coefficients introduces only small errors when reconstructing the original data, resulting in a very effective form of lossy data compression. Generally speaking, we can characterize a wavelet approximation by three attributes: how the approximation error is measured, what wavelet basis is used, and how coefficient thresholding is done. Many bases were suggested and used in the traditional wavelets literature.
Given a basis with respect to which the transform is done, the selection of coefficients that are retained in the wavelet synopsis may have a significant impact on the approximation error. The goal is therefore to select a subset of M coefficients that minimizes some approximation-error measure. This subset is called an optimal wavelet synopsis, with respect to the chosen error measure. While there has been considerable work on wavelet synopses and their applications [16, 20, 21, 2, 6, 9, 8], so far there were only a few optimality results. The first one is a linear-time Parseval-based algorithm, which was used in the traditional wavelets literature (e.g. [12]), where the error was measured over the data. This algorithm minimizes the L2 norm of the error vector, and equivalently it minimizes the mean-squared absolute error over all possible point queries. No algorithm that minimizes the mean-squared relative error over all possible point queries was known. The second one, introduced recently [9], is a polynomial-time (O(N^2 M log M)) algorithm that minimizes the max relative or absolute error over all possible point queries. Another optimality result is a polynomial-time dynamic-programming algorithm that obtains an optimal wavelet synopsis over multiple measures [6]. The synopsis is optimal w.r.t. an error metric defined as a weighted combination of L2 norms over the multiple measures (this weighted combination has no relation to the notion of weighted wavelets of this paper).

Workload-based wavelet synopses. In recent years there is increased interest in workload-based synopses: synopses that are adapted to a given query workload, with the assumption that the workload represents (approximately) a probability distribution from which future queries will be taken. Chaudhuri et al. [4] argue that identifying an appropriate precomputed sample that avoids large errors on an arbitrary query is virtually impossible. To minimize the effects of this problem,

previous studies have proposed using the workload to guide the process of selecting samples [1, 3, 7]. By picking a sample that is tuned to the given workload, we can reduce the error over frequent (or otherwise important) queries in the workload. In [4], the authors formulate the problem of pre-computing a sample as an optimization problem, whose goal is to pick a sample that minimizes the error for the given workload. Recently, workload-based wavelet synopses were proposed [14, 18]. Using an adaptive-greedy algorithm, the query-workload information was used during the thresholding process in order to build a wavelet synopsis that decreases the error w.r.t. the query workload. While these workload-based wavelet synopses demonstrate significant improvement with respect to prior synopses, they are not optimal. In this paper, we address the problem of efficiently finding optimal workload-based wavelet synopses.

1.1 Contributions

We introduce efficient algorithms for finding optimal workload-based wavelet synopses using weighted Haar (WH) wavelets, for workloads of point queries. Our main contributions are:

- Linear-time, I/O-optimal algorithms that find optimal Workload-based Weighted Wavelet (WWW) synopses (1):
  - An optimal synopsis w.r.t. the workload-based mean-squared absolute error (WB-MSE).
  - An optimal synopsis w.r.t. the workload-based mean-squared relative error (WB-MRE).
  Equivalently, the algorithms minimize the expected squared (absolute or relative) error over a point query taken from a given distribution.
- The WB-MRE algorithm, used with a uniform workload, is also the first algorithm that minimizes the mean-squared relative error over the data values, with respect to a wavelet basis.
- Both WWW synopses are also optimal with respect to enhanced wavelet synopses, which allow changing the values of the synopses' coefficients to arbitrary values.
- Experimental results show the advantage of our synopses with respect to existing synopses. The synopses are robust to deviations from the pre-defined workload, as demonstrated by our experiments.
The above results were obtained using the following novel techniques. We define the problem of finding optimal workload-based wavelet synopses in terms of a weighted norm, a weighted inner product, and a weighted-inner-product space. This enables linear-time, I/O-optimal algorithms for building optimal workload-based wavelet synopses. The approach of using a weighted inner product can also be applied to the general case in which each data point is given a different priority, representing its significance (an example is shown in Sec. 6). Using these weights, one can find a weighted-wavelet basis, and an optimal weighted wavelet synopsis, in linear time, with O(N/B) I/Os.

(1) No relation whatsoever to the world-wide-web.

We introduce the use of weighted wavelets for data synopses. Using weighted wavelets [5, 11] enables finding optimal workload-based wavelet synopses efficiently. In contrast, it is not known how to efficiently obtain optimal workload-based wavelet synopses with respect to the Haar basis. If we ignore the efficiency of finding a synopsis, the Haar basis is as good as the weighted Haar basis for approximation.

In the wavelets literature (e.g. [12]), wavelets are used to approximate a given signal, which is treated as a vector in an inner-product space. Since an inner product defines an L2 norm, the approximation error is measured as the L2 norm of the error vector, which is the difference between the approximated vector and the approximating vector. Many wavelet bases were used for approximation, as different bases are adequate for approximating different collections of data vectors. By using an orthonormal wavelet basis, an optimal coefficient thresholding can be achieved in linear time, based on Parseval's formula. When using a non-orthogonal wavelet basis, or measuring the error using other norms (e.g. L∞), it is not known whether an optimal coefficient thresholding can be found efficiently, so usually non-optimal greedy algorithms are used in practice.

A WH basis is a generalization of the standard Haar basis, which is typically used for wavelet synopses due to its simplicity. There are several attributes by which a wavelet basis is characterized, which affect the quality of the approximations achieved using this basis (for a full discussion, see [12]). These attributes are: the set of nested spaces of increasing resolution which the basis spans, the number of vanishing moments of the basis, and its compact support (if it exists). Both the Haar basis and a WH basis span the same subsets of nested spaces, have one vanishing moment, and a compact support of size 1. The Haar basis is orthonormal for a uniform workload of point queries; hence it is optimal for the MSE error measure.
The WH basis is orthonormal with respect to the weighted inner product defined by the problem of finding optimal workload-based wavelet synopses. As a result, an optimal workload-based synopsis with respect to a WH basis is achieved efficiently, based on Parseval's formula, while for the Haar basis no efficient optimal thresholding algorithm is known, in cases other than a uniform workload.

1.2 Paper outline

The rest of the paper is structured as follows. In Sec. 2 we describe the basics of wavelet-based synopses. In Sec. 3 we describe the basic ideas we rely on in our development, including the workload-based error metrics and optimal thresholding in orthonormal bases. In Sec. 4 we define the problem of finding optimal workload-based wavelet synopses in terms of a weighted inner product, and solve it using an orthonormal basis. In Sec. 5 we describe the optimal algorithm for minimizing the WB-MSE, which is based on the construction of Sec. 4. In Sec. 6 we extend the algorithm to work for the WB-MRE. In Sec. 7 we present experimental results, and in Sec. 8 we draw our conclusions.

2 Wavelets basics

In this section we start by presenting Haar wavelets, and continue by presenting wavelet-based synopses, obtained by a thresholding process, described in Sec. 2.2. The error tree structure is presented next (Sec. 2.3), along with the description of the reconstruction of original data from the wavelet synopses in Sec. 2.4. Wavelets are a mathematical tool for the hierarchical decomposition of functions in a space-efficient manner. Wavelets represent a function in terms of a coarse overall shape, plus details that

range from coarse to fine. Regardless of whether the function of interest is an image, a curve, or a surface, wavelets offer an elegant technique for representing the various levels of detail of the function in a space-efficient manner.

2.1 One-dimensional Haar wavelets

Haar wavelets are conceptually the simplest wavelet basis functions, and were thus used in previous works on wavelet synopses. They are the fastest to compute and the easiest to implement. We focus on them for purposes of exposition in this paper. To illustrate how Haar wavelets work, we start with a simple example borrowed from [16]. Suppose we have a one-dimensional signal of N = 8 data items: S = [2, 2, 0, 2, 3, 5, 4, 4]. We show how the Haar wavelet transform is done over S. We first average the signal values, pairwise, to get a new lower-resolution signal with values [2, 1, 4, 4]. That is, the first two values in the original signal (2 and 2) average to 2, the second two values (0 and 2) average to 1, and so on. We also store the pairwise differences of the original values (divided by 2) as detail coefficients. In the above example, the four detail coefficients are (2 - 2)/2 = 0, (0 - 2)/2 = -1, (3 - 5)/2 = -1, and (4 - 4)/2 = 0. It is easy to see that the original values can be recovered from the averages and differences. This was one phase of the Haar wavelet transform. By repeating this process recursively on the averages, we get the full Haar wavelet transform (Table 1). We define the wavelet transform (also called the wavelet decomposition) of the original eight-value signal to be the single coefficient representing the overall average of the original signal, followed by the detail coefficients in order of increasing resolution. Thus, for the one-dimensional Haar basis, the wavelet transform of our signal is given by

S' = [2 3/4, -1 1/4, 1/2, 0, 0, -1, -1, 0]

Resolution  Averages                  Detail Coefficients
8           [2, 2, 0, 2, 3, 5, 4, 4]
4           [2, 1, 4, 4]              [0, -1, -1, 0]
2           [1.5, 4]                  [0.5, 0]
1           [2.75]                    [-1.25]

Table 1: Haar Wavelet Decomposition

The individual entries are called the wavelet coefficients.
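The per-level averaging and differencing described above can be sketched as follows (a minimal Python illustration; the function name is ours, not from the paper):

```python
def haar_transform(signal):
    """One-dimensional Haar wavelet decomposition: repeatedly replace the
    signal by pairwise averages, storing pairwise differences (divided by 2)
    as detail coefficients, from finest to coarsest resolution."""
    s = list(signal)
    details = []
    while len(s) > 1:
        avgs = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        dets = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = dets + details  # coarser details precede finer ones
        s = avgs
    # [overall average] followed by details in order of increasing resolution
    return s + details

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

This reproduces the rows of Table 1: the overall average 2.75, then the detail coefficients -1.25, [0.5, 0], and [0, -1, -1, 0].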
The wavelet decomposition is very efficient computationally, requiring only O(N) CPU time and O(N/B) I/Os to compute for a signal of N values, where B is the disk-block size. No information has been gained or lost by this process. The original signal has eight values, and so does the transform. Given the transform, we can reconstruct the exact signal by recursively adding and subtracting the detail coefficients from the next-lower resolution. In fact, we have transformed the signal S into a representation with respect to another basis of R^8: the Haar wavelet basis. A detailed discussion can be found, for example, in [19].

2.2 Thresholding

Given a limited amount of storage for maintaining a wavelet synopsis of a data array A (or equivalently a vector S), we can only retain a certain number M of the coefficients stored in

the wavelet decomposition of A. The remaining coefficients are implicitly set to 0. The goal of coefficient thresholding is to determine the best subset of M coefficients to retain, so that some overall error measure in the approximation is minimized. One advantage of the wavelet transform is that in many cases a large number of the detail coefficients turn out to be very small in magnitude. Truncating these small coefficients from the representation (i.e., replacing each one by 0) introduces only small errors in the reconstructed signal. We can approximate the original signal effectively by keeping only the most significant coefficients.

For a given input sequence d_0, ..., d_{N-1}, we can measure the error of approximation in several ways. Let the i-th data value be d_i. Let q_i be the i-th point query, whose value is d_i. Let d̂_i be the estimated result of d_i. We use the following error measure for the absolute error over the i-th data value:

e_i = e(q_i) = |d_i - d̂_i|

Once we have the error measure for representing the errors of individual data values, we would like to measure the norm of the vector of errors e = (e_0, ..., e_{N-1}). The standard way is to use the squared L2 norm of e divided by N, which is called the mean squared error:

MSE(e) = (1/N) Σ_{i=0}^{N-1} e_i^2

We use the terms MSE and L2 norm interchangeably during our development, since they are equivalent up to a positive multiplicative constant.

The basic thresholding algorithm, based on Parseval's formula, is as follows: let α_0, ..., α_{N-1} be the wavelet coefficients, and for each α_i let level(α_i) be the resolution level of α_i. The detail coefficients are normalized by dividing each coefficient by sqrt(2^{level(α_i)}), reflecting the fact that coefficients at the lower resolutions are less important than the coefficients at the higher resolutions. This process actually turns the wavelet coefficients into the coefficients of an orthonormal basis (and is thus called normalization). The M largest normalized coefficients are retained. The remaining N - M coefficients are implicitly replaced by zero.
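A sketch of this normalize-and-keep-M-largest rule, using the flat transform array of Sec. 2.1 (helper names and the tie-breaking order are ours; only the ranking matters, so the common 1/sqrt(N) factor is omitted):

```python
import math

def threshold(coeffs, M):
    """Parseval-based thresholding: rank coefficients by absolute value
    after dividing by sqrt(2^level), keep the M largest, zero the rest.
    In the flat array, the root average and the coarsest detail are at
    level 0; the detail at index i >= 1 is at level floor(log2(i))."""
    n = len(coeffs)

    def norm_weight(i):
        level = 0 if i == 0 else int(math.log2(i))
        return 1 / math.sqrt(2 ** level)

    order = sorted(range(n), key=lambda i: abs(coeffs[i]) * norm_weight(i),
                   reverse=True)
    keep = set(order[:M])
    return [c if i in keep else 0 for i, c in enumerate(coeffs)]

print(threshold([2.75, -1.25, 0.5, 0, 0, -1, -1, 0], 3))
# [2.75, -1.25, 0, 0, 0, -1, 0, 0]
```

For M = 3 on the example signal, the retained coefficients are the overall average, the coarsest detail, and one of the two finest details of magnitude 1 (whose normalized values, 0.5, exceed the normalized 0.354 of the coefficient 0.5).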
This deterministic process provably minimizes the L2 norm of the vector of errors defined above, based on Parseval's formula (see Sec. 3).

2.3 Error tree

The wavelet decomposition procedure followed by any thresholding can be represented by an error tree [16]. Fig. 1 presents the error tree for the above example. Each internal node of the error tree is associated with a wavelet coefficient, and each leaf is associated with an original signal value. Internal nodes and leaves are labelled separately by 0, 1, ..., N - 1. For example, the root is an internal node with label 0, and its node value is 2.75 in Fig. 1. For convenience, we shall use "node" and "node value" interchangeably. The construction of the error tree exactly mirrors the wavelet transform procedure. It is a bottom-up process. First, leaves are assigned original signal values from left to right. Then wavelet coefficients are computed, level by level, and assigned to internal nodes.

2.4 Reconstruction of original data

Given an error tree T and an internal node t of T, t ≠ α_0, we let leftleaves(t) (rightleaves(t)) denote the set of leaf (i.e., data) nodes in the subtree rooted at t's left (resp., right) child. Also, given any (internal or leaf) node u, we let path(u) be the set of all (internal) nodes in T that are

Figure 1: Error tree for N = 8

proper ancestors of u (i.e., the nodes on the path from u to the root of T, including the root but not u) with nonzero coefficients. Finally, for any two leaf nodes d_l and d_h, we denote by d(l : h) the range sum Σ_{i=l}^{h} d_i.

Using the error tree representation T, we can outline the following reconstruction properties of the Haar wavelet decomposition [16]:

Single value. The reconstruction of any data value d_i depends only on the values of the nodes in path(d_i):

d_i = Σ_{α_j ∈ path(d_i)} δ_ij α_j

where δ_ij = +1 if d_i ∈ leftleaves(α_j) or j = 0, and δ_ij = -1 otherwise.

Range sum. An internal node α_j contributes to the range sum d(l : h) only if α_j ∈ path(d_l) ∪ path(d_h):

d(l : h) = Σ_{α_j ∈ path(d_l) ∪ path(d_h)} x_j

where

x_j = (h - l + 1) α_j                                                  if j = 0
x_j = ( |leftleaves(α_j, l : h)| - |rightleaves(α_j, l : h)| ) α_j     otherwise

and where leftleaves(α_j, l : h) = leftleaves(α_j) ∩ {d_l, d_{l+1}, ..., d_h} (i.e., the intersection of leftleaves(α_j) with the summation range), and rightleaves(α_j, l : h) is defined similarly. Thus, the reconstruction of a single data value involves the summation of at most log N + 1 coefficients, and the reconstruction of a range sum involves the summation of at most 2 log N + 1 coefficients, regardless of the width of the range.

3 The basics of our development

3.1 Workload-based error metrics

Let D = (d_0, ..., d_{N-1}) be a sequence with N = 2^j values. Denote the set of point queries as Q = (q_0, ..., q_{N-1}), where q_i is a query whose answer is d_i. Let a workload W = (c_0, ..., c_{N-1}) be a vector of weights that represents the probability distribution from which future point queries are to be generated. Let (u_0, ..., u_{N-1}) be a basis of R^N; then D = Σ_i α_i u_i. We can represent D by a vector of coefficients (α_0, ..., α_{N-1}). Suppose we want to approximate D using a subset of the coefficients S ⊆ {α_0, ..., α_{N-1}}, where |S| = M. Then, for any subset S we can define a weighted norm WL2 with respect to S, that provides a measure of the errors expected for queries drawn from the probability distribution represented by W, when using S as a synopsis. S is then referred to as a workload-based wavelet synopsis. Denote by d̂_i an approximation of d_i using S. There are two standard ways to measure the error over the i-th data value (equivalently, point query): the absolute error, e_a(i) = e_a(q_i) = |d_i - d̂_i|; and the relative error, e_r(i) = e_r(q_i) = |d_i - d̂_i| / max{|d_i|, s}, where s is a positive bound that prevents small values from dominating the relative error. While the general (non-workload-based) approach is to reduce the L2 norm of the vector of errors (e_1, ..., e_N) (where e_i = e_a(i) or e_i = e_r(i)), here we generalize the L2 norm to reflect the query workload.
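The single-value reconstruction property of Sec. 2.4 can be sketched as a walk down the error tree stored in the flat transform array (our own indexing helper, assuming the layout of Sec. 2.1, where internal node j has children 2j and 2j + 1):

```python
def reconstruct(coeffs, i):
    """Reconstruct data value d_i from the Haar transform array by adding
    (left subtree) or subtracting (right subtree) the at most log N + 1
    coefficients on path(d_i)."""
    n = len(coeffs)
    value = coeffs[0]      # overall average: delta = +1 for every leaf
    j = 1                  # index of the coarsest detail coefficient
    lo, hi = 0, n - 1      # leaf range covered by the current node
    while j < n:
        mid = (lo + hi) // 2
        if i <= mid:       # d_i lies under the left child: add
            value += coeffs[j]
            j, hi = 2 * j, mid
        else:              # d_i lies under the right child: subtract
            value -= coeffs[j]
            j, lo = 2 * j + 1, mid + 1
    return value

transform = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
print([reconstruct(transform, i) for i in range(8)])
# [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
```

Applied to the full transform of Sec. 2.1, this recovers the original signal S exactly; applied to a thresholded array, it yields the approximate answers d̂_i.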
Given a workload W that consists of all the queries' probabilities c_1, ..., c_N (where c_i is the probability that q_i appears), the weighted L2 norm of the vector of (absolute or relative) errors e = (e_1, ..., e_N) is:

WL2(e) = ||e||_w = sqrt( Σ_i c_i e_i^2 ),   where 0 < c_i ≤ 1 and Σ_i c_i = 1

The intuition behind this definition of the norm is to give each data value d_i (or equivalently each point query q_i) some weight that represents its significance. In the above case, the square of the WL2 norm is the expected squared error for a point query that is drawn from the given distribution. In other words, minimizing this norm of the error minimizes the expected squared error of an answer to a query. In general, the weights given to data values need not necessarily represent a probability distribution of point queries, but can be any other significance measure. For example, in Sec. 6 we use weights to solve the problem of minimizing the mean-squared relative error measured over the data values (the non-workload-based case). Notice that this is a generalization of the MSE norm: by taking equal weights for each query, meaning c_i = 1/N for each i, and e_i = e_a(i), we get the standard MSE norm. We use the term workload-based error for the WL2 norm of the vector of errors e. When the e_i are absolute (resp. relative) errors, the workload-based error is called the WB-MSE (resp. WB-MRE).
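As a small illustration of how the workload shifts the norm (the error and weight vectors below are made up for illustration):

```python
import math

def wl2(errors, weights):
    """Weighted L2 norm: sqrt(sum of c_i * e_i^2). With uniform weights
    c_i = 1/N this is the square root of the standard MSE."""
    assert abs(sum(weights) - 1.0) < 1e-9  # the workload is a distribution
    return math.sqrt(sum(c * e * e for c, e in zip(weights, errors)))

errors = [1.0, 0.0, 2.0, 1.0]
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]   # query q_0 dominates the workload
print(wl2(errors, uniform), wl2(errors, skewed))
```

Under the skewed workload the large error e_2 = 2 is discounted, so the workload-based error is smaller than under the uniform workload, even though the error vector is unchanged.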

3.2 Optimal thresholding in orthonormal bases

The construction is based on Parseval's formula, and a known theorem that results from it (Thm. 1).

Parseval's formula. Let V be a vector space, where v ∈ V is a vector and {u_0, ..., u_{N-1}} is an orthonormal basis of V. We can express v as v = Σ_{i=0}^{N-1} α_i u_i. Then

||v||^2 = Σ_{i=0}^{N-1} α_i^2    (1)

An M-term approximation is achieved by representing v using a subset of coefficients S ⊆ {α_0, ..., α_{N-1}}, where |S| = M. The error vector is then e = Σ_{i ∉ S} α_i u_i. By Parseval's formula, ||e||^2 = Σ_{i ∉ S} α_i^2. This proves the following theorem.

Theorem 1 (Parseval-based optimal thresholding) Let V be a vector space, where v ∈ V is a vector and {u_0, ..., u_{N-1}} is an orthonormal basis of V. We can represent v by {α_0, ..., α_{N-1}}, where v = Σ_{i=0}^{N-1} α_i u_i. Suppose we want to approximate v using a subset S ⊆ {α_0, ..., α_{N-1}}, where |S| = M. Picking the M largest coefficients (in absolute value) for S minimizes the L2 norm of the error vector, over all possible subsets of M coefficients.

Given an inner product, based on this theorem one can easily find an optimal synopsis by choosing the M largest coefficients.

3.3 Optimality over enhanced wavelet synopses

Notice that in the previous section we limited ourselves to picking subsets of coefficients with their original values from the linear combination that spans v (as is usually done). In case {u_0, ..., u_{N-1}} is a wavelet basis, these are the coefficients that result from the wavelet transform. We next show that the optimal thresholding according to Thm. 1 is optimal even according to an enhanced definition of M-term approximation. We define enhanced wavelet synopses as wavelet synopses that allow arbitrary values for the retained wavelet coefficients, rather than the original values that resulted from the transform. The set of possible standard synopses is a subset of the set of possible enhanced synopses, and therefore an optimal synopsis according to the standard definition is not necessarily optimal according to the enhanced definition.

Theorem 2 When using an orthonormal basis, choosing the M largest coefficients with their original values is an optimal enhanced wavelet synopsis.
Proof: The proof is based on the fact that the basis is orthonormal. It is enough to show that given some synopsis of M coefficients with original values, any change to the values of some subset of coefficients in the synopsis would only make the approximation error larger. Let u_1, ..., u_N be an orthonormal basis and let v = α_1 u_1 + ... + α_N u_N be the vector we would like to approximate by keeping only M wavelet coefficients. Without loss of generality, suppose we choose the first M coefficients and have the following approximation for v: ṽ = Σ_{i=1}^{M} α_i u_i. According to Parseval's formula, ||e||^2 = Σ_{i=M+1}^{N} α_i^2, since the basis is orthonormal. Now suppose we change the values of some subset of j retained coefficients to new values. Let us see that due to the orthonormality of the basis this would only make the error larger. Without loss of generality, we

would change the first j coefficients, meaning we change α_1, ..., α_j to α'_1, ..., α'_j. In this case the approximation would be ṽ = Σ_{i=1}^{j} α'_i u_i + Σ_{i=j+1}^{M} α_i u_i. The approximation error would be v - ṽ = Σ_{i=1}^{j} (α_i - α'_i) u_i + Σ_{i=M+1}^{N} α_i u_i. It is easy to see that the error of the approximation would be:

||e||^2 = ⟨v - ṽ, v - ṽ⟩ = Σ_{i=1}^{j} (α_i - α'_i)^2 + Σ_{i=M+1}^{N} α_i^2 > Σ_{i=M+1}^{N} α_i^2

4 The workload-based inner product

In this section, we define the problem of finding an optimal workload-based synopsis in terms of a weighted-inner-product space, and solve it relying on this construction. Here we deal with the case where the e_i are absolute errors (the algorithm minimizes the WB-MSE). An extension to relative errors (WB-MRE) is introduced in Sec. 6. Our development is as follows:

1. Transforming the data vector D into an equivalent representation as a function f in a space of piecewise constant functions over [0, 1). (Sec. 4.1)
2. Defining the workload-based inner product. (Sec. 4.2)
3. Using the inner product to define an L2 norm, and showing that the newly defined norm is equivalent to the weighted L2 norm (WL2). (Sec. 4.3)
4. Defining a weighted Haar basis which is orthonormal with respect to the new inner product. (Sec. 4.4)

Based on Thm. 1 and Thm. 2, one can then easily find an optimal workload-based wavelet synopsis with respect to a weighted Haar wavelet basis.

4.1 Transforming the data vector into a piecewise constant function

We assume that our approximated data vector D is of size N = 2^j. As in [19], we treat sequences (vectors) of 2^j points as piecewise constant functions defined on the half-open interval [0, 1). In order to do so, we use the concept of a vector space from linear algebra. A sequence of one point is just a function that is constant over the entire interval [0, 1); we let V_0 be the space of all these functions. A sequence of two points is a function that has two constant parts over the intervals [0, 1/2) and [1/2, 1). We call the space containing all these functions V_1.
If we continue in this manner, the space V_j will include all piecewise constant functions on the interval [0, 1), with the interval divided equally into 2^j different sub-intervals. We can now think of every one-dimensional sequence D of 2^j values as being an element, or vector f, in V_j.

4.2 Defining a workload-based inner product

The first step is to choose an inner product defined on the vector space V_j. Since we want to minimize a workload-based error (and not the regular L2 error), we start by defining a new workload-based inner product. The new inner product is a generalization of the standard inner product. It is a sum of N = 2^j weighted standard products, each of them defined over an interval of size 1/N (the factor N below normalizes the length 1/N of each sub-interval):

⟨f, g⟩ = Σ_{i=0}^{N-1} N c_i ∫_{I_i} f(x) g(x) dx,   where I_i = [i/N, (i+1)/N), 0 < c_i ≤ 1 and Σ_i c_i = 1    (2)
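Because f and g are constant on each of the N sub-intervals, the integral in (2) reduces to a weighted dot product of the two value vectors; a minimal sketch (the helper name is ours):

```python
def weighted_inner(f, g, c):
    """Workload-based inner product of two piecewise-constant functions on
    [0,1), given by their values on the N equal sub-intervals. Interval i
    contributes N * c_i * (f_i * g_i / N) = c_i * f_i * g_i."""
    return sum(ci * fi * gi for ci, fi, gi in zip(c, f, g))

c = [0.5, 0.25, 0.125, 0.125]           # point-query probabilities (made up)
f = [2.0, 2.0, 0.0, 2.0]                # data vector viewed as f in V_j
print(weighted_inner(f, f, c))          # squared weighted norm of f -> 3.5
print(weighted_inner(f, [1.0] * 4, c))  # weighted average of f -> 1.75
```

With uniform weights c_i = 1/N this reduces to (1/N) times the standard dot product, i.e., the standard inner product on V_j.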

Lemma 1 ⟨f, g⟩ is an inner product.

Proof: Let us check that ⟨·, ·⟩ : V_j × V_j → R satisfies the conditions of an inner product.

Symmetry:

⟨f, g⟩ = Σ_{i=0}^{N-1} N c_i ∫_{I_i} f(x) g(x) dx = Σ_{i=0}^{N-1} N c_i ∫_{I_i} g(x) f(x) dx = ⟨g, f⟩

Bilinearity:

⟨a f_1 + b f_2, g⟩ = Σ_{i=0}^{N-1} N c_i ∫_{I_i} (a f_1 + b f_2)(x) g(x) dx
= a Σ_{i=0}^{N-1} N c_i ∫_{I_i} f_1(x) g(x) dx + b Σ_{i=0}^{N-1} N c_i ∫_{I_i} f_2(x) g(x) dx
= a ⟨f_1, g⟩ + b ⟨f_2, g⟩

and, with a similar proof, ⟨f, a g_1 + b g_2⟩ = a ⟨f, g_1⟩ + b ⟨f, g_2⟩.

Positive definiteness:

⟨f, f⟩ = Σ_{i=0}^{N-1} N c_i ∫_{I_i} f^2(x) dx ≥ 0

and ⟨f, f⟩ = 0 iff f ≡ 0, since c_i > 0 for each i.

As mentioned before, a coefficient c_i represents the probability (or a weight) of the i-th point query q_i, whose answer is the i-th data value, which is the function value on the i-th interval. When all coefficients c_i are equal to 1/N (a uniform distribution of queries), we get the standard inner product, and therefore this is a generalization of the standard inner product.

4.3 Defining a norm based on the inner product

Based on that inner product we define an inner-product-based (IPB) norm:

||f||_IPB = sqrt(⟨f, f⟩)    (3)

Lemma 2 The norm ||·||_IPB measured over the vector of absolute errors is the weighted L2 norm of this vector, i.e., ||e||^2_IPB = Σ_i c_i e_i^2 = ||e||^2_w.

Proof: Let f ∈ V_j be a function and let f̃ ∈ V_j be a function that approximates f. Let the error function be e = f - f̃ ∈ V_j. Then the squared norm of the error function is:

||e||^2_IPB = ⟨e, e⟩ = Σ_{i=0}^{N-1} N c_i ∫_{I_i} e^2(x) dx = Σ_{i=0}^{N-1} N c_i ∫_{I_i} (f - f̃)^2(x) dx = Σ_{i=0}^{N-1} c_i e_i^2

where e_i is the error on the i-th function value (the last equality holds because e is constant, with value e_i, over each interval I_i of length 1/N). This is exactly the square of the previously defined weighted L2 norm. Notice that when all coefficients are equal to 1/N we get the regular L2 norm (MSE), and therefore this is a generalization of the regular L2 norm. Our purpose is to minimize the workload-based error, which is the WL2 norm of the vector of errors.

4.4 Defining an orthonormal basis

At this stage we would like to use Thm. 1. The next step is thus finding an orthonormal (with respect to the workload-based inner product) wavelet basis for the space V_j. The basis is a Weighted Haar basis. For each workload-based inner product (defined by a given query workload) there is a corresponding orthonormal weighted Haar basis, and our algorithm finds this basis in linear time, given the workload of point queries. We describe the bases here, and see how to find a basis based on a given workload of point queries. We will later use this information in the algorithmic part.

In order to build a weighted Haar basis, we take the Haar basis functions, and for the k-th basis function we multiply its positive (resp. negative) part by some x_k (resp. y_k). We would like to choose x_k and y_k so that we get an orthonormal basis with respect to our inner product. Let us illustrate this with drawings: instead of using Haar basis functions (Fig. 2), we use functions of the kind illustrated in Fig. 3, where x_k and y_k are not necessarily (and probably not) equal, so our basis looks like the one in Fig. 4.
How do we choose x_k and y_k? Let u_k be some Haar basis function as described above. Let [a_{k0}, a_{k1}) be the interval over which the basis function is positive, and let [a_{k1}, a_{k2}) be the interval over which the function is negative. Recall that a_{k0}, a_{k1} and a_{k2} are all multiples of 1/N, and therefore each interval precisely contains some number of consecutive intervals of the form [i/N, (i+1)/N) (also a_{k1} = (a_{k0} + a_{k2})/2). Moreover, the size of the interval over which the function is positive (resp. negative) is 1/2^i for some i ≤ j (as we recall, N = 2^j). Recall that for the i-th interval of size 1/N, namely [i/N, (i+1)/N), there is a

Figure 2: An example of a Haar basis function

corresponding weight coefficient c_i, which is the coefficient used in the inner product. Notice that each Haar basis function is positive (negative) over some number of whole such intervals. We can therefore associate the sum of the coefficients of the intervals under the positive (negative) part of the function with the positive (negative) part of the function. Let us denote the sum of weight coefficients (c_i's) corresponding to intervals under the positive (resp. negative) part as l_k (resp. r_k).

Lemma 3 Suppose for each Haar basis function v_k we choose x_k and y_k such that

x_k = sqrt( r_k / (l_k r_k + l_k^2) )    and    y_k = sqrt( l_k / (l_k r_k + r_k^2) )

and multiply the positive (resp. negative) part of v_k by x_k (resp. y_k); by doing that we get an orthonormal set of N = 2^j functions, meaning we get an orthonormal basis.

Proof: We first show that when taking x_k and y_k such that x_k l_k = y_k r_k, the basis is orthogonal. It is enough to show that the inner product of any v_k with a constant function is 0. To see why that suffices: let u and v be some two Haar basis functions, and let I_u and I_v be the intervals over which u and v are different from zero, respectively. If there is some point (interval) over which both functions are different from zero, then by the Haar basis definition we get either I_u ⊆ I_v or I_v ⊆ I_u. Suppose I_v ⊆ I_u; then I_v is contained only in the negative part of I_u, or only in the positive part of I_u, again by the Haar basis definition. Consequently, when taking the inner product of u and v, there are two possible cases: either there is no point where both functions are different from zero, or the non-zero interval of one function is completely contained in a constant part of the other function. Obviously this holds for our weighted Haar basis as well. Now, let us verify that the inner product of some v_k with a constant function f(x) = m is zero:

⟨v_k, f⟩ = Σ_{i=0}^{N-1} N c_i ∫_{I_i} v_k(x) f(x) dx
= m Σ_{i : v_k(I_i) > 0} N c_i ∫_{I_i} v_k(x) dx + m Σ_{i : v_k(I_i) < 0} N c_i ∫_{I_i} v_k(x) dx

14 Fgure 3: An example for a Weghted Haar Bass functon Fgure 4: the weghted Haar Bass along wth the workload coeffcents, each coeffcent under ts correspondng nterval. For each level, the functons of the level are dfferent from zero over ntervals of equal sze. m xk m { v k( )>0} { v k( )>0} c m yk c x k m { v k( )<0} { v k( )<0} c y k = c = m (x k l k y k r k ) = 0 ow, n order to get an orthonormal bass all we have to do s to normalze those bass functons. 14

Let us compute the norm of some v_k whose positive part is set to x_k and whose negative part is set to y_k. Again, since v_k is constant on each interval, each interval contributes c_i times the squared value on it:

    <v_k, v_k> = sum_{i : v_k(i) > 0} c_i x_k^2 + sum_{i : v_k(i) < 0} c_i y_k^2 = x_k^2 l_k + y_k^2 r_k.

From the orthogonality condition we take y_k = x_k l_k / r_k. Requiring a unit norm,

    x_k^2 l_k + y_k^2 r_k = 1   =>   x_k^2 l_k + x_k^2 l_k^2 / r_k = 1   =>   x_k^2 (l_k + l_k^2 / r_k) = 1.

So we take:

    x_k = 1 / sqrt(l_k + l_k^2 / r_k) = sqrt( r_k / (l_k r_k + l_k^2) ),        y_k = sqrt( l_k / (l_k r_k + r_k^2) ).

There is a special case, which is the constant basis function (representing the total weighted average), v_0(x) = const. We would like the norm of this function to be 1 as well. We just put x_k = y_k in the equation x_k^2 l_k + y_k^2 r_k = 1 and get v_0(x) = x_k = y_k = 1 / sqrt(l_k + r_k) = const. Notice that had all the workload coefficients been equal (c_i = 1/N), we would get the standard Haar basis used to minimize the standard L2 norm.

As we have seen, this is an orthonormal basis of our function space. In order to see that it is a wavelet basis, notice that for each k = 1, ..., j, the first 2^k functions form an orthonormal set belonging to V_k (whose dimension is 2^k), and are therefore a basis of V_k.

5 The algorithm for the WWW transform

In this section we describe the algorithmic part. Given a workload of point queries and a data vector to be approximated, we build workload-based wavelet synopses of the data vector using a weighted Haar basis. The algorithm has two parts:

1. Computing efficiently a weighted Haar basis, given a workload of point queries (Sec. 5.1).
2. Computing efficiently the weighted Haar wavelet transform with respect to the chosen basis (Sec. 5.2).
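As a quick numerical check of the closed forms derived above (a sketch of our own, not part of the paper; the helper name `basis_values` is ours), the following computes x_k and y_k from given l_k and r_k and verifies both the orthogonality condition x_k l_k = y_k r_k and the unit-norm condition x_k^2 l_k + y_k^2 r_k = 1:

```python
import math

def basis_values(l_k, r_k):
    """Return (x_k, y_k): the values of a weighted Haar basis function on its
    positive and negative parts, given the sums of workload coefficients
    l_k (positive part) and r_k (negative part)."""
    x_k = math.sqrt(r_k / (l_k * r_k + l_k ** 2))
    y_k = math.sqrt(l_k / (l_k * r_k + r_k ** 2))
    return x_k, y_k

# Example with unequal weight sums.
l_k, r_k = 0.3, 0.7
x_k, y_k = basis_values(l_k, r_k)

# Orthogonality condition: x_k * l_k == y_k * r_k.
assert abs(x_k * l_k - y_k * r_k) < 1e-12
# Unit norm: x_k^2 * l_k + y_k^2 * r_k == 1.
assert abs(x_k ** 2 * l_k + y_k ** 2 * r_k - 1.0) < 1e-12

# Uniform case: at the top level l_k = r_k = 1/2, giving x_k = y_k = 1,
# i.e. the standard (normalized) Haar basis function on [0, 1).
assert basis_values(0.5, 0.5) == (1.0, 1.0)
```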

5.1 Computing efficiently a weighted Haar basis

Note that at this point we already have a method for finding an orthonormal basis with respect to a given workload-based inner product. Recall that in order to know x_k and y_k for every basis function we need to know the corresponding l_k and r_k. We now compute all those partial sums in linear time. Suppose that the basis functions are arranged in an array, as in a binary-tree representation. The highest-resolution functions are at indexes N/2, ..., N-1, which form the lowest level of the tree. The functions of the next resolution level are at indexes N/4, ..., N/2 - 1, and so on, until the constant basis function at index 0. Notice that for the lowest-level (highest-resolution) functions (indexes N/2, ..., N-1) we already have their l_k's and r_k's: these are exactly the workload coefficients. This can easily be seen in Fig. 4 for the lower four functions. Notice that after computing the accumulated sums for the functions at resolution level i, we have all the information needed for computing the functions of the level above: let u_k be a function at resolution level i and let u_2k, u_2k+1 be at level i+1, with their supports included in u_k's support (u_k is their ancestor in the binary tree of functions). We can then use the following formula for computing l_k and r_k:

    l_k = l_2k + r_2k,        r_k = l_2k+1 + r_2k+1.

This can be seen in the example of Fig. 4. Thus, we can compute in one pass only the lowest level, and build the upper levels bottom-up (in a way somewhat similar to the Haar wavelet transform). At the end of each phase of the algorithm (a phase being the computation of the functions of a specific level) we keep a temporary array holding all the pairwise sums of the l_k's and r_k's from that phase, and use them for computing the functions of the next phase. Clearly, the running time is O(N). The number of I/Os is O(N/B) (where B is the disk block size), since the process is similar to the computation of the Haar wavelet transform. A pseudo-code of the computation can be found in Fig. 14.
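The bottom-up pass just described can be sketched as follows (a minimal illustration of our own, not the paper's pseudo-code of Fig. 14; the function name `compute_lr` and the convention for the constant function's l/r are ours). Functions are stored in heap order, with index 0 the constant function and indexes N/2, ..., N-1 the highest-resolution functions:

```python
def compute_lr(c):
    """Compute l[k] and r[k] for every weighted Haar basis function, with
    functions stored in heap order: index 0 is the constant function and
    indexes N/2 .. N-1 are the highest-resolution functions. Runs in O(N)."""
    n = len(c)  # must be a power of two
    l = [0.0] * n
    r = [0.0] * n
    # Lowest level: l and r are the workload coefficients themselves.
    for k in range(n // 2, n):
        i = 2 * (k - n // 2)  # leftmost data interval under function k
        l[k], r[k] = c[i], c[i + 1]
    # Upper levels, bottom-up: l_k = l_2k + r_2k, r_k = l_2k+1 + r_2k+1.
    for k in range(n // 2 - 1, 0, -1):
        l[k] = l[2 * k] + r[2 * k]
        r[k] = l[2 * k + 1] + r[2 * k + 1]
    # Constant function: by our convention its whole support counts as "positive".
    l[0] = l[1] + r[1]
    r[0] = 0.0
    return l, r

c = [0.1, 0.2, 0.05, 0.15, 0.1, 0.1, 0.2, 0.1]
l, r = compute_lr(c)
# The root wavelet function (index 1) splits the total weight between half-supports:
assert abs(l[1] - sum(c[:4])) < 1e-12 and abs(r[1] - sum(c[4:])) < 1e-12
```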
The createf uncton() functon takes two sums of weght coeffcents correspondng to the functon s postve part and to the functon s negatve part, and buld a functon whose postve (resp. negatve) part s value s x k (resp. y k ) usng the followng formulae: x k = rk l k r k + l 2 k y k = lk l k r k + r 2 k 5.2 Computng a weghted Haar wavelet transform Gven the bass we would lke to effcently perform the wavelet transform wth respect to that bass. Let us look at the case of = 2 (Fg. 5). Suppose we would lke to represent the functon n Fg. 6. It s easy to compute the followng result (denote α as the coeffcent of f ): α 0 = yv 0 + xv 1 x + y α 1 = v 0 v 1 x + y (by solvng 2x2 matrx). otce that the coeffcents are weghted averages and dfferences, snce the transform generalzes the standard Haar transform (by takng x = y = 2 we get the standard Haar transform). It s easy to reconstruct the orgnal functon from the coeffcents: v 0 = α 0 + xα 1 v 1 = α 0 yα 1 Ths mples a straghtforward method to compute the wavelet transform (whch s I/O effcent as well) accordng to the way we compute a regular wavelet transform wth respect to the Haar 16

Figure 5: Weighted Haar transform with two functions

Figure 6: A simple function with 2 values over [0, 1)

We go over the data and compute the weighted differences, which are the coefficients of the bottom-level functions. We keep the weighted averages, which can be represented solely by the rest of the basis functions (the lower-resolution functions, as in the regular Haar wavelet transform), in another array. We repeat the process over the averages again and again until we have the overall average, which is added to our array as the coefficient of the constant function (v_0(x) = const). While computing the transform, in addition to reading the values of the signal, we need to read the basis function relevant to the current stage (in order to use the x_k and y_k of the function employed in the formula above). This is easy to do, since all the functions are stored in an array F, and the index of a function is determined by the iteration number and is identical to the index of the currently computed coefficient. A pseudo-code of the algorithm can be found in Fig. 15. As we know, the Haar wavelet transform is a linear-time algorithm. The steps of our algorithm are identical to the steps of the Haar algorithm, with the addition of reading the data at F[i] (the x_k and y_k of the function) during the i-th iteration. Therefore the I/O complexity of this phase remains O(N/B) (B is the disk block size), with O(N) running time. Once we have the coefficients with respect to the orthonormal basis, we keep the largest M coefficients, along with their corresponding M functions, and discard the smallest coefficients, relying on Thm. 1. We can do this in linear time using the M-approximate quantile algorithm [13].

6 Optimal synopsis for mean relative error

We next show that a variant of the weighted-wavelets-based algorithm minimizes the weighted L2 norm of the vector of relative errors, weighted by the query workload, using weighted wavelets. This demonstrates another use of assigning weights to data values, here used to minimize the mean-squared relative error measured over the data values.
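The reweighting idea just described can be sketched directly (our own illustration; the name `relative_error_weights` is ours, and `s` stands for the sanity bound used in the paper's relative-error definition). Dividing each workload coefficient by the squared data value turns a weighted squared absolute error into a weighted squared relative error:

```python
def relative_error_weights(c, d, s=1e-6):
    """Turn workload weights c_i into relative-error weights w_i = c_i / max(d_i, s)^2.
    Running the WB-MSE machinery of Sec. 5 with weights w then minimizes the
    workload-weighted L2 norm of *relative* errors (the sanity bound s guards
    against tiny or zero data values)."""
    return [c_i / max(d_i, s) ** 2 for c_i, d_i in zip(c, d)]

c = [0.25, 0.25, 0.25, 0.25]
d = [10.0, 20.0, 40.0, 80.0]
d_hat = [9.0, 21.0, 38.0, 84.0]  # some approximation of d
w = relative_error_weights(c, d)

# Weighted squared absolute error under w equals
# weighted squared relative error under c (for d_i >= s).
lhs = sum(w_i * (di - dhi) ** 2 for w_i, di, dhi in zip(w, d, d_hat))
rhs = sum(c_i * ((di - dhi) / di) ** 2 for c_i, di, dhi in zip(c, d, d_hat))
assert abs(lhs - rhs) < 1e-12
```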
Recall that in order to minimize the weighted L2 norm of relative errors, we need to minimize sum_{i=1}^{N} c_i ((d_i - d̂_i) / d_i)^2 (actually sum_{i=1}^{N} c_i ((d_i - d̂_i) / max{d_i, s})^2, but the idea is the same). Since D = d_1, ..., d_N is part of the input of the algorithm, it is fixed throughout the algorithm's execution. We can thus

divide each c_i by d_i^2 and get a new vector of weights: W = (c_1 / d_1^2, ..., c_N / d_N^2). Relying on our previous results, and using the new vector of weights, we minimize

    sum_{i=1}^{N} (c_i / d_i^2) (d_i - d̂_i)^2 = sum_{i=1}^{N} c_i ((d_i - d̂_i) / d_i)^2,

which is the W-weighted L2 norm of relative errors. Notice that in the uniform case (c_i = 1/N) the algorithm minimizes the mean relative error over all data values. As far as we know, this is the first algorithm that minimizes the mean relative error over the data values.

7 Experiments

In this section we demonstrate the advantage obtained by our workload-based wavelet synopses. All our experiments were done using the τ-Synopses system [15]. For our experimental studies we used both synthetic and real-life data sets. The synthetic data sets are taken from the TPC-H benchmark data, and the real-life data sets are taken from the Forest CoverType data provided by the KDD Data archive of the University of California. The data sets are:

1. TPCH1 - Data attribute 1 from table ORDERS, filtered by attribute O CUSTKEY, which contains about 150,000 distinct values.
2. KDD2048 - Data attribute Aspect from table CovTypeAgr, filtered by Elevation, from the KDD data, with a total of 2048 distinct values.

The sets of queries were generated independently by a Zipf distribution generator. We used queries of different skews, distributed according to several Zipf parameter values. We took the Zipf parameters 0.2, 0.5 and 0.8, in order to test the behavior of the synopses under different skews, ranging from close-to-uniform to highly skewed. The sets of queries contained 5000 queries over each data set.

In Fig. 7 we compare the standard wavelet synopsis from [16] with our WB-MSE wavelet synopsis. The standard synopsis is depicted as a solid line. We measured the WB-MSE as a function of synopsis size, measured as the number of coefficients in the synopsis. For each M = 10, 20, ..., 100 we built synopses of size M using both methods and compared the WB-MSE error, measured with respect to a given workload of queries. The workload contained 5000 Zipf-distributed point queries, with a Zipf parameter of 0.5.
The data set was the TPCH1 data. As the synopsis size increases, the error of the workload-based algorithm becomes much smaller than the error of the standard algorithm. The reason for this is that synopses of sizes 10, ..., 100 are very small with respect to data of size 150,000. Since the standard algorithm does not take the query workload into account, its results are more or less the same for all synopsis sizes in the experiment. The workload-based synopsis, however, adapts itself to the query workload, which is of size 5000. All the data values that are not queried by the workload are given very small importance weights, so the synopsis actually has to be accurate over fewer than 5000 values. Thus, there is a sharp decrease in the error of the workload-based algorithm as the synopsis size increases.

In Fig. 8 we ran a similar experiment, this time with the KDD2048 data. The standard synopsis is again depicted as a solid line. As in the previous experiment, we measured the WB-MSE as a function of synopsis size. For each M = 20, 40, ..., 200 we built synopses of size M using both

methods and compared the WB-MSE error, measured with respect to a given workload of queries. The workload contained 5000 Zipf-distributed point queries, with a Zipf parameter of 0.5. The data was the KDD2048 data, of size 2048. We see that for each synopsis size the error of the standard algorithm is approximately twice the error of the workload-based algorithm. The reason for this is that here the query workload is larger than the data set, in contrast to the previous experiment. Thus, most of the data is queried by the workload, so the importance weights given to data values are more uniform than in the previous experiment. Therefore, the error difference is smaller than in the previous experiment, since the advantage of the workload-based algorithm becomes more significant as the workload gets more skewed. However, since the workload-based synopsis adapts itself to the workload, its error is still better than that of the standard synopsis, which assumes a uniform distribution.

In Fig. 9 we compare the standard wavelet synopsis from [16] and the adaptive-greedy workload-based wavelet synopsis from [14] with our WB-MRE wavelet synopsis. The standard synopsis is depicted as a dotted line with x's. Since it is hard to distinguish between the other two synopses at this resolution, we zoom into this figure in Fig. 10. We measured the WB-MRE as a function of synopsis size, measured as the number of coefficients in the synopsis. For each M = 20, 40, ..., 200 we built synopses of size M using the three methods and compared the WB-MRE error, measured with respect to a given workload of queries. The workload contained 3000 Zipf-distributed point queries, with a Zipf parameter of 0.5. The data set was the KDD2048 data. Since the standard algorithm takes into account neither the query workload nor relative errors, its approximation error is substantially larger than the approximation errors of the workload-based algorithms, for every synopsis size.

In Fig. 10 we compare the adaptive-greedy workload-based synopsis from [14] with our WB-MRE synopsis.
The adaptive-greedy synopsis is depicted as a solid line. We measured the WB-MRE as a function of synopsis size, measured as the number of coefficients in the synopsis. For each M = 20, 40, ..., 200 we built synopses of size M using the two methods and compared the WB-MRE error, measured with respect to a given workload of queries. The workload contained 5000 Zipf-distributed point queries, with a Zipf parameter of 0.5. The data set was the KDD2048 data. For each synopsis size, the approximation error of the adaptive-greedy synopsis is noticeably larger than the error of our WB-MRE algorithm.

In Fig. 11 we depict the WB-MRE as a function of synopsis size, for three given query workloads, distributed with Zipf parameters 0.2, 0.5 and 0.8. The data set was the KDD2048 data set, and the workloads consisted of 5000 queries each. For each of the three given workloads we built synopses of size M = 50, 100, ..., 500 and depicted the WB-MRE as a function of synopsis size. It can be seen that many wavelet coefficients can be discarded before the error increases significantly. This is a desired feature for any synopsis. For example, for synopses of size 500 the WB-MRE is smaller than 0.05, and for synopses of size 250 the WB-MRE is smaller than 0.1. It can also be seen that the higher the skew, the more accurate the workload-based synopses. The reason is that when the skew gets higher, the synopsis has to be accurate over a smaller number of data values.

In Fig. 12 we compare the standard algorithm from [16] with our WB-MRE algorithm in a different way than before: we compare the ratio between the approximation error of the standard algorithm and the approximation error of the WB-MRE algorithm, for different workload skews. The comparison was done for three different query workloads, distributed with different Zipf parameters. The workloads contained 5000 queries each, distributed with Zipf parameters 0.2, 0.5 and 0.8 respectively. The data set was the KDD2048. For each given workload we measured the error ratio between the two synopses, for each synopsis size M = 50, 100, ..., 500. It is clearly seen

that the higher the skew of the workload, the higher the ratio between the approximation errors of the synopses. The reason is that as the workload gets farther from uniform, the advantage of the workload-based algorithms naturally becomes more significant over the standard synopsis, which assumes a uniform workload.

In Fig. 13 we show the robustness of the workload-based wavelet synopses to deviations from the predefined workload. The experiment addresses the problem of incorrect estimation of the future workload. When building our synopsis, we assumed the queries would be distributed as Zipf(0.2). We fixed the synopsis size and built our synopsis. We then used the synopsis to answer query workloads distributed differently than expected, e.g., with Zipf parameters 0.3, 0.4, etc. The figure depicts the WB-MRE as a function of the difference between the actual query distribution and our estimated query distribution (estimated as Zipf(0.2)). The skew difference is the difference between the actual Zipf parameter and the estimated Zipf parameter according to which we assumed the queries are distributed. We show that small errors in the workload estimation introduce only small errors in the quality of the approximation, and that the error grows continuously as the deviation from the predefined workload increases.

8 Conclusions

In this paper we introduce the use of weighted wavelets for building optimal workload-based wavelet synopses. We present two time-optimal and I/O-optimal algorithms for workload-based wavelet synopses, which minimize the WB-MSE and the WB-MRE error measures with respect to any given query workload. The advantage of optimal workload-based wavelet synopses, as well as their robustness, was demonstrated by our experiments.

Recently, and independently of our work, Muthukrishnan [17] presented an optimal workload-based wavelet synopsis with respect to the standard Haar basis. The algorithm for building the optimal synopsis is based on dynamic programming and takes O(N^2 M / log M) time. As noted above, the standard Haar basis is not orthonormal w.r.t.
the workload-based error metric, and an optimal synopsis w.r.t. this basis is not necessarily also an optimal enhanced wavelet synopsis. Obtaining optimal enhanced wavelet synopses for the standard Haar wavelets may be an interesting open problem. Also, as quadratic time is too costly for massive data sets, it would be interesting to obtain a time-efficient algorithm for such synopses. As far as approximation error is concerned, although in general optimal synopses w.r.t. the standard Haar basis and a weighted Haar basis are incomparable, both bases have the same characteristics. It would be interesting to compare the actual approximation errors of the two synopses for various data sets. This may indeed be the subject of future work.

Acknowledgments: We thank Leon Portman for helpful discussions and for his assistance in setting up the experiments on the τ-Synopses system. We also thank Prof. Nira Dyn for helpful discussions regarding wavelet theory.


More information

Lossy Compression. Compromise accuracy of reconstruction for increased compression.

Lossy Compression. Compromise accuracy of reconstruction for increased compression. Lossy Compresson Compromse accuracy of reconstructon for ncreased compresson. The reconstructon s usually vsbly ndstngushable from the orgnal mage. Typcally, one can get up to 0:1 compresson wth almost

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

Appendix B: Resampling Algorithms

Appendix B: Resampling Algorithms 407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles

More information

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced, FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Negative Binomial Regression

Negative Binomial Regression STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...

More information

CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING INTRODUCTION

CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING INTRODUCTION CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING N. Phanthuna 1,2, F. Cheevasuvt 2 and S. Chtwong 2 1 Department of Electrcal Engneerng, Faculty of Engneerng Rajamangala

More information

Grover s Algorithm + Quantum Zeno Effect + Vaidman

Grover s Algorithm + Quantum Zeno Effect + Vaidman Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Inductance Calculation for Conductors of Arbitrary Shape

Inductance Calculation for Conductors of Arbitrary Shape CRYO/02/028 Aprl 5, 2002 Inductance Calculaton for Conductors of Arbtrary Shape L. Bottura Dstrbuton: Internal Summary In ths note we descrbe a method for the numercal calculaton of nductances among conductors

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

= = = (a) Use the MATLAB command rref to solve the system. (b) Let A be the coefficient matrix and B be the right-hand side of the system.

= = = (a) Use the MATLAB command rref to solve the system. (b) Let A be the coefficient matrix and B be the right-hand side of the system. Chapter Matlab Exercses Chapter Matlab Exercses. Consder the lnear system of Example n Secton.. x x x y z y y z (a) Use the MATLAB command rref to solve the system. (b) Let A be the coeffcent matrx and

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Speeding up Computation of Scalar Multiplication in Elliptic Curve Cryptosystem

Speeding up Computation of Scalar Multiplication in Elliptic Curve Cryptosystem H.K. Pathak et. al. / (IJCSE) Internatonal Journal on Computer Scence and Engneerng Speedng up Computaton of Scalar Multplcaton n Ellptc Curve Cryptosystem H. K. Pathak Manju Sangh S.o.S n Computer scence

More information

Affine transformations and convexity

Affine transformations and convexity Affne transformatons and convexty The purpose of ths document s to prove some basc propertes of affne transformatons nvolvng convex sets. Here are a few onlne references for background nformaton: http://math.ucr.edu/

More information

Report on Image warping

Report on Image warping Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.

More information

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method

More information

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 1, July 2013

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 1, July 2013 ISSN: 2277-375 Constructon of Trend Free Run Orders for Orthogonal rrays Usng Codes bstract: Sometmes when the expermental runs are carred out n a tme order sequence, the response can depend on the run

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

An Interactive Optimisation Tool for Allocation Problems

An Interactive Optimisation Tool for Allocation Problems An Interactve Optmsaton ool for Allocaton Problems Fredr Bonäs, Joam Westerlund and apo Westerlund Process Desgn Laboratory, Faculty of echnology, Åbo Aadem Unversty, uru 20500, Fnland hs paper presents

More information

8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS

8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 493 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces you have studed thus far n the text are real vector spaces because the scalars

More information

Edge Isoperimetric Inequalities

Edge Isoperimetric Inequalities November 7, 2005 Ross M. Rchardson Edge Isopermetrc Inequaltes 1 Four Questons Recall that n the last lecture we looked at the problem of sopermetrc nequaltes n the hypercube, Q n. Our noton of boundary

More information

Lecture 4: Universal Hash Functions/Streaming Cont d

Lecture 4: Universal Hash Functions/Streaming Cont d CSE 5: Desgn and Analyss of Algorthms I Sprng 06 Lecture 4: Unversal Hash Functons/Streamng Cont d Lecturer: Shayan Oves Gharan Aprl 6th Scrbe: Jacob Schreber Dsclamer: These notes have not been subjected

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,

More information

5 The Rational Canonical Form

5 The Rational Canonical Form 5 The Ratonal Canoncal Form Here p s a monc rreducble factor of the mnmum polynomal m T and s not necessarly of degree one Let F p denote the feld constructed earler n the course, consstng of all matrces

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Graph Reconstruction by Permutations

Graph Reconstruction by Permutations Graph Reconstructon by Permutatons Perre Ille and Wllam Kocay* Insttut de Mathémathques de Lumny CNRS UMR 6206 163 avenue de Lumny, Case 907 13288 Marselle Cedex 9, France e-mal: lle@ml.unv-mrs.fr Computer

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Some modelling aspects for the Matlab implementation of MMA

Some modelling aspects for the Matlab implementation of MMA Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton

More information

Yong Joon Ryang. 1. Introduction Consider the multicommodity transportation problem with convex quadratic cost function. 1 2 (x x0 ) T Q(x x 0 )

Yong Joon Ryang. 1. Introduction Consider the multicommodity transportation problem with convex quadratic cost function. 1 2 (x x0 ) T Q(x x 0 ) Kangweon-Kyungk Math. Jour. 4 1996), No. 1, pp. 7 16 AN ITERATIVE ROW-ACTION METHOD FOR MULTICOMMODITY TRANSPORTATION PROBLEMS Yong Joon Ryang Abstract. The optmzaton problems wth quadratc constrants often

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS) Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Temperature. Chapter Heat Engine

Temperature. Chapter Heat Engine Chapter 3 Temperature In prevous chapters of these notes we ntroduced the Prncple of Maxmum ntropy as a technque for estmatng probablty dstrbutons consstent wth constrants. In Chapter 9 we dscussed the

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

HMMT February 2016 February 20, 2016

HMMT February 2016 February 20, 2016 HMMT February 016 February 0, 016 Combnatorcs 1. For postve ntegers n, let S n be the set of ntegers x such that n dstnct lnes, no three concurrent, can dvde a plane nto x regons (for example, S = {3,

More information

Research Article Green s Theorem for Sign Data

Research Article Green s Theorem for Sign Data Internatonal Scholarly Research Network ISRN Appled Mathematcs Volume 2012, Artcle ID 539359, 10 pages do:10.5402/2012/539359 Research Artcle Green s Theorem for Sgn Data Lous M. Houston The Unversty of

More information

Introduction to information theory and data compression

Introduction to information theory and data compression Introducton to nformaton theory and data compresson Adel Magra, Emma Gouné, Irène Woo March 8, 207 Ths s the augmented transcrpt of a lecture gven by Luc Devroye on March 9th 207 for a Data Structures

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

Queueing Networks II Network Performance

Queueing Networks II Network Performance Queueng Networks II Network Performance Davd Tpper Assocate Professor Graduate Telecommuncatons and Networkng Program Unversty of Pttsburgh Sldes 6 Networks of Queues Many communcaton systems must be modeled

More information

1 Matrix representations of canonical matrices

1 Matrix representations of canonical matrices 1 Matrx representatons of canoncal matrces 2-d rotaton around the orgn: ( ) cos θ sn θ R 0 = sn θ cos θ 3-d rotaton around the x-axs: R x = 1 0 0 0 cos θ sn θ 0 sn θ cos θ 3-d rotaton around the y-axs:

More information

Workshop: Approximating energies and wave functions Quantum aspects of physical chemistry

Workshop: Approximating energies and wave functions Quantum aspects of physical chemistry Workshop: Approxmatng energes and wave functons Quantum aspects of physcal chemstry http://quantum.bu.edu/pltl/6/6.pdf Last updated Thursday, November 7, 25 7:9:5-5: Copyrght 25 Dan Dll (dan@bu.edu) Department

More information

Module 2. Random Processes. Version 2 ECE IIT, Kharagpur

Module 2. Random Processes. Version 2 ECE IIT, Kharagpur Module Random Processes Lesson 6 Functons of Random Varables After readng ths lesson, ou wll learn about cdf of functon of a random varable. Formula for determnng the pdf of a random varable. Let, X be

More information

Statistical Inference. 2.3 Summary Statistics Measures of Center and Spread. parameters ( population characteristics )

Statistical Inference. 2.3 Summary Statistics Measures of Center and Spread. parameters ( population characteristics ) Ismor Fscher, 8//008 Stat 54 / -8.3 Summary Statstcs Measures of Center and Spread Dstrbuton of dscrete contnuous POPULATION Random Varable, numercal True center =??? True spread =???? parameters ( populaton

More information