CIS 6930 Approximate Query Processing, Paper Presentation, Spring 2004 - Instructor: Dr. Alin Dobra
Overcoming Limitations of Sampling for Aggregation Queries
Authors: Surajit Chaudhuri, Gautam Das, Mayur Datar, Rajeev Motwani, and Vivek R. Narasayya
ICDE 2001
Presented by: Andréa Matsunaga (ammatsu@ufl.edu)
2004, UFL-COE-ECE
Outline
Introduction
  The need for Approximate Query Processing
Issues with uniform sampling
Solutions
  Outlier-indexes
  Exploiting workload information
Experimental results
Introduction
Data analysis over large data is hard.
Data analytics often do not need exact answers; ballpark estimates are enough.
Examples:
  On-Line Analytical Processing (OLAP) / Decision Support, e.g. what is the percent increase in the sales of Windows XP over last year in California?
  Data Mining: building models (e.g. decision trees) does not require precise counts.
Focus on aggregate queries.
Issues
Limitations of uniform sampling in answering aggregation queries, and the solution for each:
  Data skew (large data variance): outlier-indexes
  Low selectivity and small groups: exploiting workload information
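Data Skew Effect: Example
Relation R (10,000 tuples): 99% of the tuples have C = 1 and 1% have C = 1000, so SUM(C) = 9,900 × 1 + 100 × 1000 = 109,900.
Take a 1% uniform sample (100 tuples) and extrapolate (multiply by 100):
  No tuple with 1000 in the sample: Est(SUM(C)) = 10,000. Severe underestimate, because no outlier is in the sample.
  Exactly 1 tuple with 1000 in the sample: Est(SUM(C)) = 109,900. This happens with probability P ≈ 0.37.
  2 or more tuples with 1000 in the sample: Est(SUM(C)) = 209,800, 309,700, ... Severe overestimate, because too many outliers are in the sample.
Probability of 0.63 of getting a large error in the estimate!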
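The numbers on this slide can be reproduced with a short calculation. The sketch below assumes the sample is drawn without replacement, so the number of outliers in the sample follows a hypergeometric distribution; the function names are illustrative and do not come from the paper.

```python
from math import comb

N, OUTLIERS, SAMPLE = 10_000, 100, 100  # relation size, tuples with C = 1000, sample size
SCALE = N // SAMPLE                     # extrapolation factor (100)


def prob_outliers_in_sample(k: int) -> float:
    """Hypergeometric probability that exactly k outliers land in the sample."""
    return comb(OUTLIERS, k) * comb(N - OUTLIERS, SAMPLE - k) / comb(N, SAMPLE)


def estimate(k: int) -> int:
    """Extrapolated SUM(C) when the sample holds k outliers and SAMPLE - k ones."""
    return SCALE * (k * 1000 + (SAMPLE - k) * 1)


if __name__ == "__main__":
    for k in range(4):
        print(k, estimate(k), round(prob_outliers_in_sample(k), 3))
    # Probability of anything other than exactly one outlier in the sample.
    print("P(large error) =", round(1 - prob_outliers_in_sample(1), 2))
```

Only the case of exactly one outlier in the sample gives the true sum; every other case is off by at least 99,900.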
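Theorem 1
Let $R$ be a relation of size $N$, let $\{y_1, y_2, \dots, y_N\}$ be the set of values associated with the tuples in the relation, and let $U$ be a uniform sample of the $y_i$'s of size $n$. Then
$$Y_e = \frac{N}{n} \sum_{y_i \in U} y_i$$
is an unbiased estimator of the actual sum
$$Y = \sum_{i=1}^{N} y_i$$
with standard error
$$\varepsilon = \frac{N S}{\sqrt{n}} \sqrt{1 - \frac{n}{N}}$$
where $S$ is the standard deviation
$$S = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2}$$
and $\bar{Y} = Y / N$ is the mean of the values.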
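Theorem 1 - Proof
Estimator, actual sum, and standard deviation:
$$Y_e = \frac{N}{n} \sum_{y_i \in U} y_i, \qquad Y = \sum_{i=1}^{N} y_i, \qquad S = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2}, \qquad \varepsilon = \sqrt{\mathrm{Var}(Y_e)}$$
Properties of expectation: $E(aX) = a\,E(X)$
Properties of variance (for independent random variables): $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$ and $\mathrm{Var}\left(\sum X_i\right) = \sum \mathrm{Var}(X_i)$
Unbiasedness: each sampled value has $E(y_i) = \sum_{j=1}^{N} y_j\,P(y_j) = \frac{1}{N} \sum_{j=1}^{N} y_j = \bar{Y}$, so
$$E(Y_e) = E\left(\frac{N}{n} \sum_{y_i \in U} y_i\right) = \frac{N}{n} \sum_{y_i \in U} E(y_i) = \frac{N}{n}\, n\, \bar{Y} = N \bar{Y} = Y$$
Variance: taking $\mathrm{Var}(y_i) \approx S^2$ for each sampled value,
$$\mathrm{Var}(Y_e) = \mathrm{Var}\left(\frac{N}{n} \sum_{y_i \in U} y_i\right) = \frac{N^2}{n^2} \sum_{y_i \in U} \mathrm{Var}(y_i) = \frac{N^2}{n^2}\, n\, S^2 = \frac{N^2 S^2}{n}$$
so $\varepsilon = \sqrt{\mathrm{Var}(Y_e)} = N S / \sqrt{n}$.
Treating the sampled values as independent omits the finite-population factor $\sqrt{1 - n/N}$, which is close to 1 when $n \ll N$.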
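A small exhaustive check can make the theorem concrete. The sketch below assumes sampling without replacement and enumerates every possible sample of a tiny, made-up population. Under that assumption the mean of the estimates equals the true sum, and their spread equals $N S / \sqrt{n}$ times the finite-population factor $\sqrt{1 - n/N}$. The population values and the function name are illustrative only.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev, stdev


def check_theorem(values, n):
    """Enumerate all size-n samples and compare observed vs. predicted error."""
    N = len(values)
    # One extrapolated estimate Y_e per equally likely sample U.
    estimates = [N / n * sum(sample) for sample in combinations(values, n)]
    S = stdev(values)  # standard deviation with the N - 1 denominator
    predicted = N * S / sqrt(n) * sqrt(1 - n / N)
    # pstdev: spread of the estimator over the full set of possible samples.
    return mean(estimates), pstdev(estimates), predicted


if __name__ == "__main__":
    population = [1, 1, 1, 1, 5, 7, 1000]  # skewed toy data (assumed)
    avg, observed, predicted = check_theorem(population, 3)
    print("true sum:", sum(population), "mean of estimates:", avg)
    print("observed std error:", observed, "predicted:", predicted)
```

The single large value dominates $S$, which is why skewed data produces a large standard error for any fixed sample size.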
Solution 1: Outlier Indexing
Purpose: handle data skew in an aggregation query.
The idea:
  Separate the outliers ($R_O$) from the rest of the data, the non-outliers ($R_{NO}$), into an outlier index.
  Keep a uniform random sample of the remaining data.
  Use the outlier index as well as the random sample to answer queries.
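Outlier Indexing Implementation
Pre-processing:
  1) Determine the outliers $R_O$ in $R$.
  2) Sample the non-outliers $R_{NO}$.
Query processing:
  3) Aggregate the outliers: run the query on $R_O$ to get $A_1$.
  4) Aggregate the non-outliers: run the query on the $R_{NO}$ sample and extrapolate to get $A_2$.
  5) Combine the aggregates: $A = A_1 + A_2$.
Note: Since the DB content changes over time, the selection of outlier indexes and samples should be refreshed appropriately.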
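The five steps can be sketched on in-memory lists as follows. This is a minimal illustration for SUM queries, not the paper's implementation. It assumes the outlier test is given as a function, that at least one non-outlier exists, and that extrapolation scales by the ratio of non-outlier count to sample size. All names are hypothetical.

```python
import random


def build_outlier_index(rows, is_outlier, sample_fraction, seed=0):
    """Pre-processing: split off the outliers and sample the remaining rows."""
    rng = random.Random(seed)
    outliers = [r for r in rows if is_outlier(r)]        # step 1
    rest = [r for r in rows if not is_outlier(r)]
    k = max(1, round(sample_fraction * len(rest)))
    sample = rng.sample(rest, k)                         # step 2
    scale = len(rest) / k                                # extrapolation factor
    return outliers, sample, scale


def approx_sum(outliers, sample, scale, predicate=lambda v: True):
    """Query processing: exact sum over outliers plus extrapolated sample sum."""
    a1 = sum(v for v in outliers if predicate(v))        # step 3 (exact)
    a2 = scale * sum(v for v in sample if predicate(v))  # step 4 (estimate)
    return a1 + a2                                       # step 5


if __name__ == "__main__":
    rows = [1] * 9900 + [1000] * 100
    outliers, sample, scale = build_outlier_index(rows, lambda v: v >= 1000, 0.01)
    print(approx_sum(outliers, sample, scale))
```

On this skewed relation the non-outliers are all identical, so the estimate equals the true sum of 109,900 regardless of which rows are sampled.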
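Outlier Selection: Definition 1
For a sub-relation $R' \subseteq R$, $\varepsilon(R')$ is the standard error in estimating the sum of values in $R'$ (uniform sampling followed by extrapolation).
An optimal outlier-index $R_O(R, C, \tau)$ is defined as a sub-relation $R_O \subseteq R$ such that:
$$|R_O| \le \tau$$
$$\varepsilon(R \setminus R_O) = \min_{R' \subseteq R,\ |R'| \le \tau} \{\varepsilon(R \setminus R')\}$$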
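Outlier Selection: Theorem 2
Consider a multiset $R = \{y_1, y_2, \dots, y_N\}$ where the $y_i$'s are in sorted order. Let $R_O \subseteq R$ be the subset such that:
$$|R_O| \le \tau$$
$$S(R \setminus R_O) = \min_{R' \subseteq R,\ |R'| \le \tau} \{S(R \setminus R')\}$$
Then there exists some $0 \le \tau' \le \tau$ such that
$$R_O = \{y_i \mid 1 \le i \le \tau'\} \cup \{y_i \mid (N + \tau' + 1 - \tau) \le i \le N\}$$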
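Outlier Selection: Algorithm
1) Read the values in column $C$ of the relation $R$. Let $\{y_1, y_2, \dots, y_N\}$ be the sorted order of the values appearing in $C$ (each value corresponds to a tuple).
2) For $i = 1$ to $\tau + 1$, compute $E(i) = S(\{y_i, y_{i+1}, \dots, y_{N-\tau+i-1}\})$.
3) Let $i'$ be the value of $i$ where $E(i)$ takes its minimum value. Then the outlier-index is the set of tuples that correspond to the values $\{y_j \mid 1 \le j \le \tau'\} \cup \{y_j \mid (N + \tau' + 1 - \tau) \le j \le N\}$, where $\tau' = i' - 1$.
The algorithm depends on computing standard deviations. A standard deviation can be updated in $O(1)$ time per insertion or deletion; e.g. $E(i+1)$ can be computed from $E(i)$, $y_i$ (the value leaving the window), and $y_{N-\tau+i}$ (the value entering it).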
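The sliding-window search can be sketched as below. This is an illustrative version that keeps a running sum and sum of squares so each window's standard deviation is updated in constant time. That formulation is an assumption made here for brevity and can lose precision on large floating-point inputs. Offsets are 0-based, so offset `i` corresponds to $E(i+1)$ and the best offset equals $\tau'$.

```python
from math import sqrt


def select_outliers(values, tau):
    """Return (low_outliers, high_outliers, min_std) for an outlier budget tau."""
    y = sorted(values)
    N = len(y)
    m = N - tau  # window size: number of values kept as non-outliers
    if tau < 0 or m < 2:
        raise ValueError("need 0 <= tau <= len(values) - 2")

    def std(s1, s2):
        # Sample standard deviation (N - 1 style) from running sums.
        return sqrt(max((s2 - s1 * s1 / m) / (m - 1), 0.0))

    s1 = sum(y[:m])
    s2 = sum(v * v for v in y[:m])
    best_offset, best_std = 0, std(s1, s2)
    for i in range(1, tau + 1):
        leaving, entering = y[i - 1], y[m + i - 1]
        s1 += entering - leaving  # O(1) update of the window sums
        s2 += entering * entering - leaving * leaving
        e = std(s1, s2)
        if e < best_std:
            best_offset, best_std = i, e
    low = y[:best_offset]                    # the tau' smallest values
    high = y[N - (tau - best_offset):]       # the tau - tau' largest values
    return low, high, best_std


if __name__ == "__main__":
    low, high, s = select_outliers([1] * 9900 + [1000] * 100, tau=100)
    print(len(low), len(high), s)
```

Only $\tau + 1$ contiguous windows are examined, which is what Theorem 2 licenses: the optimal outlier set always consists of the smallest and largest values in sorted order.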
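Outlier Selection: Example
Relation R: 10,000 tuples; 99% (9,900 tuples) have value 1 and 1% (100 tuples) have value 1000.
$\bar{Y} = 10.99$, $Y = 109{,}900$
For $\tau = 100$: $E(1) = 9.99$, $E(2) = 14.09$, $E(3) = 17.25$, ..., $E(101) = 99.9$
$E(1) < E(2) < \dots < E(101)$, so the minimum is at $E(1)$ and the 100 tuples with value 1000 are the outliers.
CREATE VIEW c_otl_idx AS SELECT * FROM R WHERE (C >= 1000)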
Low Selectivity and Small Groups Effect: Example
Given a relation R and a sample of it:
  Query with group-by's: the sample may not contain even a single row that belongs to the sub-relation.
  Query with low selectivity: the sample may not contain even a single row selected by the query.
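How likely is it that a uniform sample misses a small group entirely? The sketch below computes that probability, assuming the sample is drawn without replacement; the relation size, group size, and sample size in the usage lines are made-up numbers.

```python
from math import comb


def prob_group_missed(N: int, group_size: int, n: int) -> float:
    """Probability that a uniform sample of n rows out of N contains
    no row from a group of group_size rows (sampling without replacement)."""
    return comb(N - group_size, n) / comb(N, n)


if __name__ == "__main__":
    # A 10-row group in a 10,000-row relation with a 1% sample.
    print(round(prob_group_missed(10_000, 10, 100), 3))
    # The same group with a 10% sample.
    print(round(prob_group_missed(10_000, 10, 1_000), 3))
```

With a 1% sample the 10-row group is missed roughly nine times out of ten, so a query restricted to that group usually gets no sampled rows to extrapolate from.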
Solution 2: Exploiting Workload Information
Purpose: handle low selectivity and small groups.
The idea:
  Use weighted sampling.
  Sample more from subsets of data that are small in size but are important (have high usage).
  Exploit DB access pattern locality.
  Use pre-computed samples.
Exploiting Workload Information
Steps:
1) Workload Collection: obtain a workload consisting of representative queries against the DB (e.g. with Microsoft SQL Server Profiler).
2) Trace Query Patterns: analyze the workload to obtain parsed information (e.g. the set of selection conditions that are posed).
3) Trace Tuple Usage: the execution of the workload reveals additional information on the usage of specific tuples (e.g. the frequency of access to each tuple). Since tracking this information at the level of tuples can be expensive, it can be kept at a coarser granularity (e.g. at page level). For the experiments, it is assumed that a tuple $t_i$ has weight $w_i$ if the tuple $t_i$ is required to answer $w_i$ queries in the workload.
4) Weighted Sampling: perform sampling by taking into account the weights of the tuples from step 3. The probability of accepting tuple $t_i$ into the sample is $p_i = n \cdot w_i'$, where $n$ is the sample size and
$$w_i' = \frac{w_i}{\sum_{j=1}^{N} w_j}$$
The normalized weight $w_i'$ needs to be stored together with the tuple, since the inverse of $p_i$ (the multiplication factor) will be used to answer the aggregate query.
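Step 4 can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: tuples are plain numbers, the weights have a positive total, the acceptance probability is clipped at 1 (a choice made here so heavily used tuples are always kept), and each sampled tuple stores its multiplication factor $1 / p_i$ directly. Function names are hypothetical.

```python
import random


def acceptance_probabilities(weights, n):
    """p_i = n * w_i / sum(w), clipped to 1 (the clipping is an assumption)."""
    total = sum(weights)
    return [min(1.0, n * w / total) for w in weights]


def weighted_sample(values, weights, n, seed=0):
    """Accept each tuple independently with probability p_i and keep 1 / p_i."""
    rng = random.Random(seed)
    sample = []
    for value, p in zip(values, acceptance_probabilities(weights, n)):
        if rng.random() < p:  # tuples with p == 0 are never accepted
            sample.append((value, 1.0 / p))
    return sample


def estimate_sum(sample, predicate=lambda v: True):
    """Scale each sampled value by its multiplication factor and add up."""
    return sum(v * factor for v, factor in sample if predicate(v))


if __name__ == "__main__":
    values = [5, 7, 300, 900]
    weights = [1, 1, 2, 4]  # e.g. number of workload queries touching each tuple
    sample = weighted_sample(values, weights, n=2)
    print(sample, estimate_sum(sample))
```

Dividing by the acceptance probability keeps the estimate unbiased for tuples with nonzero weight, while frequently accessed tuples are far more likely to be present in the sample.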
Exploiting Workload Information
When does weighted sampling based on workload information work well?
  The access patterns of the queries are local.
  We have a workload that is a good representative of future queries.
Experimental Setup
Platform: Dell Precision 610 system with a Pentium III Xeon 450 MHz processor, 128 MB RAM, and an external 23GB hard drive.
Databases: 100MB TPC-R databases. The TPC-R benchmark was modified to vary the degree of skew, determined by the parameter z of the Zipfian [5] distribution, since the original data is generated from a uniform distribution.
Workloads: random query generation program with the sum aggregate function.
Parameters:
  a) the skew of the data (z) was varied over 1, 1.5, 2, 2.5, and 3;
  b) the sampling fraction (f) was varied over a wide range from 1% to 100%;
  c) the storage for the outlier-index was varied over 1%, 5%, 10%, and 20%;
  d) results are averaged over 3 runs.
Techniques:
  USAMP: uniform sampling
  WSAMP: weighted sampling
  WSAMP+OTLIDX: weighted sampling + outlier-indexing
Experimental Results
Questions? Thank you!