Overcoming Limitations of Sampling for Aggregation Queries

CIS 6930 Approximate Query Processing, Paper Presentation, Spring 2004
Instructor: Dr. Alin Dobra
Overcoming Limitations of Sampling for Aggregation Queries
Authors: Surajit Chaudhuri, Gautam Das, Mayur Datar, Rajeev Motwani, and Vivek R. Narasayya (ICDE 2001)
Presented by: Andréa Matsunaga (ammatsun@ufl.edu)
2004, UFL-COE-ECE

Outline
- Introduction
  - The need for Approximate Query Processing
  - Issues with uniform sampling
- Solutions
  - Outlier-indexes
  - Exploiting workload information
- Experimental results

Introduction
- Data analysis over large data is hard
- Data analytics often do not need exact answers; ballpark estimates are enough
- Examples:
  - On-Line Analytical Processing (OLAP) / Decision Support, e.g., what is the percent increase in the sales of Windows XP over the last year in California?
  - Data Mining: building models (e.g., decision trees) does not require precise counts
- Focus on aggregate queries

Issues
Limitations of uniform sampling in answering aggregation queries:
- Data skew (large data variance): addressed by outlier-indexes
- Low selectivity and small groups: addressed by exploiting workload information

Data Skew Effect Example
Relation R (10,000 tuples), with columns K and C:
- 99% of the tuples (9,900) have C = 1
- 1% of the tuples (100) have C = 1000
- SUM(C) = 109,900
Take a 1% uniform sample (100 tuples) and extrapolate (multiply by 100):
- No tuple with C = 1000 in the sample: Est(SUM(C)) = 10,000, a severe underestimate
- Exactly 1 tuple with C = 1000 in the sample: Est(SUM(C)) = 109,900 (probability P ≈ 0.37)
- 2 or more tuples with C = 1000 in the sample: Est(SUM(C)) = 209,800, 309,700, ..., a severe overestimate
Probability of 0.63 of getting a large error in the estimate!
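The arithmetic of this example can be checked with a short Python sketch. It assumes the sample is drawn without replacement, so the number of sampled outliers follows a hypergeometric distribution; the function and constant names are illustrative and do not come from the slides.

```python
from math import comb

N_TUPLES = 10_000      # size of relation R
N_OUTLIERS = 100       # tuples with C = 1000 (all other tuples have C = 1)
SAMPLE_SIZE = 100      # 1% uniform sample
SCALE = N_TUPLES // SAMPLE_SIZE  # extrapolation factor (100)


def prob_outliers_in_sample(k):
    """Probability that exactly k outliers land in the sample (hypergeometric)."""
    ways = comb(N_OUTLIERS, k) * comb(N_TUPLES - N_OUTLIERS, SAMPLE_SIZE - k)
    return ways / comb(N_TUPLES, SAMPLE_SIZE)


def estimated_sum(k):
    """Extrapolated SUM(C) when the sample contains exactly k outliers."""
    return SCALE * ((SAMPLE_SIZE - k) * 1 + k * 1000)


if __name__ == "__main__":
    for k in range(4):
        print(k, estimated_sum(k), round(prob_outliers_in_sample(k), 3))
    # Only k = 1 reproduces the true sum; every other outcome is a large error.
    print("P(large error) =", round(1 - prob_outliers_in_sample(1), 2))
```

Only the case of exactly one sampled outlier gives the true sum of 109,900, which is why the probability of a large error is the complement of that single case.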

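Theorem 1
- $R$: relation of size $N$
- $\{y_1, y_2, \ldots, y_N\}$: set of values associated with the tuples in the relation
- $U$: uniform sample of the $y_i$'s of size $n$
Then
$$Y_e = \frac{N}{n} \sum_{y_i \in U} y_i$$
is an unbiased estimator of the actual sum $Y = \sum_{i=1}^{N} y_i$, with standard error
$$\varepsilon = \frac{N S}{\sqrt{n}},$$
where $S$ is the standard deviation of the values:
$$S = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{Y})^2}, \qquad \bar{Y} = \frac{Y}{N}$$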

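Theorem 1 - Proof
Unbiasedness, using the properties of expectation $E(aX) = a\,E(X)$ and $E(X_1 + X_2) = E(X_1) + E(X_2)$, and the fact that each sampled value has expectation $E(y_i) = \sum_{j=1}^{N} \frac{1}{N}\, y_j = \bar{Y}$:
$$E(Y_e) = E\left(\frac{N}{n} \sum_{y_i \in U} y_i\right) = \frac{N}{n} \sum_{y_i \in U} E(y_i) = \frac{N}{n}\, n\, \bar{Y} = Y$$
Standard error, using the properties of variance (for independent random variables) $Var(aX) = a^2\, Var(X)$ and $Var(X_1 + X_2) = Var(X_1) + Var(X_2)$:
$$Var(Y_e) = \frac{N^2}{n^2}\, Var\left(\sum_{y_i \in U} y_i\right) = \frac{N^2}{n^2}\, n\, S^2 = \frac{N^2 S^2}{n}, \qquad \varepsilon = \sqrt{Var(Y_e)} = \frac{N S}{\sqrt{n}}$$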
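A small numerical check of Theorem 1 is sketched below in Python. It makes two assumptions that the slides do not state explicitly: the sample values are drawn independently (that is, with replacement, as the variance properties in the proof require), and $S$ is the population standard deviation with a $1/N$ normalization. The toy relation and the function names are hypothetical. The sketch enumerates every possible sample, so the comparison is exact rather than statistical.

```python
from itertools import product
from math import isclose, sqrt


def exact_estimator_moments(values, n):
    """Mean and variance of Y_e = (N/n) * sum(sample) over all ordered
    samples of size n drawn with replacement (each equally likely)."""
    N = len(values)
    estimates = [(N / n) * sum(sample) for sample in product(values, repeat=n)]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean, var


def theorem1_standard_error(values, n):
    """Standard error N * S / sqrt(n), with S the population standard deviation."""
    N = len(values)
    y_bar = sum(values) / N
    S = sqrt(sum((y - y_bar) ** 2 for y in values) / N)
    return N * S / sqrt(n)


if __name__ == "__main__":
    values = [1, 2, 3, 4, 100]   # hypothetical skewed toy relation
    mean, var = exact_estimator_moments(values, n=3)
    print("E(Y_e) =", mean, " true sum =", sum(values))
    print("sqrt(Var(Y_e)) =", sqrt(var))
    print("N*S/sqrt(n)    =", theorem1_standard_error(values, n=3))
```

Under sampling without replacement the standard error would be somewhat smaller; that case is not covered by this sketch.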

Solution 1: Outlier Indexing
To handle data skew in an aggregation query.
The idea:
- Separate the outliers ($R_O$) from the rest of the data, the non-outliers ($R_{NO}$), into an outlier index
- Keep a uniform random sample of the remaining data
- Use the outlier index as well as the random sample to answer queries

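Outlier Indexing Implementation
Pre-processing:
1) Determine the outliers $R_O$ of R
2) Sample the non-outliers $R_{NO}$ (the $R_{NO}$ sample)
Query processing:
3) Aggregate the outliers: run the query on $R_O$ to obtain $A_1$
4) Aggregate the non-outliers: run the query on the $R_{NO}$ sample and extrapolate to obtain $A_2$
5) Combine the aggregates: $A_1 + A_2$
Note: since DB content changes over time, the selection of outlier indexes and samples should be refreshed appropriately.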
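The five steps can be sketched in Python for a SUM query as follows. This is an illustration under stated assumptions, not the authors' implementation: the relation is a plain list of values, the outlier test is passed in by the caller, and `build_outlier_index` and `estimate_sum` are hypothetical names.

```python
import random


def build_outlier_index(values, is_outlier, sample_size, rng):
    """Pre-processing (steps 1-2): split off the outliers, then keep a
    uniform random sample of the remaining (non-outlier) values."""
    outliers = [v for v in values if is_outlier(v)]
    non_outliers = [v for v in values if not is_outlier(v)]
    sample = rng.sample(non_outliers, sample_size)
    return outliers, sample, len(non_outliers)


def estimate_sum(outliers, sample, num_non_outliers, predicate=lambda v: True):
    """Query processing (steps 3-5) for SUM with an optional selection."""
    a1 = sum(v for v in outliers if predicate(v))          # exact part
    scale = num_non_outliers / len(sample)                 # extrapolation factor
    a2 = scale * sum(v for v in sample if predicate(v))    # estimated part
    return a1 + a2                                         # combined answer


if __name__ == "__main__":
    relation = [1] * 9900 + [1000] * 100   # relation from the data skew example
    rng = random.Random(0)
    outliers, sample, n_rest = build_outlier_index(
        relation, lambda v: v >= 1000, sample_size=100, rng=rng)
    print(estimate_sum(outliers, sample, n_rest))
```

On the skewed relation from the earlier example the estimate is exact, because once the outliers are removed the remaining values have zero variance.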

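Outlier Selection: Definition 1
For a sub-relation $R' \subseteq R$, $\varepsilon(R')$ is the standard error in estimating the sum of values in $R'$ (using uniform sampling followed by extrapolation).
An optimal outlier-index $R_O(R, C, \tau)$ is defined as a sub-relation $R_O \subseteq R$ such that:
- $|R_O| \le \tau$
- $\varepsilon(R \setminus R_O) = \min_{R' \subseteq R,\ |R'| \le \tau} \{\varepsilon(R \setminus R')\}$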

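Outlier Selection: Theorem 2
Consider a multiset $R = \{y_1, y_2, \ldots, y_N\}$ where the $y_i$'s are in sorted order. Let $R_O \subseteq R$ be the subset such that:
- $|R_O| \le \tau$
- $S(R \setminus R_O) = \min_{R' \subseteq R,\ |R'| \le \tau} \{S(R \setminus R')\}$
Then there exists some $0 \le \tau' \le \tau$ such that
$$R_O = \{y_i \mid 1 \le i \le \tau'\} \cup \{y_i \mid (N + \tau' + 1 - \tau) \le i \le N\}$$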

Outlier Selection: Algorithm
1) Read the values in column C of the relation R. Let $\{y_1, y_2, \ldots, y_N\}$ be the sorted order of the values appearing in C (each value corresponds to a tuple).
2) For $i = 1$ to $\tau + 1$, compute $E(i) = S(\{y_i, y_{i+1}, \ldots, y_{N-\tau+i-1}\})$.
3) Let $i'$ be the value of $i$ where $E(i)$ takes its minimum value. Then the outlier-index is the set of tuples that correspond to the values $\{y_j \mid 1 \le j \le \tau'\} \cup \{y_j \mid (N + \tau' + 1 - \tau) \le j \le N\}$, where $\tau' = i' - 1$.
The algorithm depends on computing standard deviations. A standard deviation can be updated in O(1) time per insertion or deletion, e.g., $E(i+1)$ can be computed from $E(i)$, $y_i$ and $y_{N-\tau+i}$.
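A minimal Python sketch of this sliding-window search follows. It assumes the values fit in memory and that $S$ is the population standard deviation; running sums of the values and of their squares give the O(1) update between consecutive windows. The function name `select_outliers` is illustrative.

```python
from math import sqrt


def select_outliers(values, tau):
    """Return (low_outliers, high_outliers, min_std) for an outlier budget tau.

    Slides a window of N - tau consecutive sorted values and keeps the
    window with the smallest standard deviation; the values outside that
    window form the outlier-index.
    """
    ys = sorted(values)
    N = len(ys)
    if not 0 <= tau < N:
        raise ValueError("tau must satisfy 0 <= tau < len(values)")
    m = N - tau                       # window size
    s = sum(ys[:m])                   # running sum of the window
    sq = sum(y * y for y in ys[:m])   # running sum of squares

    def std(total, total_sq):
        # Population variance from running sums; clamp rounding noise at 0.
        return sqrt(max(m * total_sq - total * total, 0) / (m * m))

    best_i, best_e = 1, std(s, sq)    # window i = 1 (1-based, as in the slides)
    for i in range(2, tau + 2):       # windows i = 2 .. tau + 1
        leaving, entering = ys[i - 2], ys[m + i - 2]
        s += entering - leaving       # O(1) update of both running sums
        sq += entering * entering - leaving * leaving
        e = std(s, sq)
        if e < best_e:
            best_i, best_e = i, e

    tau_low = best_i - 1              # tau' in the slides
    return ys[:tau_low], ys[N - (tau - tau_low):], best_e


if __name__ == "__main__":
    low, high, e = select_outliers([1] * 9900 + [1000] * 100, tau=100)
    print(len(low), len(high), e)
```

Sorting dominates the cost, so the whole selection runs in O(N log N) time.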

Outlier Selection: Example
Relation R: 10,000 tuples, $\bar{Y}$ = 10.99, Y = 109,900
- 99% (9,900 tuples) have value 1
- 1% (100 tuples) have value 1000: the outliers!
For τ = 100: E(1) = 999, E(2) = 1409, E(3) = 1725, ..., E(101) = 999
CREATE VIEW c_otl_idx AS (SELECT * FROM R WHERE C >= 1000)

Low Selectivity and Small Groups Effect Example
Relation R and its sample:
- Query with group-bys: the sample may not contain even a single row that belongs to a given sub-relation (group)
- Query with low selectivity: the sample may not contain even a single row selected by the query
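To put a number on this effect, the Python sketch below computes the probability that a uniform sample drawn without replacement contains no row of a small group. The relation size, group size and sampling fraction are hypothetical values chosen for illustration, not figures from the slides.

```python
from math import comb


def prob_sample_misses_group(relation_size, group_size, sample_size):
    """Probability that a uniform sample (without replacement) contains
    no row at all from a group of the given size."""
    return (comb(relation_size - group_size, sample_size)
            / comb(relation_size, sample_size))


if __name__ == "__main__":
    # Hypothetical setting: 10,000 rows, 1% sample, groups of varying size.
    for group_size in (1, 10, 100):
        p = prob_sample_misses_group(10_000, group_size, 100)
        print(group_size, round(p, 3))
```

With a 1% sample, a group of 10 rows is missed entirely about nine times out of ten, so a group-by query would usually return no estimate at all for that group.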

Solution 2: Exploiting Workload Information
To handle low selectivity and small groups.
The idea:
- Use weighted sampling
- Sample more from subsets of data that are small in size but are important (have high usage)
- Exploit DB access pattern locality
- Use pre-computed samples

Exploiting Workload Information
Steps:
1) Workload Collection: obtain a workload consisting of representative queries against the DB (e.g., using Microsoft SQL Server Profiler).
2) Trace Query Patterns: analyze the workload to obtain parsed information (e.g., the set of selection conditions that are posed).
3) Trace Tuple Usage: the execution of the workload reveals additional information on the usage of specific tuples (e.g., frequency of access to each tuple). Since tracking this information at the level of tuples can be expensive, it can be kept at a coarser granularity (e.g., at page level). For the experiments, it is assumed that a tuple $t_i$ has weight $w_i$ if the tuple $t_i$ is required to answer $w_i$ queries in the workload.
4) Weighted Sampling: perform sampling by taking into account the weights of the tuples from step 3. The probability of accepting tuple $t_i$ into the sample is $p_i = n \cdot w'_i$, where
$$w'_i = \frac{w_i}{\sum_{j=1}^{N} w_j}$$
The normalized weight $w'_i$ needs to be stored together with the tuple, since its inverse (the multiplication factor) will be used to answer the aggregate query.
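Step 4 can be sketched in Python as below. This is an illustration under assumptions, not the authors' code: each tuple is accepted independently with probability $p_i = n \cdot w'_i$, capped at 1 (the cap is an assumption added here so that the value stays a valid probability), and the SUM estimate scales each sampled value by $1/p_i$. All function names are hypothetical.

```python
import random


def acceptance_probabilities(weights, n):
    """p_i = n * w_i / sum(w), capped at 1 (the cap is an assumption)."""
    total = sum(weights)
    return [min(1.0, n * w / total) for w in weights]


def weighted_sample(values, weights, n, rng):
    """Accept each tuple independently with probability p_i and store p_i
    next to the value, because its inverse is the multiplication factor."""
    probs = acceptance_probabilities(weights, n)
    return [(v, p) for v, p in zip(values, probs) if p > 0 and rng.random() < p]


def estimate_sum(sample, predicate=lambda v: True):
    """Scale every selected sampled value by the inverse of its probability."""
    return sum(v / p for v, p in sample if predicate(v))


if __name__ == "__main__":
    rng = random.Random(42)
    values = list(range(1, 1001))
    # Hypothetical workload: tuples with value > 900 are accessed 20x as often.
    weights = [20 if v > 900 else 1 for v in values]
    sample = weighted_sample(values, weights, n=100, rng=rng)
    print(len(sample), estimate_sum(sample, lambda v: v > 900))
```

Tuples with weight 0 are never sampled in this sketch, so queries that touch only such tuples cannot be answered from the sample. This is one reason the workload has to be representative of future queries.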

Exploiting Workload Information
When does weighted sampling based on workload information work well?
- The access patterns of the queries are local
- The workload is a good representative of future queries

Experimental Setup
- Platform: Dell Precision 610 system with a Pentium III Xeon 450 MHz processor, 128 MB RAM and an external 23GB hard drive
- Databases: 100MB TPC-R databases. The TPC-R benchmark was modified to vary the degree of skew (determined by the Zipfian parameter z), since the original data is generated from a uniform distribution
- Workloads: random query generation program with the SUM aggregate function
- Parameters:
  a) the skew of the data (z) was varied over 1, 1.5, 2, 2.5, and 3
  b) the sampling fraction (f) was varied over a wide range, from 1% to 100%
  c) the storage for the outlier-index was varied over 1%, 5%, 10%, and 20%
  d) results were averaged over 3 runs
- Techniques:
  - USAMP: uniform sampling
  - WSAMP: weighted sampling
  - WSAMP+OTLIDX: weighted sampling + outlier-indexing

Experimental Results

Questions? Thank you!