Issues in Metric Selection and the TPC-D Single Stream Power


Alain Crolotte, AT&T Global Information Solutions

The purpose of this paper is to examine some of the issues faced by benchmark designers in situations where potentially very large and very small numbers need to be aggregated into a single number. We use, as an example, the case of TPC-D and examine the solution retained as described in the TPC-D Draft 6.0 [1]. The paper focuses on the basic issues which are at the core of the metric selection problem and on how choices on these basic issues naturally lead to choices among metric alternatives. To simplify the discussion, a mathematical appendix (Appendix A) has been provided, thus freeing the text from mathematics as much as possible.

The Problem

In a situation where test results for a single vendor can be distributed over a very wide range, how can one aggregate all the results into a single figure of merit so that (1) the underlying business model is represented in the metric and (2) meaningful vendor-to-vendor comparisons can be performed? Also, we assume that "small is good", i.e., a low metric equates to a good score. This is the case for the 19 TPC-D queries [1], which represent a large sample of realistic business questions in a decision-support environment. For a given system and a given database size, query execution times can vary over a wide range, e.g. some queries could take a few seconds while others could take several minutes or even hours. Small individual observations should, of course, equate to a good score. Averaging the observations in some fashion, i.e. finding a characteristic of central tendency, and taking the inverse will provide such a score. The simplest characteristic of central tendency is the simple mean, which is equal to the sum of all observations (query times for TPC-D) divided by the number of observations. The main advantage of using the simple mean is its "physical" significance: for TPC-D its inverse is the average number of queries processed per unit of time. One of the well-known drawbacks of the arithmetic average is its sensitivity to large observations which are out of scope. For instance, if all observations are within the 1 to 10 range, an abnormal observation of 1000 could dominate the arithmetic average so much that the resulting value could be meaningless. In this kind of situation it is customary to resort to the geometric average. However, the geometric average is not a universal panacea, as we see in the sequel.

For a given set of n observations (e.g. TPC-D query times) x_i, the simple mean \bar{x} is given by

    \bar{x} = (x_1 + x_2 + \cdots + x_n)/n

and the geometric average g is given by the formula

    g = (x_1 x_2 \cdots x_n)^{1/n}

(the query power metric is computed as the inverse of g). The formula defining g looks simple, but it is somewhat confusing when one tries to understand its physical meaning. By taking the logarithm and bearing in mind its elementary properties (i.e. \log(ab) = \log a + \log b and \log a^r = r \log a), the formula for g can be rewritten as

    \log g = (\log x_1 + \log x_2 + \cdots + \log x_n)/n

Viewing the geometric average in this fashion provides a more intuitive approach to understanding a metric based on a geometric average. This is a result of the fact that adding is "easier" than multiplying, the very reason why logarithms were invented. With this formula we see that the geometric average can be viewed as an average also! Remember that the simple mean is very sensitive to large out-of-scope values. The same applies to the geometric average, except that the sensitivity is to very small values of x_i, which result in very large (negative) values of \log x_i.
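As a quick illustration of this sensitivity, the following Python sketch (not part of the original paper; the function names are mine) computes both averages for observations in the 1 to 10 range, with and without an abnormal observation of 1000:

    import math

    def arithmetic_mean(xs):
        return sum(xs) / len(xs)

    def geometric_mean(xs):
        # computed through logarithms, as in the rewritten formula above
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    base = list(range(1, 11))               # 1, 2, ..., 10
    outlier = [1000] + list(range(2, 11))   # one abnormal observation of 1000

    print(arithmetic_mean(base), geometric_mean(base))        # ~5.5 and ~4.53
    print(arithmetic_mean(outlier), geometric_mean(outlier))  # ~105.4 and ~9.04

The abnormal value multiplies the simple mean by roughly a factor of 19, while the geometric average merely doubles.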

More Averages

In summary, so far we have the arithmetic average, which is too sensitive to large values, and the geometric average, which is too sensitive to small values. This makes the geometric average sensitive to "benchmark specials": a vendor having found a tricky way to make one observation extremely small would reap enormous benefits. As we saw earlier, the problem with the geometric average comes from the "discontinuity" of the logarithm at the origin. So long as the observations are away from zero the geometric average is well behaved. To try to stay away from zero, let us add a small quantity f to all the x_i and to g, which becomes g_f, defined by

    \log(g_f + f) = (\log(x_1 + f) + \log(x_2 + f) + \cdots + \log(x_n + f))/n

In this formula f is a fixed quantity independent of the observations. Although this formula looks identical to the "correction formula" portrayed in paragraph 5.4.1.2 of the TPC-D Draft 6.0 [1], it is different in the sense that, in the Draft, the quantity f is not fixed but a function of the observations. Since there are major drawbacks in using a value of f which is not fixed (see Appendix B), we confine ourselves to fixed values of f for the purpose of this analysis. We have coined the term f-displaced geometric average for the quantity defined by the above formula. The purpose of the displacement is to guard against benchmark specials. But then, what to choose for f? We have retained two candidates, f = 1/1000 and f = 1. The first choice is reminiscent of the max/min ratio in the "correction formula", and there is a good reason for the second choice, explained in Appendix A. Then there is the half-way average, called this way because it is a compromise, half-way between the simple mean and the geometric average (again see Appendix A). The half-way average s is defined by the equation:

    \sqrt{s} = (\sqrt{x_1} + \sqrt{x_2} + \cdots + \sqrt{x_n})/n

Dimensions of Value

The question we can ask ourselves next is: what are the properties which make a metric "good"? We have defined five dimensions of value so that we can assess the above averages, or any other metric, in terms of these dimensions. These are: (1) Ease To Explain, referring to the amount of difficulty one encounters when explaining the metric to non-mathematically oriented users; it is related to how intuitive the measure is (everybody understands the simple mean). (2) Meaning, which refers to the ability to translate the measure into something usable directly while doing database processing (e.g. a transaction rate). (3) Non-hypersensitivity To Extreme Values, which refers to the propensity of a metric not to be overwhelmed by certain values out of range. (4) Scalability, which refers to the property of a metric to be scaled by the same amount as the individual observations (e.g. if all values are divided by 2 then the metric is divided by 2). (5) Balance, which refers to the property of a metric to favor a relative decrease in a large value over the same relative decrease in a small value.

Ease To Explain and Meaning

These two dimensions are very closely related. Of all the measures considered, only the simple mean has meaning, since it easily translates into a number of database transactions a user could expect to perform in a unit of time. Of course the actual client workload may not be represented adequately, but this is a general problem, and customers can exercise responsibility and make sure they understand their workload and how it would affect their throughput. It will be difficult to explain any metric other than the simple mean. In the case of TPC-D, although all averages examined have the dimension of a query time, they don't have meaning because they cannot be translated into a number of transactions one can perform. They can, however, be effective for summary and comparison purposes.
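For readers who want to experiment, here is a small Python sketch of the averages introduced so far (the script and its function names are mine, not part of the original paper), applied to one of the hypothetical data sets used in Table 1 below:

    import math

    def geometric_mean(xs):
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    def displaced_geometric_mean(xs, f):
        # f-displaced geometric average: shift by a fixed f, average, shift back
        return math.exp(sum(math.log(x + f) for x in xs) / len(xs)) - f

    def halfway_mean(xs):
        # half-way average: the power average of order 1/2
        return (sum(math.sqrt(x) for x in xs) / len(xs)) ** 2

    data = [1, 1, 1, 2, 2, 2, 3, 3, 3, 0.003]    # "set 2" of Table 1 below
    print(geometric_mean(data))                   # ~0.96
    print(displaced_geometric_mean(data, 1.0))    # ~1.60
    print(displaced_geometric_mean(data, 0.001))  # ~0.99
    print(halfway_mean(data))                     # ~1.56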
Hypersensitivity to Extreme Values

The half-way average is not as sensitive as the simple mean to large values. For example, taking {1, 2, ..., 10} and bringing the 1 up to 1000, the simple mean is multiplied by 20 but the half-way average is multiplied by 6, while the geometric average is multiplied by 2 only. For small values, the geometric and the 1/1000-displaced geometric averages behave similarly.
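These factors can be checked directly; the following sketch (mine, not from the paper) recomputes the three ratios for the example just given:

    import math

    def mean(xs): return sum(xs) / len(xs)
    def gmean(xs): return math.exp(sum(math.log(x) for x in xs) / len(xs))
    def halfway(xs): return (sum(math.sqrt(x) for x in xs) / len(xs)) ** 2

    before = list(range(1, 11))          # {1, 2, ..., 10}
    after = [1000] + list(range(2, 11))  # the 1 brought up to 1000

    for f in (mean, gmean, halfway):
        print(f.__name__, f(after) / f(before))
    # mean    ~19.2 (roughly 20, as stated above)
    # gmean   ~2.0
    # halfway ~5.6  (roughly 6)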

Scalability

This property is important. All averages examined so far, except the f-displaced geometric average, scale with the observations; i.e. if all observations are multiplied (or divided) by a factor K then the average is also multiplied (or divided) by the same factor K. Non-scalability is the major drawback of the 1-displaced geometric average, which looks so good otherwise, especially considering the fact that if all observations are small then the 1-displaced geometric average is close to the simple mean.

Balance

What we have called here a balanced measure is one which rewards "working" on large observations. Of all the measures considered, the only unbalanced one is the geometric mean. In other words, with the geometric mean it does not pay to "work" on the system to bring the larger observations down, because a relative drop of, say, 10% in a large observation will produce the same drop in the geometric average as would a drop of 10% in a smaller observation. This is what the TPC-D subcommittee wanted to accomplish, and that is why the TPC-D metric is based on a geometric average. Table 1 and the graph under it show a comparison between the simple mean, the new measures and the geometric average in order to illustrate some of the points just mentioned.

TABLE 1. Comparison Between Arithmetic Average, Geometric Average, f-Displaced Geometric Average and Half-Way Average for Five Sets of Hypothetical Data

                set 1     set 2     set 3     set 4     set 5
    x_1         1         1         1         1         1
    x_2         1         1         1         1         1
    x_3         1         1         1         1         1
    x_4         2         2         2         2         2
    x_5         2         2         2         2         2
    x_6         2         2         2         2         2
    x_7         3         3         3         3         0.003
    x_8         3         3         3         0.003     0.003
    x_9         3         3         0.003     0.003     0.003
    x_10        3         0.003     0.003     0.003     0.003
    mean        2.1       1.8003    1.5006    1.2009    0.9012
    g_1         1.9804    1.5953    1.2600    0.96807   0.7138
    g           1.9105    0.9575    0.4799    0.2405    0.1205
    g_1/1000    1.9107    0.9850    0.5076    0.2613    0.1343
    s           2.008     1.5609    1.1699    0.8352    0.5568

[Figure: plot of the five statistics (mean, g_1, g, g_1/1000, s) for sets 1 through 5.]

In set 1 all observations are of the same order of magnitude and the values of the five statistics are very similar. In set 2 an observation has been brought down artificially through the use of a "benchmark special". The net effect on the geometric average is a division by 2 (corresponding to a doubling of the power, which is its inverse), while the 1-displaced average and the half-way average behave similarly to the simple mean and decrease moderately. The 1/1000-displaced geometric average also behaves essentially like the geometric average. As the number of observations which are brought down increases, the high sensitivity of the geometric average is exemplified, resulting in a doubling of the power every time, while the increase using the 1-displaced geometric average or the half-way average is moderate, reflecting more of an "additive" effect.
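The figures in Table 1 can be reproduced, up to rounding, with a short script; the one below is a sketch of mine (names and structure are not from the paper):

    import math

    def mean(xs): return sum(xs) / len(xs)
    def gmean(xs): return math.exp(sum(math.log(x) for x in xs) / len(xs))
    def displaced(xs, f): return math.exp(sum(math.log(x + f) for x in xs) / len(xs)) - f
    def halfway(xs): return (sum(math.sqrt(x) for x in xs) / len(xs)) ** 2

    base = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
    # set k replaces the last k observations of value 3 with 0.003
    sets = [base[:10 - k] + [0.003] * k for k in range(5)]

    for name, stat in [("mean", mean),
                       ("g_1", lambda xs: displaced(xs, 1.0)),
                       ("g", gmean),
                       ("g_1/1000", lambda xs: displaced(xs, 0.001)),
                       ("s", halfway)]:
        print(name, [round(stat(xs), 4) for xs in sets])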

Where do we go from here?

Table 2 portrays a summary of the analysis above. Based on the desired effect, what is important, and how this translates into the above dimensions of value, a choice can be made. The usefulness of what is discussed here is the focus on the real issues (the dimensions of value). In the case of TPC-D, for example, a major consideration was the unbalance (N in the Balance column). The resulting choice was the geometric average, with the advantages and drawbacks which come with it. But, instead of debating choices of metric types, people can debate the real issues and the relative importance of these issues. Table 2 can then assist in selecting a metric once the real issues and the choices on these issues are clear. My personal choice in a decision-support environment is the half-way average. On one hand I am afraid of benchmark specials, because they have a tendency to cast a doubt on the entire benchmark process, so I am afraid of the geometric average in spite of the very good arguments for it. On the other hand, I like the arithmetic average because it has meaning, but it is overwhelmed by large values. The half-way average is right in between: it is not sensitive to extreme values and it scales. Therefore it is the right choice.

TABLE 2. Comparison Summary for Considered Measures

    Measure                  Ease to   Meaning   Non-Hypersensitivity   Scalability   Balance
                             Explain             to Extreme Values
    current [1]              N         N         N                      N             N
    simple mean              Y         Y         N                      Y             Y
    g_1 (1-displaced)        N         N         Y                      N             Y
    g (geometric)            N         N         N                      Y             N
    g_1/1000                 N         N         N                      N             Y
    s (half-way)             Y/N       N         Y                      Y             Y

APPENDIX A

The ph-average. Given a set of n observations x_1, ..., x_n and their associated weights or frequencies f_1, ..., f_n (summing to 1), one can define a gamut of averages. A very broad range of such averages fall under the general category of ph-averages, defined as in [2]: given a monotonic function \varphi, the ph-average M_\varphi is given by the formula

    \varphi(M_\varphi) = \sum_i f_i \, \varphi(x_i)

The r-average (also known as the power average of order r). The r-average m_r is the special case \varphi: x \mapsto x^r and is given by

    m_r = \left( \sum_i f_i \, x_i^r \right)^{1/r}

Important subcases are r = 1 (the arithmetic average, often noted \bar{x}), r = -1 (the harmonic average), r = 0, which is a limit case yielding the geometric average g, and r = 1/2 (the half-way average s).
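A compact way to see how these definitions relate is to code them directly; the sketch below (mine, with equal weights assumed unless given) expresses the r-average through the ph-average machinery, handling r = 0 as the geometric limit:

    import math

    def phi_average(xs, phi, phi_inv, weights=None):
        # ph-average: phi(M) = sum_i f_i * phi(x_i), with the f_i summing to 1
        n = len(xs)
        w = weights or [1.0 / n] * n
        return phi_inv(sum(f * phi(x) for f, x in zip(w, xs)))

    def r_average(xs, r):
        # power average of order r; r = 0 is treated as the geometric average
        if r == 0:
            return phi_average(xs, math.log, math.exp)
        return phi_average(xs, lambda x: x ** r, lambda y: y ** (1.0 / r))

    data = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
    print(r_average(data, -1))   # harmonic, ~1.71
    print(r_average(data, 0))    # geometric, ~1.91
    print(r_average(data, 0.5))  # half-way, ~2.01
    print(r_average(data, 1))    # arithmetic, 2.1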

We now proceed to show that h \le g \le \bar{x}, where h denotes the harmonic average. First we show that g \le \bar{x} by noticing that x \mapsto \log x is concave and that, therefore,

    \log \left( \sum_i f_i x_i \right) \ge \sum_i f_i \log x_i

The above inequality merely expresses \log \bar{x} \ge \log g, which implies g \le \bar{x}. Rewriting it with y_i = 1/x_i gives the other half: the geometric average of the 1/x_i is 1/g and their arithmetic average is 1/h, so 1/g \le 1/h, i.e. h \le g.

The relationship between the harmonic average and the geometric average can be further exploited to show that the r-average is an increasing function of r. Taking the derivative with respect to r in the equation defining m_r yields

    \frac{dm_r}{dr} = \frac{m_r}{r^2} \left( -\log \sum_i f_i y_i + \frac{\sum_i f_i y_i \log y_i}{\sum_i f_i y_i} \right)

where y_i = x_i^r. Setting g_i = f_i y_i / \sum_j f_j y_j we can further reduce the above equation to

    \frac{dm_r}{dr} = \frac{m_r}{r^2} \left( -\log \sum_i f_i y_i + \sum_i g_i \log y_i \right)

Calling H the harmonic average of the y_i with weights g_i (one checks that \sum_i f_i y_i = H), the above equation can be rewritten as

    \frac{dm_r}{dr} = \frac{m_r}{r^2} \left( \sum_i g_i \log y_i - \log H \right)

which is positive, since the first term inside the parentheses is \log G, where G is the geometric average of the y_i with weights g_i, and G \ge H. Therefore the derivative of m_r with respect to r is positive, and m_r is an increasing function of r.

The geometric average as the 0-average. When r \to 0 we can write a^r = e^{r \log a} = 1 + r \log a + o(r) and thus

    m_r^r = 1 + r \log m_r + o(r) = \sum_i f_i x_i^r = \sum_i f_i + r \sum_i f_i \log x_i + o(r)

which simplifies into r \log m_r = r \sum_i f_i \log x_i + o(r), finally yielding

    \log m_0 = \sum_i f_i \log x_i

which is the definition of the geometric average, the ph-average for \varphi: x \mapsto \log x. As a result, the half-way average, corresponding to r = 1/2, is half-way between the geometric average (r = 0) and the simple average (r = 1).

The 1-displaced average. Consider a ph-average closely related to the geometric average, namely the a-displaced geometric average defined by

    \log(a + g_a) = \sum_i f_i \log(a + x_i)

where a is positive. Considering the 1-displaced geometric average from the point of view of its mathematical properties, we have already noticed that it falls into the category of ph-averages, corresponding to the function x \mapsto \log(a + x); but, in this family of functions (a being the parameter), the 1-displaced geometric average (corresponding to a = 1) plays the role of an anchor because it is the only one for which the value of the function is zero when the variable is equal to zero. This is why it was retained as a candidate.
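A quick numerical check of both facts (monotonicity in r and the r -> 0 limit), assuming equal weights; the script is mine, not from the paper:

    import math

    def r_average(xs, r):
        n = len(xs)
        if r == 0:
            return math.exp(sum(math.log(x) for x in xs) / n)
        return (sum(x ** r for x in xs) / n) ** (1.0 / r)

    data = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
    for r in (-1, -0.1, -0.01, 0, 0.01, 0.1, 0.5, 1):
        print(r, round(r_average(data, r), 4))
    # the values increase with r, and the r = -0.01 and r = 0.01 values
    # are already very close to the geometric average (r = 0)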

Relationship between the arithmetic, geometric and f-displaced geometric averages. The geometric average can be denoted g_0 for consistency. It could also be denoted m_0, since it is also the 0-average, and for this reason, since we know that the r-average is an increasing function of r, we have g_0 \le \bar{x}. We now show that g_a lies between g_0 and \bar{x}.

First, notice that x \mapsto \log(a + x) is concave and that therefore

    \log \left( a + \sum_i f_i x_i \right) \ge \sum_i f_i \log(a + x_i)

Hence \log(a + \bar{x}) \ge \log(a + g_a) and, since x \mapsto \log(a + x) is also monotonic increasing, g_a \le \bar{x}; this establishes the first part of the inequality.

To establish the second part of the inequality, consider the a-displaced geometric average g_a defined by

    \log(a + g_a) = \sum_i f_i \log(a + x_i)

a + g_a can also be interpreted as G(y), the geometric average of the y_i defined by y_i = a + x_i. Taking the derivative with respect to a in the equation defining g_a yields

    \frac{1 + dg_a/da}{a + g_a} = \sum_i \frac{f_i}{a + x_i}

which can be rewritten as

    1 + \frac{dg_a}{da} = G(y) \sum_i \frac{f_i}{y_i} = \frac{G(y)}{H(y)}

Notice that the equation 1/H(y) = \sum_i f_i / y_i defines H(y), the harmonic average of the y_i, and that H(y) \le G(y) (remember that the harmonic average is the (-1)-average while the geometric average plays the role of the 0-average, and that the r-average is an increasing function of r). Therefore

    \frac{dg_a}{da} = \frac{G(y)}{H(y)} - 1 \ge 0

ensuring that g_a is increasing in a, and therefore g_0 \le g_a. This also establishes that the power metric of [1] computed with the correction factor (the inverse of g_f) is smaller than the value computed without it (the inverse of g_0).
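Numerically, with equal weights, one can watch g_a climb from the geometric average toward the simple mean as a grows (a sketch of mine, not from the paper):

    import math

    def displaced(xs, a):
        # a-displaced geometric average: log(a + g_a) = mean of log(a + x_i)
        return math.exp(sum(math.log(x + a) for x in xs) / len(xs)) - a

    data = [1, 1, 1, 2, 2, 2, 3, 3, 3, 0.003]
    print(sum(data) / len(data))     # simple mean, ~1.80
    print(displaced(data, 0.0))      # plain geometric average, ~0.96
    for a in (0.001, 0.01, 0.1, 1.0, 10.0, 100.0):
        print(a, round(displaced(data, a), 4))
    # the displaced average increases with a, staying between the
    # geometric average and the simple mean, which it approaches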

Variations

A very important point is related to optimization. What should a vendor do in order to obtain a better score? In other words, are there any guidelines to improve the performance? The question here is: how do the various averages reward decreases in the individual contributors depending on their relative size? From the definition of the r-average,

    m_r^r = \sum_j f_j x_j^r

we can determine the change in the r-average as a function of the change in one query time x_i, assuming all the others stay fixed. Differentiating the above equation with respect to x_i one obtains

    r \, m_r^{r-1} \, dm_r = r \, f_i \, x_i^{r-1} \, dx_i

which yields

    \frac{dm_r / m_r}{dx_i / x_i} = \frac{f_i \, x_i^r}{\sum_j f_j x_j^r}

which links the relative variation of an individual query time to the relative variation of the measure of central tendency. The term on the left side of the above equation is referred to as the "elasticity" in economics. Whenever r > 0 we have x_1 > x_2 \Rightarrow x_1^r > x_2^r and therefore the elasticity of m_r with respect to x_i is an increasing function of x_i. In other words, the bigger x_i, the bigger the relative decrease of m_r for a given relative decrease dx_i / x_i. This is true in particular for the half-way average corresponding to r = 1/2 (and for the simple mean, which we knew already). It is not true for the geometric average, which treats large and small observations equally. For the displaced geometric averages the argument is similar. Starting with the definition of the a-displaced geometric average

    \log(a + g_a) = \sum_i f_i \log(a + x_i)

and taking the differential on both sides, assuming that only x_i varies,

    \frac{dg_a}{g_a} = \frac{a + g_a}{g_a} \, f_i \, \frac{x_i}{a + x_i} \, \frac{dx_i}{x_i}

Assuming that all the weights f_i are equal, noticing that (a + g_a)/g_a is a constant, and that the function x \mapsto x/(a + x) is monotonic increasing, it is clear that the larger x_i, the larger the relative change of g_a for a given relative change of x_i. Therefore the a-displaced average is a balanced measure.
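The balance property can also be seen empirically; the sketch below (mine, not from the paper) reduces a small and a large observation by 10% and compares the effect on the geometric and half-way averages:

    import math

    def gmean(xs): return math.exp(sum(math.log(x) for x in xs) / len(xs))
    def halfway(xs): return (sum(math.sqrt(x) for x in xs) / len(xs)) ** 2

    data = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

    def relative_drop(stat, xs, i):
        # relative drop in the statistic when observation i is reduced by 10%
        improved = xs[:i] + [0.9 * xs[i]] + xs[i + 1:]
        return 1 - stat(improved) / stat(xs)

    print(relative_drop(gmean, data, 0), relative_drop(gmean, data, 9))
    # ~1.05% in both cases: the geometric average is indifferent
    print(relative_drop(halfway, data, 0), relative_drop(halfway, data, 9))
    # ~0.7% for the small observation vs ~1.25% for the large one:
    # the half-way average rewards working on the large observation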

APPENDIX B

As we saw, the formula for the f-displaced average is

    \log(g_f + f) = (\log(x_1 + f) + \log(x_2 + f) + \cdots + \log(x_n + f))/n

The quantity g_f defined above is used in the definition of the power metric when the ratio between the max and the min query times is larger than 1000; f is defined as the maximum query time divided by 1000. As shown in Appendix A, g_f is always larger than g (the geometric average) no matter what the value of f is, as long as it is positive. This property is used to penalize a vendor who would have a "benchmark special" resulting in one query time disproportionally small compared to the other query times.

As a result, there are two formulas for the power metric in [1]: one is used when the max over min query time ratio is larger than 1000 and the other when it is not. Table 3 shows sample data illustrating the difficulties associated with this "dual formula" situation, which results in a lack of "continuity". First, notice set 2, which illustrates the point that the simple mean is very sensitive to large out-of-scope values, and set 3, which illustrates the point that the geometric average g is very sensitive to small values. Set 4 involves a marginal case where the max/min ratio is exactly 1000 and thus the geometric average is used (value 2.86). In set 3, the max/min ratio is larger than 1000 and g_f is used (value 2.88). 2.88 is very close to 2.86, but the problem is that set 3 is "better" than set 4 and yet the power (inverse of the geometric average) decreases!

TABLE 3. Variations of the Current Metric for Sample Data

                set 1     set 2     set 3     set 4     set 5
    x_1         1         1000      0.001     0.01      0.01
    x_2         2         2         2         2         2
    x_3         3         3         3         3         3
    x_4         4         4         4         4         4
    x_5         5         5         5         5         5
    x_6         6         6         6         6         6
    x_7         7         7         7         7         7
    x_8         8         8         8         8         8
    x_9         9         9         9         9         9
    x_10        10        10        10        10        10.1
    mean        5.5       105.4     5.4001    5.401     5.411
    g           4.528     9.036     2.270     2.857     2.860
    g_f         -         -         2.880     -         3.060

The situation is slightly different when comparing set 4 and set 5. In set 5 the max/min ratio is higher than 1000, so g_f is used with f = 10.1/1000 = 0.0101, yielding the value 3.06, compared to set 4, which yields 2.86 with the geometric average. Looking at set 5 in isolation one can see that using the "corrected" metric does penalize (3.06 vs. 2.86), but comparing set 5 and set 4 there is a big drop from set 5 to set 4 (about 7%) even though set 4 and set 5 are almost the same. Actually, if we had used the geometric average, we would have concluded that set 4 and set 5 were about the same (2.86). Indeed, the only real difference between set 4 and set 5 is that the formula has changed! Based on the recommendation of this author, the correction formula has been abandoned by the TPC-D subcommittee.

REFERENCES

[1] TPC Benchmark D (Decision Support), Working Draft 6.0, August 1993, edited by François Raab, Transaction Processing Performance Council.

[2] Calot, G., Cours de Statistique Descriptive, 1964, Dunod, Paris.