Summarizing Data
Daniel A. Menascé, Ph.D.
Dept. of Computer Science
George Mason University

Major Properties of Numerical Data
Central tendency: arithmetic mean, geometric mean, median, mode.
Variability: range, interquartile range, variance, standard deviation, coefficient of variation, mean absolute deviation.
Skewness: coefficient of skewness.
Kurtosis.

Measures of Central Tendency
Arithmetic mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Based on all observations.
Greatly affected by extreme values.

Effect of Outliers on Average
...4.4.8.8.9.9.3.3.4.4.8.8 3. 3. 3.4 3.4 3.8 3.8 0.3 3.5
Average 3..5
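
The sensitivity of the arithmetic mean to a single extreme value can be illustrated with a short sketch. The numbers below are hypothetical values chosen for easy arithmetic; they are not the data from the slide.

```python
from statistics import mean

# Hypothetical measurements, chosen for illustration (not the slide's data).
values = [1.0, 2.0, 3.0, 4.0, 5.0]
values_with_outlier = values + [45.0]   # one extreme observation added

mean_without = mean(values)               # (1 + 2 + 3 + 4 + 5) / 5
mean_with = mean(values_with_outlier)     # (15 + 45) / 6

print(f"mean without outlier: {mean_without}")
print(f"mean with outlier:    {mean_with}")
```

A single added observation moves the mean well outside the range of the other five values.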

Geometric Mean
Geometric mean: $\left(\prod_{i=1}^{n} X_i\right)^{1/n}$
Used when the product of the observations is of interest.
Important when multiplicative effects are at play:
  Cache hit ratios at several levels of cache.
  Percentage performance improvements between successive versions.
  Performance improvements across protocol layers.

Example of Geometric Mean
Columns: Test Number; Performance Improvement (Operating System, Middleware, Application); Avg. Performance Improvement per Layer
.8.3.0.7.5.9.5.3 3.0..0.7 4..8..7 5.30.3.5.3 6.4.7.. 7..8.4.8 8.9.9.3.0 9.30..5. 0..5.8.8
Average Performance Improvement per Layer.0
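
A minimal sketch of the per-layer calculation, assuming three hypothetical improvement factors (one per layer) rather than the values from the table. It computes the geometric mean directly from the formula and compares it with `statistics.geometric_mean` from the Python standard library (Python 3.8 or later).

```python
import math
from statistics import geometric_mean

# Hypothetical improvement factors for three layers (not the slide's data).
layer_factors = [1.25, 1.6, 2.0]

# Direct use of the formula: n-th root of the product of the observations.
gm_by_formula = math.prod(layer_factors) ** (1 / len(layer_factors))

# Same quantity from the standard library.
gm_by_library = geometric_mean(layer_factors)

print(f"overall improvement:        {math.prod(layer_factors):.3f}")
print(f"average improvement/layer:  {gm_by_formula:.4f}")
```

Applying the average factor once per layer reproduces the overall improvement, which is why the geometric mean, not the arithmetic mean, is the natural average for multiplicative effects.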

Properties of the Geometric Mean
$\mathrm{gm}\left(\frac{x_1}{y_1}, \ldots, \frac{x_n}{y_n}\right) = \frac{\mathrm{gm}(x_1, \ldots, x_n)}{\mathrm{gm}(y_1, \ldots, y_n)} = \frac{1}{\mathrm{gm}(y_1/x_1, \ldots, y_n/x_n)}$
The choice of the base does not change the conclusion.
Useful for benchmarks:
  x: throughput on target system.
  y: throughput on base system.

Median
Middle value in an ordered set of data.
If there are no ties, 50% of the values are smaller than the median and 50% are larger.
...4.4.8.8.9.9.3.3.4.4.8.8 3. 3. 3.4 3.4 3.8 3.8 0.3 3.5
Median.4.4

Median
The median is unaffected by extreme values.
Obtaining the median (observations in sorted order, n = sample size):
  Odd-sized samples: $X_{(n+1)/2}$
  Even-sized samples: $\frac{X_{n/2} + X_{(n/2)+1}}{2}$

Mode
Most frequently occurring value.
Mode may not exist.
Single-mode distributions: unimodal.
Distributions with two modes: bimodal.
[Figure: a unimodal distribution and a bimodal distribution]
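
The rank rules for the median and the notion of one or several modes can be sketched as follows. The helper `median_by_rank` and all data are hypothetical, introduced only for illustration; `statistics.median` and `statistics.multimode` are standard-library functions (Python 3.8 or later).

```python
from statistics import median, multimode

def median_by_rank(values):
    """Median from the sorted ranks: X_((n+1)/2) for odd n,
    the average of X_(n/2) and X_(n/2+1) for even n (1-based ranks)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]          # convert 1-based rank to index
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Hypothetical data (not from the slides).
odd_sample = [7, 1, 3, 100, 5]      # the extreme value 100 does not move the median
even_sample = [4, 1, 3, 2]
bimodal = [1, 2, 2, 3, 3, 4]

print(median_by_rank(odd_sample), median(odd_sample))
print(median_by_rank(even_sample), median(even_sample))
print(multimode(bimodal))           # all values tied for the highest frequency
```

When every value occurs equally often, `multimode` returns all of them, which corresponds to the case where no distinct mode exists.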

Quantiles (quartiles, percentiles) and midhinge
Quartiles: split the data into quarters.
  First quartile (Q1): value of Xi such that 25% of the observations are smaller than Xi.
  Second quartile (Q2): value of Xi such that 50% of the observations are smaller than Xi.
  Third quartile (Q3): value of Xi such that 75% of the observations are smaller than Xi.
Percentiles: split the data into hundredths.
Midhinge: $\text{Midhinge} = \frac{Q_3 + Q_1}{2}$

Example of Quartiles
.05 Q.3.06 Q.8.09 Q3 3.00.9 Midhinge.6..8.34.34.77.80.83.5..7.6.67.77.83 3.5 3.77 5.76 5.78 3.07 44.9
In Excel:
  Q1=PERCENTILE(<array>,0.25)
  Q2=PERCENTILE(<array>,0.5)
  Q3=PERCENTILE(<array>,0.75)
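
A sketch of the same calculations in Python, on hypothetical data. It assumes that the `"inclusive"` method of `statistics.quantiles` (linear interpolation between the minimum and maximum) is the convention that corresponds to Excel's PERCENTILE; other quantile conventions give slightly different values on small samples.

```python
from statistics import quantiles

# Hypothetical observations (not the slide's data).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
midhinge = (q1 + q3) / 2

# Cut points that split the data into hundredths; index 79 is the 80th percentile.
p80 = quantiles(data, n=100, method="inclusive")[79]

print(f"Q1={q1}, Q2={q2}, Q3={q3}, midhinge={midhinge}, 80th percentile={p80}")
```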

Example of Percentile
.05 80-percentile 3.6300.06.09.9..8.34.34.77.80.83.5..7.6.67.77.83 3.5 3.77 5.76 5.78 3.07 44.9
In Excel: p-th percentile=PERCENTILE(<array>,p) (0 ≤ p ≤ 1)

Range, Interquartile Range, Variance, and Standard Deviation
Range: $X_{\max} - X_{\min}$
Interquartile range: $Q_3 - Q_1$; not affected by extreme values.
Variance: $s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$
  In Excel: s^2=VAR(<array>)
Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$
  In Excel: s=STDEV(<array>)
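
The four measures of variability can be computed directly from their definitions, as in the sketch below. The data are hypothetical, and the quartiles again assume the `"inclusive"` interpolation method of `statistics.quantiles`. The library functions `variance` and `stdev` use the same n - 1 denominator as the formulas above.

```python
from math import sqrt
from statistics import mean, quantiles, stdev, variance

# Hypothetical observations (not the slide's data).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
x_bar = mean(data)

data_range = max(data) - min(data)

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Sample variance and standard deviation, written out from the definitions.
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = sqrt(s2)

print(f"range={data_range}, IQR={iqr}, s^2={s2:.4f}, s={s:.4f}")
print(f"library check: variance={variance(data):.4f}, stdev={stdev(data):.4f}")
```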

Meanings of the Variance and Standard Deviation
The larger the spread of the data around the mean, the larger the variance and standard deviation.
If all observations are the same, the variance and standard deviation are zero.
The variance and standard deviation cannot be negative.
Variance is measured in the square of the units of the data.
Standard deviation is measured in the same units as the data.

Coefficient of Variation
Coefficient of variation (COV): $s / \bar{X}$
No units.
.05 S 9.50.06 Average 9.5.09 COV 3.0.9..8.34.34.77.80.83.5..7.6.67.77.83 3.5 3.77 5.76 5.78 3.07 44.9
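
Because the coefficient of variation is a ratio of two quantities in the same units, rescaling the data leaves it unchanged. The sketch below checks this on hypothetical response times expressed first in seconds and then in milliseconds; the helper name `coefficient_of_variation` is an assumption made for this example.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (unitless)."""
    return stdev(values) / mean(values)

# Hypothetical response times (not the slide's data).
times_sec = [8.0, 10.0, 12.0]
times_msec = [1000.0 * t for t in times_sec]   # same data in different units

cov_sec = coefficient_of_variation(times_sec)
cov_msec = coefficient_of_variation(times_msec)

print(f"COV in seconds: {cov_sec:.3f}, COV in milliseconds: {cov_msec:.3f}")
```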

Coefficient of Skewness
Coefficient of skewness: $\frac{1}{n s^3}\sum_{i=1}^{n}(X_i - \bar{X})^3$
Columns: Xi; (Xi-Xbar)^3
.05-606..06-60.9.09-596..9-575.. -57.8.8-557.9.34-546.4.34-544.8.77-464.5.80-458..83-453..5-398.9. -388.8.7-379.0.6-38.5.67-30.5.77-306.6.83-98.7 3.5-5.9 3.77-89.6 5.76-5.9 5.78-5. 3.07 476.6 44.9 48007.
4.033

Mean Absolute Deviation
Mean absolute deviation: $\frac{1}{n}\sum_{i=1}^{n} |X_i - \bar{X}|$
Columns: Xi; abs(Xi-Xbar)
.05 8.46 Average 9.5.06 8.45 Mean absolute deviation 3.6.09 8.4.9 8.3. 8.30.8 8.3.34 8.8.34 8.7.77 7.74.80 7.7.83 7.68.5 7.36. 7.30.7 7.4.6 6.90.67 6.84.77 6.74.83 6.68 3.5 6.00 3.77 5.74 5.76 3.75 5.78 3.73 3.07.56 44.9 35.39 35.90
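
Both quantities follow directly from their definitions. The sketch below assumes that s is the sample standard deviation (n - 1 denominator) and that the sum of cubed deviations is divided by n times s cubed; statistical packages often apply a different small-sample correction, so their skewness values can differ. The function names and data are hypothetical.

```python
from statistics import mean, stdev

def skewness(values):
    """Sum of cubed deviations from the mean, divided by n * s^3."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)                      # sample standard deviation
    return sum((x - x_bar) ** 3 for x in values) / (n * s ** 3)

def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    x_bar = mean(values)
    return sum(abs(x - x_bar) for x in values) / len(values)

# Hypothetical data (not the slide's data).
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 2.0, 3.0, 4.0, 10.0]   # one large value stretches the right tail

print(f"skewness (symmetric):    {skewness(symmetric):.4f}")
print(f"skewness (right-skewed): {skewness(right_skewed):.4f}")
print(f"mean absolute deviation: {mean_absolute_deviation(right_skewed):.2f}")
```

A positive coefficient indicates a longer right tail, a negative one a longer left tail, and a value near zero a roughly symmetric sample.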

Shapes of Distributions
[Figure: three distribution shapes with the mode, median, and mean marked: a right-skewed distribution; a symmetric distribution, in which mode, median, and mean coincide; and a left-skewed distribution]

Confidence Interval for the Mean
The sample mean is an estimate of the population mean.
Problem: given k samples of the population (with k sample means), get a single estimate of the population mean.
Only probabilistic statements can be made:

Confidence Interval for the Mean
$\Pr[c_1 \le \mu \le c_2] = 1 - \alpha$
where
  $(c_1, c_2)$: confidence interval
  $100(1-\alpha)$: confidence level (usually 90 or 95%)
  $1-\alpha$: confidence coefficient.

Central Limit Theorem
If the observations in a sample are independent and come from the same population that has mean µ and standard deviation σ, then the sample mean for large samples has a normal distribution with mean µ and standard deviation $\sigma/\sqrt{n}$.
The standard deviation of the sample mean is called the standard error.
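
The statement about the standard error can be checked empirically with a small simulation. This is an illustrative sketch under assumed settings: the population is taken to be uniform on [0, 1) (mean 0.5, standard deviation sqrt(1/12)), and the sample size, number of samples, and random seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)           # fixed seed so the run is repeatable

n = 30                    # observations per sample (assumed)
num_samples = 2000        # number of independent samples (assumed)
mu = 0.5                  # mean of Uniform[0, 1)
sigma = sqrt(1 / 12)      # standard deviation of Uniform[0, 1)

# Draw many samples and record the mean of each one.
sample_means = [
    mean([random.random() for _ in range(n)])
    for _ in range(num_samples)
]

predicted_se = sigma / sqrt(n)        # standard error predicted by the theorem
observed_se = stdev(sample_means)     # spread actually seen across sample means

print(f"average of sample means: {mean(sample_means):.4f} (population mean {mu})")
print(f"std of sample means:     {observed_se:.4f} (predicted {predicted_se:.4f})")
```

The sample means cluster around the population mean, and their spread is close to the predicted standard error even though the underlying population is not normal.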
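
Central Limit Theorem
[Figure: a population of N values (population mean = µ, population std deviation = σ) from which M samples of n values each are drawn: sample 1, sample 2, ..., sample M, with sample means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_M$]
Average of $\bar{x}_1, \ldots, \bar{x}_M$ = µ
Standard deviation of $\bar{x}_1, \ldots, \bar{x}_M$ = $\sigma/\sqrt{n}$

Confidence Interval
100(1-α)% confidence interval for the population mean:
$\left(\bar{x} - z_{1-\alpha/2}\, s/\sqrt{n},\ \bar{x} + z_{1-\alpha/2}\, s/\sqrt{n}\right)$
  $\bar{x}$: sample mean
  s: sample standard deviation
  n: sample size
  $z_{1-\alpha/2}$: (1-α/2)-quantile of a unit normal variate (N(0,1)).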
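
A minimal sketch of the interval computation, using `statistics.NormalDist().inv_cdf` from the Python standard library in place of a normal table. The function name `mean_confidence_interval` and the CPU times are hypothetical. Because the formula uses a normal quantile, the sketch assumes a large sample, as in the Central Limit Theorem.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, alpha):
    """100(1 - alpha)% confidence interval for the population mean,
    based on the normal quantile z_(1 - alpha/2)."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                            # sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)      # about 1.645 for alpha = 0.10
    half_width = z * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical CPU times in msec (not the slide's data).
cpu_times = [2.0 + 0.1 * (i % 10) for i in range(40)]

low90, high90 = mean_confidence_interval(cpu_times, alpha=0.10)
low95, high95 = mean_confidence_interval(cpu_times, alpha=0.05)

print(f"90% confidence interval: ({low90:.3f}, {high90:.3f})")
print(f"95% confidence interval: ({low95:.3f}, {high95:.3f})")
```

Raising the confidence level from 90% to 95% widens the interval, since a larger quantile multiplies the same standard error.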

Example of Confidence Interval Computation
CPU Time (msec) 5.76 4.67 sample mean 4.5 3.77 sample std 7.56.7 alpha 0..83 conf level 90.05 -(alpha/) 0.95.6 z0.95.645 from a Normal Table.06 5.78 c.97 3.5 c 7.04.77.83 With 90% confidence the population mean.77 is in the interval.97 7.04.9. 4.80.80.34.8..5.09.34 3.07

Descriptive Statistics (from Excel Analysis Pack)
From Excel: Tools > Data Analysis > Descriptive Statistics
  Mean                      9.50589
  Standard Error            6.03
  Median                    .80555
  Mode                      #N/A
  Standard Deviation        9.49833
  Sample Variance           870.55
  Kurtosis                  .650
  Skewness                  4.594
  Range                     43.857
  Minimum                   .04793
  Maximum                   44.905
  Sum                       8.54
  Count                     4
  Confidence Level(95.0%)   .45604

Box-and-Whisker Plot
Graphical representation of data through a five-number summary.
I/O Time (msec): 8.04 9.96 5.68 6.95 8.8 0.84 4.6 4.8 8.33 7.58 7.4 7.46 8.84 5.73 6.77 7. 8.5 5.39 6.4 7.8.74 6.08
Five-number Summary
  Minimum 4.6
  First Quartile 6.08
  Median 7.35
  Third Quartile 8.33
  Maximum.74
50% of the data lies in the box.
[Figure: box-and-whisker plot of the I/O times, marked at the five values of the summary]
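
The five numbers behind a box-and-whisker plot can be computed as in the sketch below. The helper name `five_number_summary` and the data are hypothetical, and the quartiles assume the `"inclusive"` interpolation method of `statistics.quantiles`; plotting tools may use a different quartile convention.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical I/O times in msec (not the slide's data).
io_times = [3.0, 7.0, 8.0, 5.0, 12.0, 14.0, 21.0, 13.0, 18.0]

summary = five_number_summary(io_times)
minimum, q1, med, q3, maximum = summary

# The box spans Q1 to Q3; the whiskers extend to the minimum and maximum.
in_box = [t for t in io_times if q1 <= t <= q3]

print(summary)
print(f"{len(in_box)} of {len(io_times)} observations fall inside the box")
```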