Summarizing Measured Data Dr. John Mellor-Crummey Department of Computer Science Rice University johnmc@cs.rice.edu COMP 528 Lecture 7 3 February 2005
Goals for Today
  Finish discussion of the Normal distribution and its properties
  Finish material on summarizing measured data
  Solve a problem using a PMF
Normal Distribution
  N(µ,σ): the most commonly used distribution in data analysis (also known as a Gaussian distribution)
    pdf: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$, $-\infty \le x \le \infty$
    µ = mean, σ = std dev
  N(µ=0,σ=1): the unit normal distribution
    pdf: $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Quantile, Percentile, Median & Mode
  α-quantile: the x value at which the CDF takes value α, denoted $x_\alpha$
    $P(x \le x_\alpha) = F(x_\alpha) = \alpha$
  100α-percentile: the x value at which the CDF takes value α
  Median = 50-percentile = .5-quantile
  Mode = most likely value
    for a discrete variable, the $x_i$ that has the highest probability
    for a continuous variable, the x where the pdf is maximum
Quantiles of the Normal Distribution
  $z_\alpha$: α-quantile of the unit normal variate z ~ N(0,1)
  If x has a normal distribution, x ~ N(µ,σ), then
    $P\left(\frac{x-\mu}{\sigma} \le z_\alpha\right) = \alpha$
  or equivalently,
    $P(x \le \mu + \sigma z_\alpha) = \alpha$
  [Figure: PDF and CDF of N(0,1), marking the .5-quantile (50-percentile) and the .8-quantile (80-percentile)]
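The relation $x_\alpha = \mu + \sigma z_\alpha$ can be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later (for `statistics.NormalDist`); the helper name `normal_quantile` is made up for illustration.

```python
from statistics import NormalDist


def normal_quantile(alpha, mu=0.0, sigma=1.0):
    """Return x_alpha = mu + sigma * z_alpha for x ~ N(mu, sigma).

    normal_quantile is a hypothetical helper, not part of any library.
    """
    # z_alpha is the alpha-quantile of the unit normal N(0,1).
    z_alpha = NormalDist(0.0, 1.0).inv_cdf(alpha)
    return mu + sigma * z_alpha


if __name__ == "__main__":
    print(normal_quantile(0.5))                       # median of N(0,1)
    print(normal_quantile(0.8))                       # .8-quantile of N(0,1)
    print(normal_quantile(0.8, mu=10.0, sigma=2.0))   # .8-quantile of N(10,2)
```

Scaling the unit normal quantile this way should agree with asking `NormalDist(mu, sigma).inv_cdf(alpha)` directly.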
Properties of the Normal Distribution
  Linearity: a weighted sum of n independent normal variates is a normal variate
    if $x_i \sim N(\mu_i, \sigma_i)$, then $x = \sum_{i=1}^{n} a_i x_i$ has a normal distribution with mean and variance
      $\mu = \sum_{i=1}^{n} a_i \mu_i$
      $\sigma^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2$
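As a small illustration of linearity, the sketch below computes the mean and standard deviation of a weighted sum of independent normal variates, using mean $\sum a_i \mu_i$ and variance $\sum a_i^2 \sigma_i^2$. The function name and the sample numbers are assumptions chosen for illustration.

```python
import math


def linear_combination(coeffs, means, std_devs):
    """Mean and std dev of x = sum(a_i * x_i) for independent x_i ~ N(mu_i, sigma_i).

    linear_combination is a hypothetical helper name.
    """
    mean = sum(a * mu for a, mu in zip(coeffs, means))
    # Variances (not standard deviations) add, each scaled by a_i squared.
    variance = sum((a * s) ** 2 for a, s in zip(coeffs, std_devs))
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    # Illustrative values only: x_1 ~ N(1, 3), x_2 ~ N(2, 4), a_1 = a_2 = 1.
    print(linear_combination([1, 1], [1, 2], [3, 4]))
```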
Central Limit Theorem
  The sum of a large number of independent observations from any distribution tends to have a normal distribution
    holds regardless of the distribution the observations come from, provided its variance is finite
    thus experimental errors, which arise from many factors, are modeled with a normal distribution
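A quick simulation can make the theorem concrete. The sketch below, under the assumption that each observation is uniform on [0, 1), sums 30 observations many times and checks that about 68% of the sums fall within one standard deviation of their mean, as a normal distribution would predict. The function name, seed, and sample sizes are arbitrary choices for illustration.

```python
import random
import statistics


def clt_demo(n_terms=30, n_samples=20000, seed=0):
    """Sum n_terms uniform(0,1) draws, n_samples times; summarize the sums.

    clt_demo and its defaults are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    sums = [sum(rng.random() for _ in range(n_terms)) for _ in range(n_samples)]
    mean = statistics.fmean(sums)
    std = statistics.pstdev(sums)
    # Fraction of sums within one standard deviation of the mean;
    # for a normal distribution this is roughly 0.683.
    within = sum(abs(s - mean) <= std for s in sums) / len(sums)
    return mean, std, within


if __name__ == "__main__":
    print(clt_demo())
```

For uniform(0,1) terms the sum has mean n/2 and variance n/12, so with 30 terms the sums should center near 15 with a standard deviation near 1.58.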
Means and Their Uses
Arithmetic Mean
  Arithmetic mean of values $\{x_1, x_2, \ldots, x_n\}$
    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  Caution: the arithmetic mean is not always an appropriate index of central tendency
    Is the data categorical? yes: use mode; no: continue
    Is the total of interest? yes: use mean; no: continue
    Is the distribution skewed? yes: use median; no: use mean
  Median = 50th percentile value
  Mode = most frequent value, e.g. most frequent destination for packets
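The selection questions on this slide can be written as a small decision procedure. This is a sketch of one reading of the chart (categorical data leads to the mode, a total of interest leads to the mean, skew leads to the median, otherwise the mean); the function name and its boolean parameters are assumptions.

```python
def central_tendency_index(categorical, total_of_interest, skewed):
    """Pick an index of central tendency by asking the three questions in order.

    Hypothetical helper; the yes-branches reflect an assumed reading of the chart.
    """
    if categorical:
        return "mode"      # categorical data has no meaningful mean or median
    if total_of_interest:
        return "mean"      # the mean preserves the total (n * mean = sum)
    if skewed:
        return "median"    # the median is robust to a long tail
    return "mean"


if __name__ == "__main__":
    # e.g. packet destinations are categorical
    print(central_tendency_index(True, False, False))
    # e.g. skewed response times where the total is not of interest
    print(central_tendency_index(False, False, True))
```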
Common Misuses of Arithmetic Means
  Mean of significantly different values
    correct index, but useless nonetheless
    not useful: mean CPU time is 505ms when the values are 10ms and 1000ms
  Using the mean without considering skew
    if variability is too large, the mean may not be a representative value
    e.g. mean({5,5,5,4,31}) = 10: typical value is 5, mean is useless
  Multiplying arithmetic means to get the mean of a product
    the mean of a product of random variables equals the product of the means when the variables are independent, but not in general
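The two numeric examples on this slide are easy to reproduce with Python's standard `statistics` module. The sketch below is only an illustration; the `summarize` helper is a made-up name.

```python
import statistics


def summarize(values):
    """Return mean, median and mode of a sample (hypothetical helper)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


if __name__ == "__main__":
    # Skewed sample from the slide: the mean is pulled up by the single value 31.
    print(summarize([5, 5, 5, 4, 31]))
    # Two very different CPU times (ms): the mean matches neither observation.
    print(statistics.mean([10, 1000]))
```

For the skewed sample, the median and mode both report the typical value 5 while the mean reports 10.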
Geometric Mean
  Geometric mean of a sample $\{x_1, x_2, \ldots, x_n\}$
    geometric mean $= \left(\prod_{i=1}^{n} x_i\right)^{1/n}$
  Arithmetic mean vs. geometric mean
    geometric: if the product of terms is of interest
    arithmetic: if the sum of observations is of interest
  Examples of metrics that work in a multiplicative manner
    cache miss ratios over several levels of cache
      L3misses = Loads * L1missrate * L2missrate * L3missrate
      Avg miss rate per level = (L1missrate * L2missrate * L3missrate)^(1/3)
    percentage improvement between successive versions
    average error rate per hop in a multi-hop network
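The cache example can be worked through in a few lines. In the sketch below the miss rates and load count are invented numbers, and `geometric_mean` is a hand-written helper rather than a library call; the point is that the per-level average, applied at each of the three levels, reproduces the same number of L3 misses.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Hypothetical miss rates for L1, L2, L3 and a hypothetical load count.
    miss_rates = [0.10, 0.20, 0.40]
    loads = 1_000_000

    avg_rate = geometric_mean(miss_rates)
    l3_misses = loads * math.prod(miss_rates)

    print(avg_rate)                   # average miss rate per level
    print(l3_misses)                  # Loads * L1missrate * L2missrate * L3missrate
    print(loads * avg_rate ** 3)      # same count using the per-level average
```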
Harmonic Mean
  Harmonic mean of a sample $\{x_1, x_2, \ldots, x_n\}$
    harmonic mean $= \frac{n}{1/x_1 + 1/x_2 + \cdots + 1/x_n}$
  Use whenever an arithmetic mean can be justified for $1/x_i$
  Example: MIPS rate
    suppose a benchmark has m million instructions
    MIPS rate $x_i$ from the ith repetition is $m/t_i$
    avg. time: use the arithmetic mean, since avg. time has physical meaning
    avg. MIPS for multiple runs of one benchmark: harmonic mean (the sum of $1/x_i$ has physical meaning)
      $\frac{n}{\frac{1}{m/t_1} + \frac{1}{m/t_2} + \cdots + \frac{1}{m/t_n}} = \frac{m}{\frac{1}{n}(t_1 + t_2 + \cdots + t_n)}$
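The identity in the MIPS example can be verified numerically: the harmonic mean of the per-run rates $m/t_i$ equals m divided by the arithmetic mean of the run times. The sketch below uses invented values for m and the times, and the helper names are assumptions.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals of n positive values."""
    return len(values) / sum(1.0 / v for v in values)


def mips_summary(m, times):
    """Compare the harmonic mean of rates with m / (mean time).

    m is the instruction count in millions, times are run times in seconds.
    mips_summary is a hypothetical helper name.
    """
    rates = [m / t for t in times]
    mean_time = sum(times) / len(times)
    return harmonic_mean(rates), m / mean_time


if __name__ == "__main__":
    # Hypothetical benchmark: 100 million instructions, three repetitions.
    print(mips_summary(100.0, [2.0, 2.5, 4.0]))
```

Both numbers in the returned pair should agree up to floating-point rounding, whereas the arithmetic mean of the three rates would be higher.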
Mean of Ratios
  Problem: given a set of n ratios, summarize them as a single number
  Example: summarize the MIPS rate for a processor across different workloads
    harmonic mean unsuitable: $\sum_i t_i/m_i$ has no meaning
  Approach: consider additivity of numerators and denominators separately
Rules for Means of Ratios - I
  If numerator and denominator each have meaning
    compute the average of the ratios as the ratio of the averages
    e.g. average MIPS for different workloads
      $\text{Average}\left(\frac{m_1}{t_1}, \frac{m_2}{t_2}, \ldots, \frac{m_n}{t_n}\right) = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i} = \frac{\bar{m}}{\bar{t}}$
    e.g. mean CPU utilization = (sum of CPU busy times) / (sum of measurement durations)
  If denominator is a constant and numerator has meaning
    the arithmetic mean of the ratios can be used
    e.g. resource utilization per constant interval (page faults over one-hour intervals)
      $\text{Average}\left(\frac{p_1}{t}, \frac{p_2}{t}, \ldots, \frac{p_n}{t}\right) = \frac{\sum_{i=1}^{n} p_i}{nt}$
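A short sketch of the two rules on this slide follows. The workload numbers are invented, and both function names are assumptions; the first computes the ratio of sums, the second the constant-denominator case, which coincides with the arithmetic mean of the individual ratios.

```python
def mean_of_ratios(numerators, denominators):
    """Ratio of sums: appropriate when both sums have physical meaning."""
    return sum(numerators) / sum(denominators)


def mean_with_constant_denominator(numerators, t):
    """sum(p_i) / (n * t): the arithmetic mean of the ratios p_i / t."""
    return sum(numerators) / (len(numerators) * t)


if __name__ == "__main__":
    # Hypothetical workloads: millions of instructions and seconds.
    m = [100.0, 400.0]
    t = [2.0, 4.0]
    print(mean_of_ratios(m, t))                         # total instructions / total time
    print(sum(mi / ti for mi, ti in zip(m, t)) / 2)     # naive mean of rates, for contrast

    # Hypothetical page-fault counts over equal-length intervals of t = 2 hours.
    print(mean_with_constant_denominator([3, 5, 10], 2.0))
```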
Rules for Means of Ratios - II
  If numerator is constant and denominator has meaning
    the harmonic mean of the ratios should be used to summarize them
    e.g. computing the mean MIPS rate for a processor using n observations of the same benchmark
      $\text{Average}\left(\frac{m}{t_1}, \frac{m}{t_2}, \ldots, \frac{m}{t_n}\right) = \frac{n}{t_1/m + t_2/m + \cdots + t_n/m} = \frac{nm}{\sum_{i=1}^{n} t_i}$
  If numerator and denominator approximately follow a multiplicative property
    i.e. $a_i = c\,b_i$, where c is approximately a constant being estimated
    estimate c from the geometric mean of $a_i/b_i$
SPEC Metrics?
  The elapsed time in seconds for each of the benchmarks in the CINT2000 or CFP2000 suite is given, and the ratio to the reference machine (Sun Ultra 10) is calculated.
  How should one compute a summary ratio?
  The SPECint_base2000 and SPECfp_base2000 metrics are calculated as a geometric mean of the individual ratios, where each ratio is based on the median execution time from an odd number of runs, greater than or equal to 3.
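The procedure described on this slide (median of an odd number of runs, ratio against a reference machine, geometric mean of the ratios) can be sketched as follows. The benchmark names and timings are invented, the direction of the ratio (reference time divided by measured time) is an assumption, and any scaling factor used in official reporting is deliberately left out.

```python
import math
import statistics


def spec_style_summary(reference_times, run_times):
    """Geometric mean of per-benchmark ratios, each based on the median run time.

    Hypothetical helper. Assumes ratio = reference time / median measured time,
    so larger values mean a faster machine.
    """
    ratios = []
    for name, ref in reference_times.items():
        median_time = statistics.median(run_times[name])
        ratios.append(ref / median_time)
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Invented benchmark names and times in seconds, three runs each.
    reference = {"bench_a": 100.0, "bench_b": 400.0}
    runs = {"bench_a": [50.0, 48.0, 55.0], "bench_b": [50.0, 60.0, 45.0]}
    print(spec_style_summary(reference, runs))
```

With these invented numbers the individual ratios are 2 and 8, so the summary is their geometric mean.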
Code Size Optimization with a GA
  Cooper, Schielke and Subramanian, LCTES 99
  How should one compute a summary ratio?
Summarizing Variability
Selecting the Index of Dispersion
  Is the distribution bounded? yes: use range; no: continue
  Is the distribution unimodal and symmetrical? yes: use C.O.V.; no: use percentiles or SIQR
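The indices named on this slide can be computed with the standard library. The sketch below assumes Python 3.8 or later for `statistics.quantiles`, takes C.O.V. to be the sample standard deviation divided by the mean, and takes SIQR to be half the distance between the first and third quartiles; the function name and sample data are illustrative only.

```python
import statistics


def dispersion_summary(data):
    """Range, coefficient of variation, and semi-interquartile range of a sample.

    dispersion_summary is a hypothetical helper name.
    """
    mean = statistics.fmean(data)
    std = statistics.stdev(data)  # sample standard deviation
    # The "inclusive" method treats the data as containing its own min and max.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "range": max(data) - min(data),
        "cov": std / mean,
        "siqr": (q3 - q1) / 2,
    }


if __name__ == "__main__":
    print(dispersion_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

Other quartile conventions exist, so the SIQR of a small sample can differ slightly depending on the method chosen.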
Determining Distribution of Data
  Can summarize data by its average and variability
  More complete summary: the type of distribution
    e.g. "number of I/O calls is uniformly distributed over 1-25" is more meaningful than "mean is 13, variance is 48"
  Distribution is useful for simulation or analytical modeling
  How to determine the distribution?
    histogram: determine the range, divide it into cells, plot a histogram of observations
      guideline: if a cell has < 5 observations, increase the cell size or use variable cell sizes
    quantile-quantile plot
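The histogram guideline can be applied mechanically. The sketch below is one possible approach, not a prescribed one: it uses equal-width cells and widens them (by reducing the number of cells) until every cell holds at least 5 observations. Function names and the starting cell count are assumptions.

```python
def histogram(data, n_cells):
    """Counts of observations in n_cells equal-width cells spanning the data range."""
    lo, hi = min(data), max(data)
    if hi == lo:
        return [len(data)]  # degenerate range: everything falls in one cell
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for x in data:
        # The maximum value would land one past the end, so clamp it into the last cell.
        idx = min(int((x - lo) / width), n_cells - 1)
        counts[idx] += 1
    return counts


def histogram_with_min_count(data, n_cells, min_count=5):
    """Reduce the number of cells until each cell has at least min_count observations."""
    while n_cells > 1:
        counts = histogram(data, n_cells)
        if min(counts) >= min_count:
            return counts
        n_cells -= 1  # fewer cells means a larger cell size
    return histogram(data, 1)


if __name__ == "__main__":
    # 25 observations, one at each integer from 1 to 25 (illustrative data).
    print(histogram_with_min_count(list(range(1, 26)), n_cells=10))
```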
Quantile-Quantile Plots
  Compare observed quantiles with those of a theoretical distribution
  Suppose $y_{(j)}$ is the observed $\alpha_j$-quantile
    sort the observations; the α-quantile is $x_{[\alpha(n-1)+1]}$
  Use the theoretical distribution to compute the $\alpha_j$-quantile $x_j$
    to determine $x_j$, need to invert the CDF: $\alpha_j = F(x_j)$; then $x_j = F^{-1}(\alpha_j)$
    if the CDF is invertible, then great!
    if not, use tables and interpolate, or compute iteratively
  Plot $(x_j, y_{(j)})$
  If the observations come from the theoretical distribution, the quantile-quantile plot will be linear
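The steps above can be sketched for a normal theoretical distribution. This is a minimal illustration, assuming Python 3.8 or later for `statistics.NormalDist` and assuming plotting positions $\alpha_j = (j - 0.5)/n$; the function name is made up, and the sample values are arbitrary.

```python
from statistics import NormalDist


def normal_qq_points(observations):
    """Return (x_j, y_(j)) pairs for a normal quantile-quantile plot.

    y_(j) is the j-th smallest observation; x_j is the N(0,1) quantile at
    alpha_j = (j - 0.5) / n. normal_qq_points is a hypothetical helper.
    """
    ys = sorted(observations)
    n = len(ys)
    unit_normal = NormalDist(0.0, 1.0)
    return [
        (unit_normal.inv_cdf((j - 0.5) / n), y)
        for j, y in enumerate(ys, start=1)
    ]


if __name__ == "__main__":
    # Arbitrary illustrative sample.
    sample = [-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.09, 0.04]
    for x, y in normal_qq_points(sample):
        print(f"{x:8.3f} {y:6.2f}")
```

Plotting the returned pairs and judging how close they lie to a straight line is the visual test the slide describes.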
Using Quantile-Quantile Plots
  Difference between measured and predicted values on a system is modeling error
  Modeling error for 8 predictions {-.04, -.19, .14, -.09, -.14, .19, .09, .04}

    j   α_j = (j-.5)/n    y_j     x_j
    1   1/16  = .0625     -.19
    2   3/16  = .1875     -.14
    3   5/16  = .3125     -.09
    4   7/16  = .4375     -.04
    5   9/16  = .5625      .04
    6   11/16 = .6875      .09
    7   13/16 = .8125      .14
    8   15/16 = .9375      .19

  [Figure: CDF for N(0,1)]
Using Quantile-Quantile Plots
  Difference between measured and predicted values on a system is modeling error
  Modeling error for 8 predictions {-.04, -.19, .14, -.09, -.14, .19, .09, .04}
  $x_j = 4.91\,[\alpha_j^{0.14} - (1-\alpha_j)^{0.14}]$ approximates inversion of N(0,1)

    j   α_j = (j-.5)/n    y_j     x_j
    1   1/16  = .0625     -.19    -1.535
    2   3/16  = .1875     -.14     -.885
    3   5/16  = .3125     -.09     -.487
    4   7/16  = .4375     -.04     -.157
    5   9/16  = .5625      .04      .157
    6   11/16 = .6875      .09      .487
    7   13/16 = .8125      .14      .885
    8   15/16 = .9375      .19     1.535

  [Figure: CDF for N(0,1)]
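The approximation $x_j = 4.91\,[\alpha_j^{0.14} - (1-\alpha_j)^{0.14}]$ is simple enough to evaluate directly. The sketch below recomputes the $x_j$ column for the 8 modeling errors; the function name is an assumption. By symmetry of the formula, the value for $\alpha_8 = .9375$ is the negative of the value for $\alpha_1 = .0625$, that is, about 1.535.

```python
def approx_unit_normal_quantile(alpha):
    """Approximate the alpha-quantile of N(0,1) with the slide's closed-form formula.

    approx_unit_normal_quantile is a hypothetical helper name.
    """
    return 4.91 * (alpha ** 0.14 - (1.0 - alpha) ** 0.14)


if __name__ == "__main__":
    errors = [-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.09, 0.04]
    ys = sorted(errors)
    n = len(ys)
    for j, y in enumerate(ys, start=1):
        alpha = (j - 0.5) / n
        x = approx_unit_normal_quantile(alpha)
        print(f"{j}  {alpha:.4f}  {y:5.2f}  {x:7.3f}")
```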
Using Quantile-Quantile Plots
  Difference between measured and predicted values on a system is modeling error
  Modeling error for 8 predictions {-.04, -.19, .14, -.09, -.14, .19, .09, .04}

    j   α_j = (j-.5)/n    y_j     x_j
    1   .0625             -.19    -1.535
    2   .1875             -.14     -.885
    3   .3125             -.09     -.487
    4   .4375             -.04     -.157
    5   .5625              .04      .157
    6   .6875              .09      .487
    7   .8125              .14      .885
    8   .9375              .19     1.535
Interpreting Normal Quantile-Quantile Plots
  [Figure: four normal quantile-quantile plots, with panels titled Normal, Long tails, Asymmetric, Short tails]
Working with PMF
  Traffic arriving at a gateway is bursty. The burst size is distributed geometrically with the following PMF
    $f(x) = (1-p)^{x-1}\,p$, $x = 1, 2, \ldots$
  Compute the mean burst size
  Compute the variance of the burst size
  Compute the standard deviation of the burst size
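For this geometric PMF the standard closed forms are mean $1/p$, variance $(1-p)/p^2$, and standard deviation $\sqrt{1-p}/p$. The sketch below checks them numerically by summing the series $\sum x f(x)$ and $\sum x^2 f(x)$ directly; the value of p, the truncation point, and the function name are assumptions made for illustration.

```python
import math


def geometric_moments(p, terms=2000):
    """Mean, variance, std dev of f(x) = (1-p)^(x-1) * p, x = 1, 2, ...

    The infinite sums are truncated after `terms` terms, which is adequate
    when (1-p)^terms is negligible. geometric_moments is a hypothetical helper.
    """
    mean = 0.0
    second_moment = 0.0
    for x in range(1, terms + 1):
        fx = (1.0 - p) ** (x - 1) * p
        mean += x * fx
        second_moment += x * x * fx
    variance = second_moment - mean ** 2   # Var[x] = E[x^2] - (E[x])^2
    return mean, variance, math.sqrt(variance)


if __name__ == "__main__":
    p = 0.25  # illustrative value
    print(geometric_moments(p))
    print(1 / p, (1 - p) / p ** 2, math.sqrt(1 - p) / p)  # closed forms
```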