Summarizing Measured Data

Dr. John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@cs.rice.edu
COMP 528 Lecture 7, 3 February 2005

Goals for Today
- Finish discussion of the Normal distribution and its properties
- Finish material on summarizing measured data
- Solve a problem using a PMF

Normal Distribution
- N(µ, σ): the most commonly used distribution in data analysis (also known as a Gaussian distribution)
- pdf: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$, $-\infty \le x \le \infty$, where µ = mean and σ = standard deviation
- N(µ = 0, σ = 1): the unit normal distribution, with pdf $f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$

Quantile, Percentile, Median & Mode
- α-quantile: the x value at which the CDF takes the value α, denoted $x_\alpha$: $P(x \le x_\alpha) = F(x_\alpha) = \alpha$
- 100α-percentile: the x value at which the CDF takes the value α
- Median = 50-percentile = 0.5-quantile
- Mode = most likely value
  - for a discrete variable, the $x_i$ that has the highest probability
  - for a continuous variable, the x where the pdf is maximum

Quantiles of the Normal Distribution
- $z_\alpha$: α-quantile of the unit normal variate z ~ N(0,1)
- If x has a normal distribution, x ~ N(µ, σ), then $P\left(\frac{x-\mu}{\sigma} \le z_\alpha\right) = \alpha$, or equivalently $P(x \le \mu + \sigma z_\alpha) = \alpha$
[Figure: PDF and CDF of N(0,1), marking the 0.5-quantile (50-percentile) and the 0.8-quantile (80-percentile)]
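As a rough illustration of the relation $x_\alpha = \mu + \sigma z_\alpha$, the following Python sketch uses the standard library's `statistics.NormalDist`. The helper name `normal_quantile` and the example parameters (µ = 10, σ = 2) are assumptions made for illustration, not part of the lecture.

```python
from statistics import NormalDist

def normal_quantile(alpha, mu=0.0, sigma=1.0):
    """Return the alpha-quantile of N(mu, sigma) as mu + sigma * z_alpha."""
    z_alpha = NormalDist().inv_cdf(alpha)  # alpha-quantile of the unit normal
    return mu + sigma * z_alpha

# Median and 0.8-quantile of the unit normal, then a shifted and scaled example.
print(normal_quantile(0.5))
print(normal_quantile(0.8))
print(normal_quantile(0.8, mu=10.0, sigma=2.0))  # hypothetical mu and sigma
```

The last call should agree with `NormalDist(10.0, 2.0).inv_cdf(0.8)`, which inverts the CDF of N(10, 2) directly.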

Properties of the Normal Distribution
- Linearity: a sum of n independent normal variates is a normal variate
- If $x_i \sim N(\mu_i, \sigma_i)$, then $x = \sum_{i=1}^{n} a_i x_i$ has a normal distribution with mean $\mu = \sum_{i=1}^{n} a_i \mu_i$ and variance $\sigma^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2$
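The linearity property can be sketched numerically as follows. The function name `linear_combination` and the example variates are hypothetical, and σ is treated as a standard deviation, matching the N(µ, σ) notation used in these slides.

```python
import math

def linear_combination(coeffs, means, sigmas):
    """Mean and standard deviation of sum(a_i * x_i) for independent
    normal variates x_i ~ N(mean_i, sigma_i)."""
    mu = sum(a * m for a, m in zip(coeffs, means))
    variance = sum(a * a * s * s for a, s in zip(coeffs, sigmas))
    return mu, math.sqrt(variance)

# Hypothetical example: x1 ~ N(1, 3), x2 ~ N(2, 4), x = x1 + x2.
mu, sigma = linear_combination([1.0, 1.0], [1.0, 2.0], [3.0, 4.0])
print(mu, sigma)
```

Note that the variances add, not the standard deviations: here the variance is 9 + 16 = 25, so the standard deviation of the sum is 5 rather than 7.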

Central Limit Theorem
- The sum of a large number of independent observations from any distribution tends to have a normal distribution
- This holds regardless of the distribution of the individual observations, provided its variance is finite
- Thus experimental errors, which arise from many factors, are modeled with a normal distribution
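A small simulation can make the theorem concrete. This is only a sketch under stated assumptions: the choice of Uniform(0, 1) observations, 12 terms per sum, 20000 sums, and the helper names are all illustrative and do not come from the lecture.

```python
import random
import statistics

def sums_of_uniforms(n_terms, n_sums, seed=0):
    """Return n_sums sums, each of n_terms independent Uniform(0, 1) draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [sum(rng.random() for _ in range(n_terms)) for _ in range(n_sums)]

def fraction_within_one_sigma(samples):
    """Fraction of samples within one standard deviation of their mean."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    inside = sum(1 for s in samples if abs(s - mu) <= sigma)
    return inside / len(samples)

sums = sums_of_uniforms(n_terms=12, n_sums=20000)
print(statistics.fmean(sums), statistics.pstdev(sums))
print(fraction_within_one_sigma(sums))
```

Each Uniform(0, 1) draw has mean 1/2 and variance 1/12, so the sums should have mean near 6 and standard deviation near 1. For a normal distribution about 68% of values fall within one standard deviation of the mean, which the last figure can be compared against.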

Means and Their Uses

Arithmetic Mean
- Arithmetic mean of values {x_1, x_2, ..., x_n}: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Caution: the arithmetic mean is not always an appropriate index of central tendency
- Selecting an index of central tendency:
  - Is the data categorical? yes: use mode; no: continue
  - Is the total of interest? yes: use mean; no: continue
  - Is the distribution skewed? yes: use median; no: use mean
- Median = 50th percentile value
- Mode = most frequent value, e.g. most frequent destination for packets

Common Misuses of Arithmetic Means
- Mean of significantly different values
  - correct index, but useless nonetheless
  - not useful: mean CPU time is 505 ms when values are 10 ms and 1000 ms
- Using the mean without considering skew
  - if variability is too large, the mean may not be a representative value
  - e.g. mean({5,5,5,4,31}) = 10: typical value is 5, mean is useless
- Multiplying arithmetic means to get the mean of a product
  - the mean of a product of random variables is only equal to the product of the means if the variables are independent
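Two of these misuses can be checked in a few lines of Python. The first part uses the sample from the slide; the dependent pair `x`, `y` in the second part is a hypothetical example chosen for illustration.

```python
import statistics

# Skewed sample from the slide: the mean is pulled up by the single large value.
sample = [5, 5, 5, 4, 31]
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
print(sample_mean, sample_median)

# Hypothetical dependent pair: y is identical to x, so they are not independent.
x = [1, 2, 3]
y = x
mean_of_product = statistics.mean([a * b for a, b in zip(x, y)])
product_of_means = statistics.mean(x) * statistics.mean(y)
print(mean_of_product, product_of_means)
```

For the skewed sample the median recovers the typical value 5 while the mean is 10. For the dependent pair, the mean of the product (14/3) differs from the product of the means (4).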

Geometric Mean
- Geometric mean of a sample {x_1, x_2, ..., x_n}: $\dot{x} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$
- Arithmetic mean vs. geometric mean
  - geometric: if the product of terms is of interest
  - arithmetic: if the sum of observations is of interest
- Examples of metrics that work in a multiplicative manner
  - cache miss ratios over several levels of cache
    - L3misses = Loads * L1missrate * L2missrate * L3missrate
    - Avg miss rate per level = (L1missrate * L2missrate * L3missrate)^(1/3)
  - percentage improvement between successive versions
  - average error rate per hop in a multi-hop network
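The cache example can be sketched as follows. The miss rates and the load count are hypothetical numbers picked so the arithmetic is easy to follow; `geometric_mean` is a helper defined here (the standard library also offers `statistics.geometric_mean`).

```python
import math

def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical per-level miss rates for L1, L2 and L3.
miss_rates = [0.10, 0.20, 0.40]
avg_miss_rate = geometric_mean(miss_rates)

# Hypothetical number of loads; L3 misses follow the multiplicative model.
loads = 1_000_000
l3_misses = loads * math.prod(miss_rates)

print(avg_miss_rate)
print(l3_misses, loads * avg_miss_rate ** 3)
```

The point of the geometric mean here is that applying the average miss rate at each of the three levels reproduces the same number of L3 misses as the individual rates do, which the arithmetic mean of the rates would not.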

Harmonic Mean
- Harmonic mean of a sample {x_1, x_2, ..., x_n}: $\ddot{x} = \frac{n}{1/x_1 + 1/x_2 + \cdots + 1/x_n}$
- Use whenever an arithmetic mean can be justified for 1/x_i
- Example: MIPS rate
  - suppose a benchmark has m million instructions
  - the MIPS rate x_i from the ith repetition is m/t_i
  - avg. time: use the arithmetic mean, since avg. time has physical meaning (the sum of 1/x_i has physical meaning)
  - avg. MIPS for multiple runs of one benchmark: harmonic mean
    $\ddot{x} = \frac{n}{\frac{1}{m/t_1} + \frac{1}{m/t_2} + \cdots + \frac{1}{m/t_n}} = \frac{m}{(1/n)(t_1 + t_2 + \cdots + t_n)}$
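The MIPS identity above can be checked numerically. The instruction count and run times below are hypothetical; `statistics.harmonic_mean` and `statistics.fmean` are standard library functions.

```python
import statistics

m = 100.0                  # hypothetical: benchmark size in millions of instructions
times = [2.0, 4.0, 5.0]    # hypothetical run times in seconds for three repetitions
mips = [m / t for t in times]

harmonic = statistics.harmonic_mean(mips)
from_mean_time = m / statistics.fmean(times)  # m divided by the average time
arithmetic = statistics.fmean(mips)

print(harmonic, from_mean_time, arithmetic)
```

The harmonic mean of the rates matches m divided by the average run time, whereas the arithmetic mean of the rates comes out higher and does not correspond to the total work over the total time.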

Mean of Ratios
- Problem: given a set of n ratios, summarize them as a single number
- Example: summarize the MIPS rate for a processor over different workloads
  - harmonic mean unsuitable: $\sum_i t_i/m_i$ has no meaning
- Approach: consider additivity of numerators and denominators separately

Rules for Means of Ratios - I
- If numerator and denominator each have meaning, compute the average of the ratios as the ratio of averages
  - e.g. average MIPS for different workloads: $\text{Average}\left(\frac{m_1}{t_1}, \frac{m_2}{t_2}, \ldots, \frac{m_n}{t_n}\right) = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i} = \frac{\bar{m}}{\bar{t}}$
  - e.g. mean CPU utilization = (sum of CPU busy times) / (sum of measurement durations)
- If the denominator is a constant and the numerator has meaning, the arithmetic mean of the ratios applies
  - e.g. resource utilization per constant interval (page faults over one-hour intervals): $\text{Average}\left(\frac{p_1}{t}, \frac{p_2}{t}, \ldots, \frac{p_n}{t}\right) = \frac{\sum_{i=1}^{n} p_i}{nt}$
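A short sketch contrasting the two ways of averaging ratios. The function names and all measurement values are hypothetical and serve only to illustrate the rule.

```python
def ratio_of_sums(numerators, denominators):
    """Summarize ratios as (sum of numerators) / (sum of denominators)."""
    return sum(numerators) / sum(denominators)

def mean_of_ratios(numerators, denominators):
    """Plain arithmetic mean of the individual ratios."""
    ratios = [a / b for a, b in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)

# Hypothetical CPU busy times and measurement durations, in seconds.
busy = [30.0, 10.0]
durations = [60.0, 40.0]
print(ratio_of_sums(busy, durations))   # utilization over the whole 100 s
print(mean_of_ratios(busy, durations))  # ignores that the intervals differ in length

# Hypothetical page-fault counts over equal one-hour intervals.
faults = [12.0, 18.0, 30.0]
hours = [1.0, 1.0, 1.0]
print(ratio_of_sums(faults, hours), mean_of_ratios(faults, hours))
```

With unequal durations the two summaries differ (0.4 against 0.375), and only the ratio of sums equals total busy time over total measured time. With a constant denominator the two coincide, which is why the arithmetic mean is acceptable in that case.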

Rules for Means of Ratios - II
- If the numerator is constant and the denominator has meaning, the harmonic mean of the ratios should be used to summarize them
  - e.g. computing the mean MIPS rate for a processor using n observations of the same benchmark: $\text{Average}\left(\frac{m}{t_1}, \frac{m}{t_2}, \ldots, \frac{m}{t_n}\right) = \frac{n}{t_1/m + t_2/m + \cdots + t_n/m} = \frac{nm}{\sum_{i=1}^{n} t_i}$
- If numerator and denominator approximately follow a multiplicative property, i.e. $a_i = c\,b_i$, where c is approximately a constant being estimated, estimate c from the geometric mean of $a_i/b_i$

SPEC Metrics
- "The elapsed time in seconds for each of the benchmarks in the CINT2000 or CFP2000 suite is given and the ratio to the reference machine (Sun Ultra 10) is calculated."
- How should one compute a summary ratio?
- "The SPECint_base2000 and SPECfp_base2000 metrics are calculated as a Geometric Mean of the individual ratios, where each ratio is based on the median execution time from an odd number of runs, greater than or equal to 3."
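The procedure described in the quoted text (median of an odd number of runs, ratio to a reference machine, geometric mean of the ratios) might be sketched as below. This is not SPEC's tooling: the benchmark names, all times, the function name `summary_ratio`, and the choice of ratio direction (reference time divided by measured time, with no scaling factor) are assumptions made for illustration.

```python
import statistics

def summary_ratio(reference_times, run_times):
    """Geometric mean of per-benchmark ratios, each ratio being the
    reference time divided by the median measured time (assumed direction)."""
    ratios = []
    for name, ref in reference_times.items():
        median_time = statistics.median(run_times[name])
        ratios.append(ref / median_time)
    return statistics.geometric_mean(ratios)

# Hypothetical reference-machine times and three measured runs per benchmark (seconds).
reference_times = {"bench_a": 1400.0, "bench_b": 1800.0}
run_times = {
    "bench_a": [700.0, 690.0, 710.0],
    "bench_b": [450.0, 460.0, 440.0],
}
print(summary_ratio(reference_times, run_times))
```

Here the medians are 700 s and 450 s, giving ratios of 2 and 4, so the summary is the square root of 8. Using the median of an odd number of runs means each ratio comes from an actual measured run rather than an interpolated value.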

Code Size Optimization with a GA
- Cooper, Schielke and Subramanian, LCTES 99
- How should one compute a summary ratio?

Summarizing Variability

Selecting the Index of Dispersion
- Is the distribution bounded? yes: use range; no: continue
- Is the distribution unimodal and symmetrical? yes: use C.O.V.; no: use percentiles or SIQR

Determining Distribution of Data
- Can summarize data by its average and variability
- More complete summary: type of distribution
  - e.g. "number of I/O calls is uniformly distributed between 1 and 25" is more meaningful than "mean is 13, variance is 48"
- Distribution is useful for simulation or analytical modeling
- How to determine the distribution?
  - determine the range, divide it into cells, plot a histogram of the observations
  - guideline: if a cell has < 5 observations, increase the cell size or use a variable cell size histogram
  - quantile-quantile plot
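The histogram guideline can be sketched as a search for a cell count at which every cell holds at least 5 observations. This is one possible reading of the guideline using equal-width cells only; the function names, the starting cell count, and the test data are hypothetical.

```python
def histogram(data, n_cells):
    """Count observations in n_cells equal-width cells spanning the data range."""
    lo, hi = min(data), max(data)
    if hi == lo:
        return [len(data)]  # degenerate range: everything falls in one cell
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for x in data:
        # The maximum value is placed in the last cell.
        idx = min(int((x - lo) / width), n_cells - 1)
        counts[idx] += 1
    return counts

def choose_cell_count(data, start_cells=20, min_count=5):
    """Reduce the number of cells (i.e. widen them) until every cell
    holds at least min_count observations."""
    for n_cells in range(start_cells, 0, -1):
        counts = histogram(data, n_cells)
        if min(counts) >= min_count:
            return len(counts), counts
    return 1, [len(data)]

# Hypothetical data: 100 observations spread evenly over the integers 1..25.
data = list(range(1, 26)) * 4
n_cells, counts = choose_cell_count(data)
print(n_cells, counts)
```

Starting from many narrow cells and widening them is only one strategy; the slide also mentions variable cell sizes, which this sketch does not attempt.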

Quantile-Quantile Plots
- Compare observed quantiles with those of a theoretical distribution
- Suppose $y_{(j)}$ is the observed $\alpha_j$-quantile
  - sort the observations; the α-quantile is $x_{[\alpha(n-1)+1]}$
- Use the theoretical distribution to compute the $\alpha_j$-quantile $x_j$
  - to determine $x_j$, need to invert the CDF: $\alpha_j = F(x_j)$; then $x_j = F^{-1}(\alpha_j)$
  - if the CDF is invertible, then great!
  - if not, use tables and interpolate, or compute iteratively
- Plot $(x_j, y_{(j)})$
- If the observations come from the theoretical distribution, the quantile-quantile plot will be linear

Using Quantile-Quantile Plots
- The difference between measured and predicted values on a system is the modeling error
- Modeling error for 8 predictions: {-.04, -.19, .14, -.09, -.14, .19, .09, .04}
- $x_j = 4.91\left[\alpha_j^{0.14} - (1-\alpha_j)^{0.14}\right]$ approximates inversion of the N(0,1) CDF

  j    α_j = (j - .5)/n    y_j     x_j
  1    1/16  = .0625       -.19    -1.535
  2    3/16  = .1875       -.14    -.885
  3    5/16  = .3125       -.09    -.487
  4    7/16  = .4375       -.04    -.157
  5    9/16  = .5625        .04     .157
  6    11/16 = .6875        .09     .487
  7    13/16 = .8125        .14     .885
  8    15/16 = .9375        .19    1.535

[Figure: CDF for N(0,1)]
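The table can be reproduced with a short Python sketch that applies the slide's approximation $x_j = 4.91[\alpha_j^{0.14} - (1-\alpha_j)^{0.14}]$ to the eight modeling errors. The function names are hypothetical; the data and the formula come from the slide.

```python
def approx_unit_normal_quantile(alpha):
    """Approximate inverse CDF of N(0,1) using the formula from the slide."""
    return 4.91 * (alpha ** 0.14 - (1.0 - alpha) ** 0.14)

def qq_points(observations):
    """Return (x_j, y_j) pairs: theoretical normal quantile vs. sorted observation."""
    y = sorted(observations)
    n = len(y)
    points = []
    for j, y_j in enumerate(y, start=1):
        alpha_j = (j - 0.5) / n  # quantile level assigned to the j-th smallest value
        points.append((approx_unit_normal_quantile(alpha_j), y_j))
    return points

errors = [-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.09, 0.04]
for x_j, y_j in qq_points(errors):
    print(f"{x_j:7.3f}  {y_j:5.2f}")
```

Plotting the pairs with any plotting tool and checking whether they fall close to a straight line is the visual test for normality described in the slides.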

Interpreting Normal Quantile-Quantile Plots
[Figure: four example normal quantile-quantile plots, with panels titled Normal, Long tails, Asymmetric, Short tails]

Working with PMF
- Traffic arriving at a gateway is bursty. The burst size is distributed geometrically with the following PMF: $f(x) = (1-p)^{x-1}\,p$, $x = 1, 2, \ldots$
- Compute the mean burst size
- Compute the variance of the burst size
- Compute the standard deviation of the burst size
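One way to check an answer to this exercise is to sum the PMF numerically and compare against the standard closed forms for this geometric distribution: mean $1/p$, variance $(1-p)/p^2$, standard deviation $\sqrt{1-p}/p$. The sketch below does that; the value p = 0.25, the truncation point, and the function names are assumptions chosen for illustration.

```python
import math

def geometric_pmf(x, p):
    """PMF from the slide: f(x) = (1 - p)^(x - 1) * p for x = 1, 2, ..."""
    return (1.0 - p) ** (x - 1) * p

def burst_statistics(p, max_x=10_000):
    """Mean, variance and standard deviation from truncated sums over the PMF.
    The truncation at max_x is an approximation; the neglected tail is tiny
    unless p is very small."""
    xs = range(1, max_x + 1)
    mean = sum(x * geometric_pmf(x, p) for x in xs)
    second_moment = sum(x * x * geometric_pmf(x, p) for x in xs)
    variance = second_moment - mean ** 2
    return mean, variance, math.sqrt(variance)

p = 0.25  # hypothetical parameter value
mean, variance, std_dev = burst_statistics(p)
print(mean, 1 / p)
print(variance, (1 - p) / p ** 2)
print(std_dev, math.sqrt(1 - p) / p)
```

Each printed pair should agree closely, with the numeric sum on the left and the closed form on the right.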