SUMMARIZING MEASURED DATA. Gaia Maselli


SUMMARIZING MEASURED DATA Gaia Maselli maselli@di.uniroma1.it

Computer Network Performance

Overview
- Basic concepts
- Summarizing measured data
  - Summarizing data by a single number
  - Summarizing variability
  - Determining distribution of data

Motivation
A measurement project may produce anywhere from several hundred to millions of observations on a given variable. To present the measurements, the data must be summarized.
- How should the performance be reported as a single number? Is specifying the mean the correct way?
- How should the variability of measured quantities be reported? What are the alternatives to variance, and when are they appropriate?

Basic concepts of probability and statistics
Independent events: two events are called independent if the occurrence of one event does not in any way affect the probability of the other event.
- Example: successive throws of a coin.
Random variable: a variable is called a random variable if it takes one of a specified set of values with a specified probability. The outcome of a random event or experiment that yields a numeric value is a random variable.
- Example: execution time.

CDF, PDF, and PMF
Cumulative Distribution Function (CDF): the CDF of a random variable maps a given value a to the probability of the variable taking a value less than or equal to a. For a random variable x:
$F_x(a) = P(x \le a)$
The CDF starts at 0 and ends at 1.
Probability Density Function (PDF): the derivative of the CDF,
$f(x) = \frac{dF(x)}{dx}$
The PDF starts at 0 and ends at 0.

CDF, PDF, and PMF (Cont)
Given a probability density function f(x), the probability of x being in the interval (x_1, x_2) can be computed as
$P(x_1 < x \le x_2) = F(x_2) - F(x_1) = \int_{x_1}^{x_2} f(x)\,dx$
Probability Mass Function (PMF): for discrete random variables, the PMF gives the probability of each possible value x_i:
$f(x_i) = P(x = x_i) = p_i$
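The relationship between a PMF and its CDF can be checked numerically. The following is a minimal sketch, assuming a fair six-sided die as the discrete random variable (the die and the helper names `cdf` and `prob_interval` are illustrative and not part of the slides).

```python
from fractions import Fraction

# Hypothetical example: PMF of a fair six-sided die, f(x_i) = 1/6 for each face.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def cdf(a):
    """F(a) = P(x <= a): sum the PMF over all values not exceeding a."""
    return sum(p for x, p in pmf.items() if x <= a)


def prob_interval(x1, x2):
    """P(x1 < x <= x2) = F(x2) - F(x1)."""
    return cdf(x2) - cdf(x1)


# The CDF starts at 0 (below the smallest value) and ends at 1.
print(cdf(0), cdf(3), cdf(6))
# Probability of rolling 3, 4, or 5.
print(prob_interval(2, 5))
```

Exact fractions are used so that the sums are not affected by floating-point rounding.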

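Mean, Variance, CoV
Mean or expected value:
$\mu = E(x) = \sum_{i} p_i x_i$ (discrete case), $\mu = E(x) = \int_{-\infty}^{+\infty} x f(x)\,dx$ (continuous case)
Variance: the expected value of the square of the distance between x and its mean,
$\mathrm{Var}(x) = \sigma^2 = E[(x - \mu)^2]$
Standard deviation: $\sigma = \sqrt{\mathrm{Var}(x)}$
Coefficient of variation: $\mathrm{CoV} = \sigma / \mu$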
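As a small worked illustration of these three quantities, the sketch below computes them from a PMF. The fair-die PMF is an assumption chosen for illustration only.

```python
import math
from fractions import Fraction

# Hypothetical PMF: fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Mean: E(x) = sum of p_i * x_i.
mean = sum(p * x for x, p in pmf.items())

# Variance: E[(x - mean)^2].
variance = sum(p * (x - mean) ** 2 for x, p in pmf.items())

# Standard deviation and coefficient of variation (dimensionless).
std_dev = math.sqrt(variance)
cov = std_dev / float(mean)

print(mean, variance, round(std_dev, 4), round(cov, 4))
```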

Covariance and Correlation
Covariance:
$\mathrm{Cov}(x, y) = \sigma^2_{xy} = E[(x - \mu_x)(y - \mu_y)] = E(xy) - E(x)E(y)$
For independent variables the covariance is zero, since $E(xy) = E(x)E(y)$. Although independence always implies zero covariance, the reverse is not true.
Correlation coefficient: the normalized value of covariance,
$\rho_{xy} = \frac{\sigma^2_{xy}}{\sigma_x \sigma_y}$
The correlation always lies between -1 and +1.
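The claim that zero covariance does not imply independence can be illustrated with a standard counterexample. The sketch below assumes x is uniform on {-1, 0, 1} and y = x^2; this pair is a textbook illustration, not an example taken from the slides.

```python
from fractions import Fraction

# Hypothetical variables: x uniform on {-1, 0, 1}, y = x^2 (fully determined by x).
xs = [-1, 0, 1]
p = Fraction(1, 3)


def expect(g):
    """Expected value of g(x) under the uniform distribution on xs."""
    return sum(p * g(x) for x in xs)


e_x = expect(lambda x: x)            # E(x)
e_y = expect(lambda x: x * x)        # E(y) with y = x^2
e_xy = expect(lambda x: x * x * x)   # E(xy) = E(x^3)

covariance = e_xy - e_x * e_y

# Independence would require P(x=0, y=0) = P(x=0) * P(y=0).
p_joint = p          # x = 0 forces y = 0
p_product = p * p    # P(x=0) = 1/3 and P(y=0) = 1/3

print(covariance, p_joint, p_product)
```

The covariance is zero even though y is a function of x, because covariance only captures linear dependence.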

Describing a set of measurements
How can we summarize measured data?
Numerical data (i.e., numbers):
- Central tendency → summarizing data by a single number
- Dispersion → summarizing variability
- Histograms → determining distribution of data
- Cumulative Distribution Function → most detailed view
Categorical (symbolic) data (e.g., symbols, names, etc.):
- Histograms

Summarizing data by a single number: central tendency

Central tendency: empirical mean
The simplest description of numerical data. Given a dataset {x_i, i = 1, ..., n}, central tendency tells us where on the number line the values tend to be located.
The empirical mean (average) is the most widely used measure of central tendency. It is obtained by taking the sum of all observations and dividing this sum by the number of observations in the sample:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Central tendency: median and mode
Other very simple descriptions of numerical data:
Median (the value that divides the sorted dataset into two equal parts): obtained by sorting the observations in increasing order and taking the observation in the middle of the series. If the number of observations is even, the mean of the middle two values is used as the median.
Mode (the most common or most likely value): obtained by plotting a histogram and specifying the midpoint of the bucket where the histogram peaks.
Mean and median always exist and are unique. Mode may not exist.
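The three indices can be computed with Python's standard `statistics` module. This is a minimal sketch on a hypothetical sample; for discrete values it takes the most frequent value as the mode instead of a histogram peak.

```python
import statistics

# Hypothetical sample with an even number of observations.
sample = [2, 3, 3, 4, 5, 7]

mean = statistics.mean(sample)        # sum of observations / number of observations
median = statistics.median(sample)    # mean of the two middle values (3 and 4)
modes = statistics.multimode(sample)  # list of the most frequent value(s)

# When every value occurs equally often there is no single mode:
# multimode returns all values.
no_clear_mode = statistics.multimode([1, 2, 3])

print(mean, median, modes, no_clear_mode)
```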

Mean, median and mode: relationships

Observations
- The main problem with the mean is that it is affected more by outliers than the median and mode. A single outlier can change the mean considerably (this is particularly true for small samples).
- The median and mode are resistant to several outlying observations.
- The mean gives equal weight to each observation and in this sense makes full use of the sample. The median and mode ignore a lot of information.
- The mean has an additivity (linearity) property: the mean of a sum is the sum of the means. This does not apply to the mode or median.
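Both the outlier sensitivity and the additivity property can be checked on small hypothetical samples. The values below are made up for illustration.

```python
import statistics

# Outlier sensitivity: one extreme value moves the mean but not the median.
clean = [10, 11, 9, 10, 10]
with_outlier = clean + [1000]

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

# Additivity: mean of a sum equals the sum of the means; the median has no such property.
x = [1, 2, 6]
y = [4, 0, 2]
total = [a + b for a, b in zip(x, y)]

mean_additive = statistics.mean(total) == statistics.mean(x) + statistics.mean(y)
median_additive = statistics.median(total) == statistics.median(x) + statistics.median(y)

print(mean_clean, mean_outlier, median_clean, median_outlier)
print(mean_additive, median_additive)
```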

Selecting mean, median, and mode
Guidelines for selecting a proper index of central tendency:
- Type of variable.
- Is the total of all observations of any interest?
- For small samples, if the ratio of the maximum to the minimum of the observations is large, the data is skewed.

Examples
Examples of selecting an index of central tendency:
- Most used resource in a system: resources are categorical, hence the mode must be used.
- Interarrival time: the total time is of interest, so the mean is the proper choice.
- Load on a computer: the median is preferable because the distribution is highly skewed.

Common misuses of means
Using the mean of significantly different values:
- When the mean is the correct index of central tendency for a variable, it does not automatically follow that the mean of any set of values of that variable will be useful.
- Usefulness depends on the number of values and on the variance, not only on the type of the variable.
- Example: it is not very useful to say that the mean packet latency is 2.501 s when the two measurements are 2 ms and 5 s. An analysis based on 2.501 s would lead nowhere close to either of the two possibilities (the mean is the correct index but is useless).

Common misuses of means (Cont)
Using the mean without regard to the skewness of the distribution. Consider two systems, A and B, that both have a mean response time of 10:
- System A: it is useful to know the mean because the variance is low and 10 is the typical value.
- System B: the typical value is 5; hence using 10 for the mean does not give any useful result. The variability is too large in this case.
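The measurements behind the two systems are not reproduced in this transcript. The sketch below uses hypothetical response times chosen only to match the description (both means equal 10, typical values 10 and 5).

```python
import statistics

# Hypothetical response times; not the slide's original data.
system_a = [10, 9, 11, 10, 10]
system_b = [5, 5, 5, 4, 31]

for name, times in (("A", system_a), ("B", system_b)):
    print(
        name,
        "mean:", statistics.mean(times),
        "median:", statistics.median(times),
        "mode:", statistics.mode(times),
        "std dev:", round(statistics.stdev(times), 2),
    )
```

For system B the mean is pulled up by a single large observation, so the median or mode describes the typical response time better.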

Common misuses of means (Cont)
Multiplying means to get the mean of a product: if the variables are correlated (not independent), $E(xy) \ne E(x)E(y)$.
Taking the mean of a ratio with different bases.
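Both misuses can be reproduced with a few numbers. This is a minimal sketch on hypothetical data: two perfectly correlated variables for the product, and two ratios whose denominators (bases) differ.

```python
from fractions import Fraction


def mean(values):
    """Exact arithmetic mean."""
    return sum(values, Fraction(0)) / len(values)


# Misuse 1: multiplying means of correlated variables (here y equals x).
x = [1, 2, 3, 4]
y = [1, 2, 3, 4]
mean_of_product = mean([a * b for a, b in zip(x, y)])
product_of_means = mean(x) * mean(y)

# Misuse 2: averaging ratios with different bases, given as (numerator, base) pairs.
pairs = [(1, 2), (9, 10)]
mean_of_ratios = mean([Fraction(n, d) for n, d in pairs])
ratio_of_sums = Fraction(sum(n for n, _ in pairs), sum(d for _, d in pairs))

print(mean_of_product, product_of_means)
print(mean_of_ratios, ratio_of_sums)
```

In both cases the two results differ, so one cannot be substituted for the other.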

Geometric mean
The geometric mean of n values x_1, x_2, ..., x_n is obtained by multiplying the values together and taking the nth root of the product:
$\dot{x} = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$
The geometric mean is used if the product of the observations is the quantity of interest.
Example: the performance improvements in 7 layers. What is the average improvement per layer?
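The per-layer figures for this example did not survive in the transcript. The sketch below therefore assumes hypothetical improvements for the 7 layers and shows how the average improvement per layer would follow from the geometric mean of the improvement factors.

```python
import math

# Hypothetical per-layer performance improvements (18% is written as 0.18).
improvements = [0.18, 0.13, 0.11, 0.08, 0.10, 0.28, 0.05]

# Improvements compound multiplicatively, so work with factors 1 + improvement.
factors = [1 + r for r in improvements]

# Geometric mean: nth root of the product of the factors.
geometric_mean = math.prod(factors) ** (1 / len(factors))
average_improvement = geometric_mean - 1

# Arithmetic mean of the improvements, shown only for comparison.
arithmetic_mean = sum(improvements) / len(improvements)

print(f"average improvement per layer: {average_improvement:.1%}")
print(f"arithmetic mean of improvements: {arithmetic_mean:.1%}")
```

The geometric mean is the right choice here because the overall improvement is the product of the per-layer factors.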


Summarizing variability: dispersion

Summarizing variability
Given a data set, summarizing it by a single number is rarely enough. It is important to include a statement about its variability in any summary of the data.
Motivation: given two systems with the same mean performance, one would generally prefer the one whose performance does not vary much from the mean. Systems with low variability are preferred.
Variability is specified using one of the following measures, which are called indices of dispersion:
- Range: the minimum and maximum of the values observed
- Variance and standard deviation
- Percentiles and quantiles
- Mean absolute deviation

Example
Histograms of the response time of two systems. Both systems have the same mean response time of 2 seconds, but:
a. The response time is close to the mean.
b. The response time can be anywhere between 1 ms and 1 minute.

Range
The range of a stream of values can be easily calculated by keeping track of the minimum and the maximum:
Range = Max - Min
A larger range means higher variability.
In most cases, the range is not very useful. The minimum often comes out to be zero and the maximum comes out to be an "outlier" far from typical values. Unless the variable is bounded, the maximum goes on increasing with the number of observations, the minimum goes on decreasing with the number of observations, and there is no "stable" point that gives a good indication of the actual range.
The range is useful if, and only if, there is a reason to believe that the variable is bounded.
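Tracking the minimum and maximum of a stream needs only two variables and a single pass. A minimal sketch follows; the function name `running_range` and the sample values are illustrative assumptions.

```python
def running_range(stream):
    """Return (minimum, maximum, range) of an iterable in one pass."""
    lo = hi = None
    for value in stream:
        lo = value if lo is None else min(lo, value)
        hi = value if hi is None else max(hi, value)
    if lo is None:
        raise ValueError("stream is empty")
    return lo, hi, hi - lo


# Hypothetical response times in milliseconds.
print(running_range([31, 42, 28, 51, 19]))
```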

Variance and coefficient of variation
Variance is expressed in units that are the square of the units of the observations. The square root of the variance is the standard deviation, which is expressed in the same units as the data; for this reason the standard deviation is preferred.
Coefficient of variation: a dimensionless measure of the dispersion of a dataset, obtained by dividing the standard deviation by the mean.
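A short sketch of these three quantities on a hypothetical sample is given below. It assumes the usual sample variance with divisor n - 1; the slides do not state which divisor they use.

```python
import math

# Hypothetical sample of measurements (for example, seconds).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

mean = sum(sample) / n

# Sample variance (assumed divisor n - 1), in squared units of the data.
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Standard deviation: same units as the data.
std_dev = math.sqrt(variance)

# Coefficient of variation: dimensionless.
cov = std_dev / mean

print(round(variance, 3), round(std_dev, 3), round(cov, 3))
```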

Quantile and percentile
A more detailed description of dataset dispersion is given in terms of quantiles and percentiles.
- The pth quantile is the value below which the fraction p of the values lies.
- The median is the 0.5 quantile.
- A quantile can also be expressed as a percentile; e.g., the 90th percentile is the value that is larger than 90% of the data.
- Specifying the 5-percentile and the 95-percentile of a variable has the same impact as specifying its minimum and maximum.

Quantile calculation
The α-quantile can be estimated by sorting the observations and taking the [(n-1)α + 1]-th element in the ordered set, where [.] denotes rounding to the nearest integer.

Example
In an experiment that was repeated 32 times, the measured CPU time was found to be {3.1, 4.2, 2.8, 5.1, 2.8, 4.4, 5.6, 3.9, 3.9, 2.7, 4.1, 3.6, 3.1, 4.5, 3.8, 2.9, 3.4, 3.3, 2.8, 4.5, 4.9, 5.3, 1.9, 3.7, 3.2, 4.1, 5.1, 3.2, 3.9, 4.8, 5.9, 4.2}.
The sorted set is {1.9, 2.7, 2.8, 2.8, 2.8, 2.9, 3.1, 3.1, 3.2, 3.2, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 3.9, 3.9, 4.1, 4.1, 4.2, 4.2, 4.4, 4.5, 4.5, 4.8, 4.9, 5.1, 5.1, 5.3, 5.6, 5.9}.
10-percentile = [1 + (31)(0.10)] = [4.1] = 4th element = 2.8
90-percentile = [1 + (31)(0.90)] = [28.9] = 29th element = 5.1
First quartile Q_1 = [1 + (31)(0.25)] = [8.75] = 9th element = 3.2
Median Q_2 = [1 + (31)(0.50)] = [16.5] = 0.5(16th + 17th elements) = 0.5(3.9 + 3.9) = 3.9
Third quartile Q_3 = [1 + (31)(0.75)] = [24.25] = 24th element = 4.5
Source: http://www.cse.wustl.edu/~jain/cse567-15/
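The calculation in this example can be reproduced in code. The sketch below implements the [(n-1)α + 1] position rule with rounding to the nearest integer, and averages the two neighbouring elements when the position falls exactly halfway, as the example does for the median. The function name `quantile` is an illustrative choice.

```python
import math

cpu_times = [3.1, 4.2, 2.8, 5.1, 2.8, 4.4, 5.6, 3.9, 3.9, 2.7, 4.1,
             3.6, 3.1, 4.5, 3.8, 2.9, 3.4, 3.3, 2.8, 4.5, 4.9, 5.3,
             1.9, 3.7, 3.2, 4.1, 5.1, 3.2, 3.9, 4.8, 5.9, 4.2]


def quantile(data, alpha):
    """Estimate the alpha-quantile as the [(n-1)*alpha + 1]-th sorted element."""
    if not data or not 0 <= alpha <= 1:
        raise ValueError("need non-empty data and 0 <= alpha <= 1")
    ordered = sorted(data)
    n = len(ordered)
    pos = (n - 1) * alpha + 1            # 1-based position, may be fractional
    lower = math.floor(pos)
    if abs(pos - lower - 0.5) < 1e-9:    # exactly halfway: average both neighbours
        return 0.5 * (ordered[lower - 1] + ordered[lower])
    return ordered[int(pos + 0.5) - 1]   # otherwise round to the nearest position


for label, alpha in [("10-percentile", 0.10), ("Q1", 0.25), ("median", 0.50),
                     ("Q3", 0.75), ("90-percentile", 0.90)]:
    print(label, quantile(cpu_times, alpha))
```

Library routines often interpolate between neighbouring elements instead of rounding, so their results can differ slightly from this estimate.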

Similar measures
- Deciles: percentiles at multiples of 10%. The first decile is the 10-percentile.
- Quartiles divide the data into four parts at 25%, 50%, and 75%. The second quartile Q_2 is the median.
- The range between Q_3 and Q_1 is called the interquartile range of the data. One half of this range is called the Semi-InterQuartile Range (SIQR): $\mathrm{SIQR} = \frac{Q_3 - Q_1}{2}$
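As a worked instance, taking the quartile values 3.2 (first quartile) and 4.5 (third quartile) from the CPU-time example, the interquartile range and SIQR come out as follows:

```latex
\begin{align*}
\mathrm{IQR}  &= Q_3 - Q_1 = 4.5 - 3.2 = 1.3 \\
\mathrm{SIQR} &= \frac{Q_3 - Q_1}{2} = \frac{1.3}{2} = 0.65
\end{align*}
```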

Mean absolute deviation
Mean absolute deviation $= \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
The key advantage of the mean absolute deviation over the standard deviation is that no multiplication or square root is required.
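A minimal sketch of the computation on a hypothetical sample; only additions, absolute values, and one division per average are needed.

```python
def mean_absolute_deviation(data):
    """Average absolute distance of the observations from their mean."""
    if not data:
        raise ValueError("data is empty")
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


# Hypothetical sample: mean is 5, absolute deviations sum to 12.
print(mean_absolute_deviation([2, 4, 4, 4, 5, 5, 7, 9]))
```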

Comparison of variation measures
- Range is affected considerably by outliers.
- Sample variance is also affected by outliers, but the effect is less than that on the range.
- Mean absolute deviation is next in resistance to outliers.
- Semi-interquartile range is very resistant to outliers.
- If the distribution is highly skewed, outliers are highly likely and SIQR is preferred over standard deviation.
- In general, SIQR is used as an index of dispersion whenever the median is used as an index of central tendency.
- For qualitative (categorical) data, the dispersion can be specified by giving the number of most frequent categories that comprise the given percentile, for instance, the top 90%.
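This ordering can be observed on a small hypothetical sample by replacing one value with an outlier and comparing how much each index grows. The sketch uses `statistics.quantiles` for the quartiles, which interpolates between elements and is therefore an assumption that differs slightly from the rounding rule used in the slides.

```python
import statistics


def dispersion(data):
    """Range, sample standard deviation, mean absolute deviation, and SIQR."""
    mean = statistics.mean(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "range": max(data) - min(data),
        "std": statistics.stdev(data),
        "mad": sum(abs(x - mean) for x in data) / len(data),
        "siqr": (q3 - q1) / 2,
    }


# Hypothetical data: the second sample replaces the largest value by an outlier.
base = dispersion([1, 2, 3, 4, 5, 6, 7, 8, 9])
outlier = dispersion([1, 2, 3, 4, 5, 6, 7, 8, 90])

# Growth factor of each index caused by the single outlier.
ratios = {name: outlier[name] / base[name] for name in base}
for name, ratio in ratios.items():
    print(f"{name}: x{ratio:.2f}")
```

On this sample the range grows the most, followed by the standard deviation and the mean absolute deviation, while the SIQR does not change.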

Selecting the index of dispersion

Determining distribution of data: histograms and CDF

Histograms
Histograms give a more detailed description of measured data and make it possible to state the distribution the data follows. The distribution information is also required if the summary has to be used later in simulation or analytical modeling.
The histogram is defined in terms of a set of bins (cells or buckets) that partition the observed values. To plot a histogram:
1. Determine the maximum and the minimum of the values and divide the range into a number of sub-ranges (cells).
2. Determine the count of observations that fall into each cell.
3. Normalize the counts to cell frequencies by dividing by the total number of observations.
4. Plot the cell frequencies as a column chart.
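The construction steps can be sketched in a few lines. The example assumes equal-width cells and hypothetical integer measurements; the function name `histogram` is illustrative, and plotting is replaced by a text column chart.

```python
def histogram(data, cells):
    """Return (cell edges, counts, frequencies) for equal-width cells."""
    lo, hi = min(data), max(data)
    if lo == hi:
        raise ValueError("all values are equal; the range cannot be divided")
    width = (hi - lo) / cells
    counts = [0] * cells
    for value in data:
        # The maximum falls on the upper edge and is placed in the last cell.
        index = min(int((value - lo) / width), cells - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(cells + 1)]
    frequencies = [c / len(data) for c in counts]
    return edges, counts, frequencies


# Hypothetical measurements (for example, response times in ms).
data = [10, 12, 15, 18, 21, 22, 25, 27, 29, 31, 33, 38, 41, 47, 50, 24]
edges, counts, frequencies = histogram(data, cells=4)

for i, freq in enumerate(frequencies):
    bar = "#" * counts[i]
    print(f"[{edges[i]:.0f}, {edges[i + 1]:.0f}) {freq:.4f} {bar}")
```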

Histograms
The histogram counts how many values fall in each bin.

Histograms
A considerable problem in creating histograms is determining the cell sizes:
- Small cells lead to very few observations per cell.
- Large cells result in less variation, but the details of the distribution are completely lost.
Given a set of data, it is possible to reach very different conclusions about the distribution shape depending upon the cell size used.
One guideline is that if any cell has fewer than five observations, the cell size should be increased or a variable-cell histogram should be used.
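One possible way to apply the five-observation guideline is to keep doubling the cell size until every cell is sufficiently filled. The sketch below is an assumption about how the guideline could be automated, not a procedure given in the slides; the data and function names are hypothetical, and the data must not be all equal.

```python
def cell_counts(data, cells):
    """Counts per equal-width cell between min(data) and max(data)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / cells
    counts = [0] * cells
    for value in data:
        counts[min(int((value - lo) / width), cells - 1)] += 1
    return counts


def coarsen_until_filled(data, cells, min_count=5):
    """Halve the number of cells (double the cell size) until no cell is sparse."""
    while cells > 1 and min(cell_counts(data, cells)) < min_count:
        cells //= 2
    return cells, cell_counts(data, cells)


# Hypothetical measurements.
data = [10, 12, 15, 18, 21, 22, 25, 27, 29, 31, 33, 38, 41, 47, 50, 24]

print(cell_counts(data, 8))            # many sparse cells
print(coarsen_until_filled(data, 8))   # cells merged until each holds at least 5
```

With only 16 observations the procedure ends at two cells, which illustrates the trade-off: the cells are well filled but almost all detail of the distribution is lost.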

Cumulative distribution function (CDF)
The most detailed view of a dataset is its empirical cumulative distribution function. All the previous methods involve summarization of data in which some detail is lost; the CDF potentially provides information about each value in the dataset.
The CDF is formed by plotting, for each unique value in the dataset, the fraction of data items that are smaller than or equal to that value. In other words, one plots the quantile corresponding to each unique value in the dataset.
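An empirical CDF can be built directly from the sorted data. The sketch below counts, for each unique value, the fraction of items less than or equal to it, in line with the CDF definition P(x <= a); the sample and the function name `empirical_cdf` are illustrative.

```python
from bisect import bisect_right


def empirical_cdf(data):
    """Return a list of (value, fraction of items <= value) for each unique value."""
    ordered = sorted(data)
    n = len(ordered)
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]


# Hypothetical sample.
for value, fraction in empirical_cdf([3, 1, 2, 2, 5]):
    print(value, fraction)
```

No binning is involved: every distinct observation contributes its own step to the curve.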

CDF: example
The CDF involves no binning or averaging of data values and so provides more information to the viewer than the histogram does.

Description of categorical (symbolic) data
Most of the data descriptions above do not apply to categorical (symbolic) data.
- The dispersion can be specified by giving the number of most frequent categories that comprise a given percentile, for instance, the top 90%.
- Histograms can be used by specifying categories: one can measure the empirical probability of each symbol in the dataset.
- The histogram can be plotted in order of decreasing empirical probability.
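For categorical data these descriptions reduce to counting. The following is a minimal sketch, assuming hypothetical observations of which resource was busiest; the helper name `categories_covering` is illustrative.

```python
from collections import Counter

# Hypothetical categorical observations.
observations = ["CPU"] * 6 + ["disk"] * 3 + ["network"] * 1

counts = Counter(observations)
total = sum(counts.values())

# Categories in order of decreasing empirical probability (histogram order).
ranked = counts.most_common()
probabilities = [(symbol, count / total) for symbol, count in ranked]

# The mode is the most frequent category.
mode = ranked[0][0]


def categories_covering(percent):
    """Number of most frequent categories needed to cover `percent` of the data."""
    covered = 0
    for rank, (_, count) in enumerate(ranked, start=1):
        covered += count
        if covered * 100 >= percent * total:   # integer comparison avoids rounding
            return rank
    return len(ranked)


print(probabilities)
print(mode, categories_covering(90))
```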