Summarizing Data. Major Properties of Numerical Data

Summarizing Data
Daniel A. Menascé, Ph.D.
Dept. of Computer Science
George Mason University

Major Properties of Numerical Data
Central tendency: arithmetic mean, geometric mean, median, mode.
Variability: range, interquartile range, variance, standard deviation, coefficient of variation, mean absolute deviation.
Skewness: coefficient of skewness.
Kurtosis.

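Measures of Central Tendency

Arithmetic mean:
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Based on all observations.
Greatly affected by extreme values.

Effect of Outliers on Average

The two data sets differ only in their last value:

Data set 1    Data set 2
1.1           1.1
1.4           1.4
1.8           1.8
1.9           1.9
2.3           2.3
2.4           2.4
2.8           2.8
3.1           3.1
3.4           3.4
3.8           3.8
10.3          3.5
Average 3.1   Average 2.5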
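The sensitivity of the arithmetic mean to a single extreme value can be seen in a few lines of Python. This is a minimal sketch; the numbers are hypothetical and are not the slide's data.

```python
from statistics import mean

# Hypothetical measurements (illustrative only, not the slide's data).
typical = [1.0, 2.0, 3.0, 4.0]
with_outlier = typical + [40.0]  # same data plus one extreme value

mean_typical = mean(typical)            # (1 + 2 + 3 + 4) / 4
mean_with_outlier = mean(with_outlier)  # (10 + 40) / 5

print(mean_typical, mean_with_outlier)
```

One added observation moves the average from 2.5 to 10.0, a value larger than every one of the original measurements.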

Geometric Mean

Geometric mean:
$\left( \prod_{i=1}^{n} X_i \right)^{1/n}$
Used when the product of the observations is of interest.
Important when multiplicative effects are at play:
Cache hit ratios at several levels of cache.
Percentage performance improvements between successive versions.
Performance improvements across protocol layers.

Example of Geometric Mean

[Table: performance improvement measured in 10 tests for the operating system, middleware, and application layers, with the average performance improvement per layer for each test and over all tests.]
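The geometric mean formula above can be sketched as follows. The function name and the layer speedups are hypothetical; the logarithm form is one common way to compute an n-th root of a product without overflow, and Python's standard library also offers `statistics.geometric_mean` (Python 3.8 and later).

```python
import math

def geometric_mean(values):
    """n-th root of the product of the values, computed through logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical improvement factors for three layers (illustrative only).
layer_speedups = [2.0, 4.0, 8.0]
overall = geometric_mean(layer_speedups)  # cube root of 2 * 4 * 8 = 64

print(round(overall, 6))
```

Here the per-layer average is about 4, because applying a factor of 4 three times gives the same total product (64) as the three measured factors.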

Properties of the Geometric Mean

$gm\left(\frac{x_1}{y_1}, \ldots, \frac{x_n}{y_n}\right) = \frac{gm(x_1, \ldots, x_n)}{gm(y_1, \ldots, y_n)} = \frac{1}{gm(y_1/x_1, \ldots, y_n/x_n)}$
The choice of the base does not change the conclusion.
Useful for benchmarks:
x: throughput on target system.
y: throughput on base system.

Median

Middle value in an ordered set of data.
If there are no ties, 50% of the values are smaller than the median and 50% are larger.

Data set 1    Data set 2
1.1           1.1
1.4           1.4
1.8           1.8
1.9           1.9
2.3           2.3
2.4           2.4
2.8           2.8
3.1           3.1
3.4           3.4
3.8           3.8
10.3          3.5
Median 2.4    Median 2.4
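The ratio property of the geometric mean stated above can be checked numerically. This is a sketch with hypothetical throughput figures; `x` plays the role of the target system and `y` the base system.

```python
from statistics import geometric_mean

# Hypothetical throughputs on three benchmarks (illustrative only).
x = [120.0, 45.0, 300.0]  # target system
y = [100.0, 50.0, 200.0]  # base system

gm_of_ratios = geometric_mean([a / b for a, b in zip(x, y)])
ratio_of_gms = geometric_mean(x) / geometric_mean(y)
inverse_of_swapped = 1.0 / geometric_mean([b / a for a, b in zip(x, y)])

print(gm_of_ratios, ratio_of_gms, inverse_of_swapped)
```

All three values agree up to floating-point rounding, so normalizing to either system leads to the same conclusion about which one is faster.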

Median

The median is unaffected by extreme values.
Obtaining the median from the ordered data:
Odd-sized samples: $X_{(n+1)/2}$
Even-sized samples: $\frac{X_{n/2} + X_{(n/2)+1}}{2}$

Mode

Most frequently occurring value.
Mode may not exist.
Single-mode distributions: unimodal.
Distributions with two modes: bimodal.

[Figure: a unimodal distribution and a bimodal distribution.]
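The odd/even rule for the median and the idea that a mode may not exist can be sketched as below. The function names are hypothetical, the inputs are assumed to be non-empty, and treating "no repeated value" as "no mode" is an assumption made for this illustration.

```python
from collections import Counter

def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # X_((n+1)/2) in 1-based indexing
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of X_(n/2) and X_(n/2+1)

def modes(values):
    """All most-frequent values; an empty list when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(median([5, 1, 3]), median([4, 1, 3, 2]))
print(modes([1, 2, 2, 3, 3]), modes([1, 2, 3]))
```

A list of two values from `modes` corresponds to a bimodal sample, a single value to a unimodal one.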

Quantiles (quartiles, percentiles) and midhinge

Quartiles: split the data into quarters.
First quartile (Q1): value of Xi such that 25% of the observations are smaller than Xi.
Second quartile (Q2): value of Xi such that 50% of the observations are smaller than Xi.
Third quartile (Q3): value of Xi such that 75% of the observations are smaller than Xi.
Percentiles: split the data into hundredths.
Midhinge:
$\text{Midhinge} = \frac{Q_3 + Q_1}{2}$

Example of Quartiles

Data: 1.05, 1.06, 1.09, 1.19, 1.21, 1.28, 1.34, 1.34, 1.77, 1.80, 1.83, 2.15, 2.21, 2.27, 2.61, 2.67, 2.77, 2.83, 3.51, 3.77, 5.76, 5.78, 32.07, 144.91

Q1        1.32
Q2        2.18
Q3        3.00
Midhinge  2.16

In Excel:
Q1=PERCENTILE(<array>,0.25)
Q2=PERCENTILE(<array>,0.5)
Q3=PERCENTILE(<array>,0.75)
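Quartiles and the midhinge can be computed with Python's standard library. This is a sketch on hypothetical data; to the best of my understanding, `method="inclusive"` interpolates between ranks in the same way as Excel's PERCENTILE, but that equivalence is an assumption here rather than something the slides state.

```python
from statistics import quantiles

# Hypothetical data (illustrative only).
data = [10, 20, 30, 40]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
midhinge = (q3 + q1) / 2

print(q1, q2, q3, midhinge)
```

For this data the cut points are 17.5, 25.0, and 32.5, so the midhinge is 25.0.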

Example of Percentile

Data: 1.05, 1.06, 1.09, 1.19, 1.21, 1.28, 1.34, 1.34, 1.77, 1.80, 1.83, 2.15, 2.21, 2.27, 2.61, 2.67, 2.77, 2.83, 3.51, 3.77, 5.76, 5.78, 32.07, 144.91

80-percentile  3.613

In Excel: p-th percentile=PERCENTILE(<array>,p) (0 ≤ p ≤ 1)

Range, Interquartile Range, Variance, and Standard Deviation

Range: $X_{max} - X_{min}$
Interquartile range: $Q_3 - Q_1$, not affected by extreme values.
Variance:
$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
In Excel: s^2=VAR(<array>)
Standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
In Excel: s=STDEV(<array>)
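The percentile and variability measures above can be sketched in Python. The `percentile` helper is hypothetical and uses linear interpolation between ranks, which is assumed (not confirmed by the slides) to mirror Excel's PERCENTILE; `variance` and `stdev` from the standard library divide by n - 1, as in the formulas above.

```python
from statistics import stdev, variance

def percentile(values, p):
    """p-th percentile (0 <= p <= 1) by linear interpolation between ranks."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = p * (len(ordered) - 1)         # fractional 0-based rank
    lower = int(position)                     # rank at or just below the position
    upper = min(lower + 1, len(ordered) - 1)  # next rank, clamped to the last one
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical data (illustrative only).
data = [1.0, 2.0, 3.0, 4.0, 5.0]

data_range = max(data) - min(data)
iqr = percentile(data, 0.75) - percentile(data, 0.25)
s_squared = variance(data)  # sum of squared deviations divided by n - 1
s = stdev(data)             # square root of the sample variance

print(data_range, iqr, s_squared, s)
```

For this data the range is 4.0, the interquartile range is 2.0, and the sample variance is 2.5.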

Meanings of the Variance and Standard Deviation

The larger the spread of the data around the mean, the larger the variance and standard deviation.
If all observations are the same, the variance and standard deviation are zero.
The variance and standard deviation cannot be negative.
Variance is measured in the square of the units of the data.
Standard deviation is measured in the same units as the data.

Coefficient of Variation

Coefficient of variation (COV): $s / \bar{X}$, no units.

Data: 1.05, 1.06, 1.09, 1.19, 1.21, 1.28, 1.34, 1.34, 1.77, 1.80, 1.83, 2.15, 2.21, 2.27, 2.61, 2.67, 2.77, 2.83, 3.51, 3.77, 5.76, 5.78, 32.07, 144.91

S        29.50
Average  9.51
COV      3.10
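Because the coefficient of variation divides the standard deviation by the mean, it has no units. The sketch below illustrates this with hypothetical timings expressed in seconds and in milliseconds; the function name is an assumption.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (dimensionless)."""
    return stdev(values) / mean(values)

# The same hypothetical measurements in two different units.
times_sec = [1.0, 2.0, 3.0, 4.0, 5.0]
times_msec = [1000.0 * t for t in times_sec]

cov_sec = coefficient_of_variation(times_sec)
cov_msec = coefficient_of_variation(times_msec)

print(cov_sec, cov_msec)
```

Both calls return the same value up to rounding, whereas the standard deviation itself would differ by a factor of 1000.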

Coefficient of Skewness

Coefficient of skewness:
$\frac{1}{n s^3} \sum_{i=1}^{n} (X_i - \bar{X})^3$

[Table: each of the 24 observations $X_i$ with its cubed deviation $(X_i - \bar{X})^3$.]

Coefficient of skewness  4.033

Mean Absolute Deviation

Mean absolute deviation:
$\frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|$

Xi       abs(Xi-Xbar)
1.05     8.46
1.06     8.45
1.09     8.42
1.19     8.32
1.21     8.30
1.28     8.23
1.34     8.18
1.34     8.17
1.77     7.74
1.80     7.71
1.83     7.68
2.15     7.36
2.21     7.30
2.27     7.24
2.61     6.90
2.67     6.84
2.77     6.74
2.83     6.68
3.51     6.00
3.77     5.74
5.76     3.75
5.78     3.73
32.07    22.56
144.91   135.39
Sum      315.90

Average                  9.51
Mean absolute deviation  13.16
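Both formulas above translate directly into code. This is a sketch on hypothetical data; the function names are assumptions, and `s` is taken to be the sample standard deviation (n - 1 divisor), which differs from the bias-adjusted skewness that spreadsheet tools may report.

```python
from statistics import mean, stdev

def coefficient_of_skewness(values):
    """Sum of cubed deviations from the mean, divided by n * s**3."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return sum((x - x_bar) ** 3 for x in values) / (n * s ** 3)

def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    x_bar = mean(values)
    return sum(abs(x - x_bar) for x in values) / len(values)

# Hypothetical data with one large value on the right (illustrative only).
data = [1.0, 2.0, 3.0, 4.0, 10.0]

skew = coefficient_of_skewness(data)
mad = mean_absolute_deviation(data)

print(skew, mad)
```

The large value on the right makes the coefficient positive (right-skewed); a symmetric sample such as 1, 2, 3, 4, 5 gives zero.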

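Shapes of Distributions

[Figure: three distribution shapes with the positions of the mode, median, and mean marked.]
Right-skewed distribution: mode, median, and mean appear in that order from left to right.
Symmetric distribution: mode, median, and mean coincide.
Left-skewed distribution: mean, median, and mode appear in that order from left to right.

Confidence Interval for the Mean

The sample mean is an estimate of the population mean.
Problem: given k samples of the population (with k sample means), get a single estimate of the population mean.
Only probabilistic statements can be made: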

Confidence Interval for the Mean

$\Pr[c_1 \le \mu \le c_2] = 1 - \alpha$
where
$(c_1, c_2)$: confidence interval
$100(1 - \alpha)$: confidence level (usually 90 or 95%)
$1 - \alpha$: confidence coefficient.

Central Limit Theorem

If the observations in a sample are independent and come from the same population that has mean µ and standard deviation σ, then the sample mean for large samples has approximately a normal distribution with mean µ and standard deviation $\sigma / \sqrt{n}$.
The standard deviation of the sample mean is called the standard error.
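The standard-error claim of the Central Limit Theorem can be checked with a small simulation. This is only a sketch: the exponential population (mean 1, standard deviation 1), the sample size, the number of samples, and the seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the run is repeatable

N_PER_SAMPLE = 50   # n: observations per sample (hypothetical choice)
NUM_SAMPLES = 2000  # M: number of samples drawn (hypothetical choice)

# Exponential population with rate 1: population mean 1, standard deviation 1.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(N_PER_SAMPLE)])
    for _ in range(NUM_SAMPLES)
]

average_of_means = mean(sample_means)      # should be close to mu = 1
observed_std_error = pstdev(sample_means)  # spread of the sample means
predicted_std_error = 1.0 / sqrt(N_PER_SAMPLE)  # sigma / sqrt(n)

print(average_of_means, observed_std_error, predicted_std_error)
```

Even though the population is strongly skewed, the sample means cluster around 1 with a spread close to $\sigma / \sqrt{n}$.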

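Central Limit Theorem

Population mean = µ
Population std deviation = σ

[Figure: from a population of N values, M samples of n values each are drawn (sample 1, sample 2, ..., sample M), with sample means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_M$.]

Average of $\bar{x}_1, \ldots, \bar{x}_M$ = µ
Standard deviation of $\bar{x}_1, \ldots, \bar{x}_M$ = σ/sqrt(n)

Confidence Interval

100(1-α)% confidence interval for the population mean:
$\left( \bar{x} - z_{1-\alpha/2}\, s/\sqrt{n},\ \bar{x} + z_{1-\alpha/2}\, s/\sqrt{n} \right)$
$\bar{x}$: sample mean
s: sample standard deviation
n: sample size
$z_{1-\alpha/2}$: (1-α/2)-quantile of a unit normal variate (N(0,1)).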
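The interval formula above can be sketched with the standard library, which supplies the normal quantile through `NormalDist().inv_cdf`. The function name and the sample are hypothetical, and the normal quantile is an approximation that suits large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, alpha=0.10):
    """Return (c1, c2), the 100(1 - alpha)% interval for the population mean."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                        # sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)  # (1 - alpha/2)-quantile of N(0, 1)
    half_width = z * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical measurements (illustrative only).
sample = [4.2, 5.1, 3.9, 4.8, 5.5, 4.4, 4.9, 5.0, 4.6, 4.7]

c1, c2 = mean_confidence_interval(sample, alpha=0.10)
z_95 = NormalDist().inv_cdf(0.95)

print(round(z_95, 3), c1, c2)
```

For alpha = 0.10 the quantile is about 1.645, the value usually read from a normal table. Lowering alpha widens the interval.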

Example of Confidence Interval Computation

[Table: sample of CPU times (msec).]

sample mean   4.51
sample std    7.56
alpha         0.1
conf. level   90
1-(alpha/2)   0.95
z0.95         1.645 (from a Normal table)
c1            1.97
c2            7.04

With 90% confidence the population mean is in the interval (1.97, 7.04).

Descriptive Statistics (from Excel Analysis Pack)

From Excel: Tools > Data Analysis > Descriptive Statistics

Mean                       9.51
Standard Error             6.02
Median                     2.18
Mode                       #N/A
Standard Deviation         29.50
Sample Variance            870.15
Kurtosis                   21.65
Skewness                   4.59
Range                      143.86
Minimum                    1.05
Maximum                    144.91
Sum                        228.25
Count                      24
Confidence Level (95.0%)   12.46

Box-and-Whisker Plot

Graphical representation of data through a five-number summary.

I/O Time (msec): 8.04, 9.96, 5.68, 6.95, 8.81, 10.84, 4.26, 4.82, 8.33, 7.58, 7.24, 7.46, 8.84, 5.73, 6.77, 7.11, 8.15, 5.39, 6.42, 7.81, 12.74, 6.08

Five-number Summary
Minimum          4.26
First Quartile   6.08
Median           7.35
Third Quartile   8.33
Maximum          12.74

[Figure: box-and-whisker plot with marks at 4.26, 6.08, 7.35, 8.33, and 12.74; 50% of the data lies in the box.]
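A five-number summary is what a box-and-whisker plot draws, and it can be sketched as below. The function name and data are hypothetical. Quartile conventions differ between tools, so the quartiles from `method="inclusive"` need not match the ones a given slide or spreadsheet reports.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical measurements (illustrative only).
data = [40, 10, 30, 20]

summary = five_number_summary(data)
print(summary)
```

The box spans the first to the third quartile, which holds the middle 50% of the data, and the whiskers extend to the minimum and maximum.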