MEASURES OF DISPERSION (VARIABILITY)

POLI 300 Handout #7, N. R. Miller

While measures of central tendency indicate what value of a variable is (in one sense or other, e.g., mode, median, mean) average or central or typical in a set of data, measures of dispersion (or variability or spread) indicate (in one sense or other) the extent to which the observed values are spread out around that center: how far apart observed values typically are from each other or from some average value (in particular, the mean). Thus:

(a) if all cases have identical observed values (and thereby also all have the average value), dispersion is zero;
(b) if most cases have observed values that are quite close together (thereby also quite close to the average value), dispersion is low (but greater than zero); but
(c) if many cases have observed values that are quite far apart from many others (or from the average value), dispersion is high.

A measure of dispersion provides a summary statistic that indicates the magnitude of such dispersion and, like a measure of central tendency, is a univariate statistic.

Because dispersion is concerned with how close together or far apart observed values are (i.e., with the magnitude of the intervals between them), it should be apparent that the notion of dispersion makes sense, and measures of dispersion are defined, only for interval (or ratio) variables. (There is one exception: a very crude measure of dispersion called the variation ratio, which is defined for ordinal and even nominal variables. It will be discussed briefly in the Answers & Discussion to PS #7.)

There are two principal types of measures of dispersion: range measures and deviation measures.

Range Measures

Range measures are based on the distance between (relatively) extreme values observed in the data and are conceptually connected with the median as a measure of central tendency. (See the data illustrating Percentiles, the Median, and Ranges on the back page of Handout #6 on Measures of Central Tendency.)
The (total or simple) range is the maximum (highest) value observed in the data (the value of the case at the 100th percentile) minus the minimum (lowest) value observed in the data (the value of the case at the 0th percentile), that is, the distance or interval between the values of these two extreme cases. (Note that this may be less than the range of the possible values of the variable, since logically possible extreme values may not be observed in actual data. For example, the variable LEVEL OF TURNOUT has logically possible values ranging from 0% to 100%, but in U.S. Presidential elections, the observed values [as conventionally measured, i.e., as Total Vote for President divided by Voting Age Population] over the past 60 years or so range from a minimum of about 48% (in 1996) to a maximum of about 64% (in 1960).)

The problem with the (total or simple) range as a measure of dispersion is that it depends on the values of just two cases, and these are cases that by definition have atypical (and perhaps extraordinarily atypical) values. In particular, the range makes no distinction between a polarized distribution in which almost all observed values are close to either the minimum or maximum values and a distribution in which almost all observed values are bunched together but there are a few extreme outliers. Also, the range is undefined for theoretical distributions that are open-ended (the technical term is asymptotic), like the normal distribution (that we will take up in the next topic) or the upper end of an income distribution type of curve (see PS #5C). Therefore other variants of the range measure that do not reach entirely out to the extremes of the frequency distribution are often used in place of the total range.

The interdecile range is the value of the case that stands at the 90th percentile of the distribution minus the value of the case that stands at the 10th percentile, that is, the distance or interval between the values of these two less extreme cases. In like manner, the interquartile range is the value of the case that stands at the 75th percentile of the distribution minus the value of the case that stands at the 25th percentile. (The first quartile is the median observation among all cases that lie below the overall median and the third quartile is the median observation among all cases that lie above the overall median. In these terms, the interquartile range is the third quartile minus the first quartile.)

We have previously used a range measure in a special context. The handout on Random Sampling said the following: "Suppose the Gallup Poll takes a random sample of respondents and reports that the President's current approval rating is 62% and that this sample statistic has a margin of error of ±3%. Here is what this means: if (hypothetically) Gallup were to take a great many random samples of the same size from the same population (e.g., the American VAP on a given day), the different samples would give different statistics (approval ratings), but 95% of these samples would give approval ratings within 3 percentage points of the true population parameter."
Thus, if our data is the list of sample statistics produced by the (hypothetical) great many random samples, the margin of error specifies the range between the sample statistic that stands at the 97.5th percentile and the sample statistic that stands at the 2.5th percentile (so that 95% of the sample statistics lie within the range). Specifically (and letting P be the value of the population parameter) this range is $(P + 3\%) - (P - 3\%) = 6\%$, i.e., twice the margin of error.

Deviation Measures

Deviation measures are based on average deviations from some average value. (Recall the discussion of Deviations from the Average in Handout #6 on Measures of Central Tendency.) Since we are dealing with interval variables, we can calculate means, and deviation measures are typically based on the mean deviation from the mean value. Thus the usual deviation measures are conceptually connected with the mean as a measure of central tendency.

Suppose we have a variable X and a set of cases numbered 1, 2, ..., n. Let the observed value of the variable in each case be designated $x_1$, $x_2$, etc. Thus:

$$\text{mean of } X = \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}.$$

The deviation from the mean for a representative case i is $(x_i - \bar{x})$. If almost all of these deviations are small (if almost all cases are close to the mean value), dispersion is small; but if many of these deviations are large (if many cases are much above or below the mean), dispersion is large. This suggests we could construct a measure D of dispersion that would simply be the average (mean) of all the deviations:

$$D = \frac{(x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x})}{n} = \frac{\sum (x_i - \bar{x})}{n}.$$

But this will not work, because some of the deviations are positive and others are negative and, as we saw earlier (Handout #6, point (d) under Deviations from the Average), these positive and negative deviations necessarily balance out and add up to zero, i.e., for any distribution of observed values $\sum (x_i - \bar{x}) = 0$.

A practical way around this problem is simply to ignore the fact that some deviations are negative while others are positive by averaging the absolute values of the deviations (in effect, by ignoring the negative sign before each negative deviation):

$$MD = \frac{\sum |x_i - \bar{x}|}{n}.$$

This measure (called the mean deviation) tells us the average (mean) amount that the values for all cases deviate (regardless of whether they are higher or lower) from the average (mean) value. Indeed, this is an intuitive, understandable, and perfectly reasonable measure of dispersion, and it is occasionally used in research.

However, statisticians are mathematicians, and they dislike this measure because the formula is mathematically messy by virtue of being non-algebraic (in that it ignores negative signs). Therefore statisticians, and most researchers, use another slightly different deviation measure of dispersion that is algebraic, and that makes use of the fact that the square of any (positive or negative) number (i.e., the number multiplied by itself) other than zero is itself always positive. This formula is based on finding the average of the squared deviations; since these are all non-negative, they do not balance out.
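The two points about D and MD can be checked on a small data set. The following is a minimal Python sketch, not part of the handout: the five data values and the function names (`mean`, `deviations`, `mean_deviation`) are hypothetical, chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by the number of cases."""
    return sum(values) / len(values)


def deviations(values):
    """Deviation of each case from the mean (x_i minus the mean)."""
    m = mean(values)
    return [x - m for x in values]


def mean_deviation(values):
    """Average of the absolute deviations from the mean (MD)."""
    return sum(abs(d) for d in deviations(values)) / len(values)


# Hypothetical data, chosen so that the mean is a whole number (5).
data = [2, 4, 5, 8, 6]

devs = deviations(data)             # [-3.0, -1.0, 0.0, 3.0, 1.0]
d_measure = sum(devs) / len(data)   # the naive measure D
md = mean_deviation(data)           # the mean deviation MD

print("deviations:", devs)
print("D  =", d_measure)
print("MD =", md)
```

For these values the signed deviations cancel, so D is zero even though the data are clearly spread out, while MD is 1.6.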
The average of the squared deviations from the mean is called the variance of the variable:

$$\text{Variance of } X = \mathrm{Var}(X) = s^2 = \frac{\sum (x_i - \bar{x})^2}{n}.$$

That is, the variance is the average squared deviation from the mean. Remember from Handout #6 (point (e) under Deviations from the Average) that the average squared deviation from the mean value of X is smaller than the average squared deviation from any other value of X.

The variance is the usual measure of dispersion in statistical theory, but it has a drawback when researchers want to describe the dispersion in data in a practical way. Whatever units the original data (and its average values and its mean deviation) are expressed in, the variance is expressed in the square of those units, and thus it doesn't make much intuitive or practical sense. This can be remedied by finding the (positive) square root of the variance (which takes us back to the original units). This measure of dispersion is called the standard deviation of the variable:

$$\text{Standard Deviation of } X = SD(X) = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}.$$

In order to interpret a standard deviation, or to make a plausible estimate of the SD of some data, it is useful to think of the mean deviation because (i) it is easier to estimate the magnitude of the mean deviation and (ii) the standard deviation has approximately the same numerical magnitude as the mean deviation. More precisely, given any distribution of data, the standard deviation is never less than the mean deviation; it is equal to the mean deviation if the data is distributed in a maximally polarized fashion; otherwise the SD is somewhat larger, typically about 20-50% larger.

Sample Estimates of Population Dispersion

Random sample statistics that are percentages or averages provide unbiased estimates of the corresponding population parameters. However, sample statistics that are dispersion measures provide estimates of population dispersion that are biased (at least slightly) downward. This is most obvious in the case of the range; it should be evident that a sample range is almost always smaller, and can never be larger, than the corresponding population range. The sample standard deviation (or variance) is also biased slightly downward. (While the SD of a particular sample can be larger than the population SD, sample SDs are on average slightly smaller than the corresponding population SDs.)

However, the sample variance can be adjusted to provide an unbiased estimate of the population variance, and the SD computed from it is a much less biased (though still not exactly unbiased) estimate of the population SD. This adjustment consists of dividing the sum of the squared deviations by n - 1, rather than by n. (Clearly this adjustment makes no practical difference unless the sample is quite small. Notice that if you apply the SD formula in the event that you have just a single observation in your sample, i.e., n = 1, it must give SD = 0 regardless of what the observed value is. More intuitively, you can get no sense of how much dispersion there is in a population with respect to some variable until you observe at least two cases and can see how far apart they are.)
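The downward bias of the n divisor, and the effect of the n - 1 divisor, can be verified exactly on a toy example by listing every possible sample. The sketch below is only an illustration under stated assumptions: the three-value population, the sample size of 2, sampling with replacement, and the helper name `variance` are all hypothetical choices, not part of the handout. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor_offset=0):
    """Sum of squared deviations from the mean, divided by (n - divisor_offset).

    divisor_offset=0 gives the n divisor used in this handout;
    divisor_offset=1 gives the adjusted n - 1 divisor.
    """
    n = len(values)
    m = Fraction(sum(values), n)
    sum_sq = sum((x - m) ** 2 for x in values)
    return sum_sq / (n - divisor_offset)


# Hypothetical tiny population; its variance uses the full-population divisor.
population = [2, 4, 6]
pop_var = variance(population)

# Every possible ordered sample of size 2, drawn with replacement (9 samples).
samples = list(product(population, repeat=2))

# Average of the sample variances over all possible samples, for each divisor.
avg_var_n = sum(variance(s) for s in samples) / len(samples)
avg_var_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print("population variance:         ", pop_var)
print("average sample variance (n): ", avg_var_n)
print("average sample variance (n-1):", avg_var_n_minus_1)
```

In this toy case the n divisor gives sample variances that average 4/3, half the population variance of 8/3, while the n - 1 divisor averages exactly 8/3. The gap is this large only because the samples are as small as they can be (n = 2); it shrinks quickly as the sample size grows.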
Because of this adjustment, you will often see the formula for the variance and SD with an n - 1 divisor (and scientific calculators often build in this formula). However, for POLI 300 problem sets and tests, you should use the formula with the n divisor given in the Deviation Measures section of this handout.

Dispersion in Ratio Variables

Given a ratio variable (e.g., income), the interesting dispersion question may pertain not to the interval between two observed values or between an observed value and the mean value but to the ratio between the two values. (For example, one household poverty level is defined as one half the median household income, and households with more than twice the median income are sometimes characterized as well off. The average compensation of CEOs today is about 250 times that of the average worker, whereas 50 years ago it was only about 40 times that of the average worker.) The degree of dispersion in ratio variables can naturally be referred to as the degree of inequality.

One ratio measure of dispersion/inequality is the coefficient of variation, which is simply the standard deviation divided by the mean. Another is the Gini Index of Inequality, which is based on a comparison between the actual cumulative distribution when cases are rank-ordered from lowest to highest value (e.g., from poorest to richest) and the cumulative distribution that would exist if all cases had the same value.

How to Compute a Standard Deviation

The formula for the standard deviation is:

$$SD(X) = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}.$$

Here is how to use the formula.

1. Set up a worksheet with three columns: the values, the deviations from the mean, and the squared deviations.
2. In the first column, list the values of the variable X for each of the n cases. (This is the raw data.)
3. Find the mean value of the variable in the data, by adding up the values in each case and dividing by the number of cases.
4. In the second column, subtract the mean from each value to get, for each case, the deviation from the mean. Some deviations are positive, others negative, and (apart from rounding error) they must add up to zero; add them up as an arithmetic check.
5. In the third column, square each deviation from the mean, i.e., multiply the deviation by itself. Since the product of two negative numbers is positive, every squared deviation is nonnegative, i.e., either positive or zero (in the event a case has a value that coincides with the mean value).
6. Add up the squared deviations over all cases.
7. Divide the sum of the squared deviations by the number of cases; this gives the average squared deviation from the mean, commonly called the variance.
8. The standard deviation is the (positive) square root of the variance. (The square root of x is that number which when multiplied by itself gives x.)
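The worksheet steps above can be sketched in a few lines of Python. This is a minimal illustration, not the handout's own worksheet: the data values and the function name `sd_worksheet` are hypothetical, and the divisor is n, as the handout specifies for POLI 300.

```python
import math


def sd_worksheet(values):
    """Follow the worksheet steps and return the rows plus summary figures."""
    n = len(values)
    mean = sum(values) / n                    # step 3: mean of the raw data
    rows = []
    for x in values:                          # step 2: first column, raw values
        dev = x - mean                        # step 4: deviation from the mean
        rows.append((x, dev, dev ** 2))       # step 5: squared deviation
    sum_dev = sum(row[1] for row in rows)     # step 4 check: should be zero
    sum_sq = sum(row[2] for row in rows)      # step 6: sum of squared deviations
    variance = sum_sq / n                     # step 7: divide by number of cases
    sd = math.sqrt(variance)                  # step 8: positive square root
    return rows, mean, sum_dev, variance, sd


# Hypothetical raw data for five cases.
data = [2, 4, 5, 8, 6]
rows, mean, sum_dev, variance, sd = sd_worksheet(data)

# Print the three worksheet columns, then the summary figures.
print(f"{'x':>6} {'x - mean':>10} {'(x - mean)^2':>14}")
for x, dev, sq in rows:
    print(f"{x:>6} {dev:>10.2f} {sq:>14.2f}")
print(f"mean = {mean}, sum of deviations = {sum_dev}")
print(f"variance = {variance}, SD = {sd}")
```

For these five values the mean is 5, the deviations add up to zero, the squared deviations add up to 20, the variance is 4, and the standard deviation is 2.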
