TOPIC 6 MEASURES OF VARIATION

Data Analysis Course, Secretariat of the Pacific Community

"If people's eyes tend to blank out tables of figures, you can be darn sure that they blank out the small writing that goes around them." Alan Graham, 1994

The concept of variation

sometimes averages aren't enough
A measure of the average value can provide a lot of useful information about a set of observations, but in many cases it is not sufficient to tell us everything about the variable. Consider, for example, Figure 6.1 below:

Figure 6.1 Comparison of Two Distributions (Population A and Population B)

the distributions are different yet give the same averages
While the two distributions shown have the same average values, whether measured as a mean, a median, or a mode, we could not say that the distributions were the same. To describe and compare them we need additional information; we need alternative ways of describing the distributions. After the average value, the next most important property of the distribution that we need to measure is the variability of the distribution. From Figure 6.1 we can see that distribution B is much more variable (or spread out) than distribution A. In this section we shall look at different ways of measuring variability.

actual level of variability
We want to measure variability for two main reasons. Firstly, we may be interested in the actual level of variability and in comparing this with another distribution. If we are looking at income distributions, for example, then the government may be interested not only in the average income level, but also in the variability of income level between people and also between different regions of a country. Many policies are designed to help redistribute income from the richest to the poorest (thereby reducing the variability of income levels), and so we would need to measure variability to see if it changes over time.

variability due to sample variation
The second reason for wanting to measure variability is when we use sampling to compare population values. We then need to take variability into account. We want to be able to distinguish between differences that might have just happened by chance (that is, in the selection of the samples) and those that indicate some real change.

example
Let us look at an example where we are comparing two population distributions.

Figure 6.2 Comparison of Income Levels of Two Populations (Population A and Population B; relative frequency plotted against income)

variability is not necessarily reflected in averages
Population A represents the distribution of annual income per household in one region and Population B represents the distribution of annual income of households in another region. Both have the same mean level of income of $1,800 per year, but we cannot say that the two distributions are the same. The distribution of income of Population B is far more spread out than Population A. It also has, therefore, a greater degree of variability.

two different measures of variability
It is clear that we should not only compare measures of location when looking at populations, but also measures of variability. In this topic we shall consider two different measures of variability, which are basically of two types:
a. measures of the distance between representative values of the population; and
b. measures of the distance of every unit of the population from some specified central value.

range and standard deviation for ungrouped data
As examples of these measures of variability we shall look in this topic at the range and the standard deviation (or variance) for ungrouped data. More complicated techniques (such as finding the standard deviation when the observations are grouped in a frequency distribution) are covered in advanced training.
The range

largest - smallest
The simplest way to measure the variation or spread, given a set of observations, is to calculate the range. The range of a set of observations is defined to be the difference between the smallest and the largest values in the set. This is very simple to understand and easy to calculate and so has an obvious appeal. It is used in practice, but is only really useful when the variable under consideration has a fairly even spread of values over the range. It has some obvious drawbacks which tend to restrict its use in practice; some of the more important disadvantages are:

disadvantages
a. because the range is the difference between the largest and the smallest value, it is very sensitive to very large or very small observations. The inclusion of just one freak (that is, rare or unusual) value will greatly affect the range;
b. the range is dependent on the number of observations. Increasing the number of observations can only increase the range; it can never make it less. This means that it is difficult to compare ranges for two distributions with different numbers of observations;
c. while the range is very easy to calculate, it has the disadvantage that it ignores all the data in between the highest and the lowest values. If, for example, we consider the following three sets of data:
   Set 1: 3, 5, 7, 9, 11, 13, 15, 17, 17, 17, 17
   Set 2: 3, 5, 5, 5, 17, 17, 17, 17, 17, 17, 17
   Set 3: 3, 6, 7, 8, 10, 11, 14, 14, 15, 16, 17
   we see that the range for all three sets is the same (17 - 3 = 14), but the degree of variation is by no means the same;
d. it is difficult to calculate the range for data grouped in a frequency distribution. All we can really do is take the difference between the lower limit of the first class and the upper limit of the last class. This would obviously depend on the definitions of the classes, and is impossible if you have an open-ended class. However, some judgments can be made depending on the knowledge of the subject matter under observation. For practical purposes the open-ended classes are usually closed by guessing a value for the open end.

example
Let us consider another example, the values of imports in various Pacific island countries in 1995.
Table 6.1 Total imports by country, 1995 (in thousand AUD)

Country              Value of imports (A$'000)
Cook Islands         65,363
Fiji                 1,17,05
Kiribati             47,547
Marshall Islands     100,073
Papua New Guinea     1,741,935
Samoa                16,689
Solomon Islands      4,54
Tonga                98,047
Tuvalu               12,535
Vanuatu              14,51

Source: SPESS 14, 1998, Pacific Community, Noumea

method
The range of the import values is the difference between the largest and the smallest value and in this case the range is:

Range = $(1,741,935,000 - 12,535,000) = $1,729 million

do not usually calculate range for grouped data
The range from a grouped frequency distribution is not usually calculated because of the reasons given in the section on disadvantages of the range. However, it can be obtained approximately by taking the difference between the upper limit of the last class and the lower limit of the first class. We must note that it can sometimes be very difficult and at times meaningless if either or both of these classes are open-ended. Let us consider once again the example of annual household cash income in two regions of a country, which are given in the following frequency distributions:

Table 6.2 Comparison of the range of income of two regions: Annual Household Cash Income

Region A
Income ($)                        Frequency (no. of households)
Less than 500 (200*)              137
500 - 999                         78
1,000 - 1,499                     406
1,500 - 1,999                     331
2,000 - 4,999                     188
5,000 - 9,999                     59
10,000 - 19,999                   138
20,000 & over (30,000*)           14
Total                             1,751

Region B
Income ($)                        Frequency (no. of households)
Less than 1,000 (500*)            86
1,000 - 1,999                     137
2,000 - 2,999                     64
3,000 - 3,999                     47
4,000 - 6,999                     130
7,000 - 9,999                     62
10,000 & over (20,000*)           88
Total                             614

Source: Table 5.1 (illustrative data only)
* = Assumed limits

open-ended class intervals
Obviously we cannot calculate the range of income in such cases because of the presence of open-ended class intervals at both ends. However, if we do have to calculate income ranges for the two populations, we will be forced to make some assumptions. These assumptions may be well-founded or ill-founded, but nevertheless, if a decision has to be made, we will have to put some values in the open-ended classes.
In the example above, the assumed values are:

Assumed Values ($)
                              Region A     Region B
Lower limit (first class)     200          500
Upper limit (last class)      30,000       20,000

range
The income ranges for the distributions may now be calculated as follows:

Region A: Income Range = $(30,000 - 200) = $29,800
Region B: Income Range = $(20,000 - 500) = $19,500

clearly state assumptions
In the above example, although we have derived the income ranges as $29,800 for region A and $19,500 for region B, the ranges could be meaningless if it was later realised that the assumed values were incorrect. However, statisticians and planners are often confronted with such problems in their everyday work, and decisions such as those taken in the case of the ranges of income in regions A and B are the types of decisions which they have to live with. The important thing is that the assumptions applied to generate a result are clearly stated.

use points other than the highest and lowest
We can get around most of the problems of the range as a measure of variation by using other points in the distribution rather than the two extreme points. One choice would be to measure what we call the quartile deviation or the semi-interquartile range (that is, half the difference between the upper and lower quartiles). For a discussion of upper and lower quartiles refer to Topic 5, More on measures of location. The quartile deviation is not included in these notes, but is covered in the advanced analysis course.

use percentiles
Another alternative is to use the difference between, say, the 10th and the 90th percentile (that is, those values for which 10 per cent and 90 per cent of the observed values are below). As measures of variation, both of these are quite useful. They are not affected by any one or two extreme or rare observations, they are less dependent on the number of observations, and they will tend to differentiate between different sets of observations. In the case of ungrouped frequency distributions, we can nearly always calculate these values. In the case of grouped frequency distributions, a problem occurs when one of the percentile or quartile values falls in an open-ended class.

Standard deviation

standard deviation as a measure of spread
Although the range is a simple measure of variation or spread, it has many disadvantages. We therefore need a measure which will overcome these disadvantages while still providing a good measure of variation. One method is the mean deviation, where we measure the distance of observations from the mean. However, the mean deviation incorporates absolute values and these are difficult to deal with mathematically.
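The contrast between the range and the quartile- and percentile-based measures can be sketched in Python, using the three data sets from the discussion of the range. This is only an illustrative sketch: the function names are hypothetical, and `statistics.quantiles` with `method="inclusive"` is just one of several conventions for locating quartiles and percentiles, so other software may give slightly different cut points.

```python
from statistics import quantiles

# The three data sets used in the discussion of the range.
set_1 = [3, 5, 7, 9, 11, 13, 15, 17, 17, 17, 17]
set_2 = [3, 5, 5, 5, 17, 17, 17, 17, 17, 17, 17]
set_3 = [3, 6, 7, 8, 10, 11, 14, 14, 15, 16, 17]


def value_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)


def quartile_deviation(data):
    """Half the distance between the upper and lower quartiles."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    return (q3 - q1) / 2


def percentile_spread(data, low=10, high=90):
    """Distance between two percentiles (10th and 90th by default)."""
    # quantiles(..., n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(data, n=100, method="inclusive")
    return cuts[high - 1] - cuts[low - 1]


for name, data in [("Set 1", set_1), ("Set 2", set_2), ("Set 3", set_3)]:
    print(name, value_range(data), quartile_deviation(data), percentile_spread(data))
```

Under these assumptions the range is 14 for every set, whereas the quartile deviation differs from set to set, which is the behaviour the text describes.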
The standard deviation is based on the same principles as the mean deviation, but in this case we eliminate the signs of the deviations from the mean by squaring them.

method
How does the standard deviation work? Like the mean, the standard deviation takes all the observed values into account. If there were no dispersion at all in a distribution, all the observed values would be the same. The mean would also be the same as this repeated value. So if everyone had the same height of 180cm, the mean would be 180cm. No observed value would deviate or differ from the mean. But, with dispersion, the observed values do deviate from the mean, some by a lot, some by only a little. Quoting the standard deviation of a distribution is a way of indicating a kind of average amount by which all the values deviate from the mean. The greater the dispersion, the bigger the deviations and the bigger the standard deviation.

principle of standard deviation
The standard deviation is found by adding the squares of the deviations of the individual values from the mean of the distribution, dividing this sum by the number of items in the distribution, and then finding the square root of this number. Let's now explain the procedure for calculating the standard deviation in more detail. In terms of a population consisting of N values \( x_1, x_2, x_3, \ldots, x_N \) with a mean \( \mu \) (pronounced mu or mew), the standard deviation of a population is defined as:

Formula

\[ \text{Standard Deviation } (\sigma) = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]

Definition
To describe the formula we will work through the steps to calculate the standard deviation.

First we calculate \( \mu \): \( \mu \) is calculated the same way as \( \bar{x} \) in the previous chapter (i.e. we add up all the numbers and divide by how many numbers there were). We call it \( \mu \) when we are dealing with a population, rather than \( \bar{x} \) when it is a sample.

We subtract \( \mu \) from each x value: \( (x_i - \mu) \)

Square each of these values: \( (x_i - \mu)^2 \)

Sum these values to get the total: \( \sum_{i=1}^{N} (x_i - \mu)^2 \)

Divide by the number of units in the population (N): \( \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \)

Take the square root of everything: \( \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \)

The standard deviation of a population is denoted by \( \sigma \) (the Greek letter for small sigma).

variance
The square of the standard deviation is called the variance and is denoted by \( \sigma^2 \). When we square the result of a formula which has a square root, the square root sign is cancelled out and disappears. We then have:

formula

\[ \text{Variance } (\sigma^2) = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

sample variance
If we are dealing with a sample and wish to calculate the sample variance (or sample standard deviation) in order to estimate the value for the population, the formula is changed slightly. In this case \( s^2 \) stands for the sample variance, \( \bar{x} \) the sample mean, and n the sample size. The formula for the sample variance is then:

\[ \text{Sample Variance } (s^2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

The sample standard deviation is given by:

\[ \text{Sample Standard Deviation } (s) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

sample = (n - 1)
These formulae for samples are effectively the same as those for populations, except that we have used the divisor (n - 1) instead of N. The important thing to remember is that when calculating the variance or standard deviation of a sample, divide by (n - 1). When calculating the variance or standard deviation of a population, divide by N.

large standard deviation = large spread
Note that the more the values of individual items differ from the mean, the greater will be the square of these differences and therefore the greater the sum of squares. Therefore, the greater the sum of squares, the larger s (the standard deviation) will be. Hence, the greater the dispersion, the larger the standard deviation will be.

example
We will now go through the calculation of the standard deviation using the following data.

Table 6.3: 2000 Secondary School Enrolment by Province, PNG

Province     Enrolments     Deviation from mean     Deviations squared
Western             961                  -2,470              6,100,900
Gulf              1,523                  -1,908              3,640,464
NCD               4,854                   1,423              2,024,929
Central           3,344                     -87                  7,569
Oro               3,134                    -297                 88,209
SHP               1,682                  -1,749              3,059,001
EHP               5,768                   2,337              5,461,569
Simbu             6,182                   2,751              7,568,001
             Mean = 3,431                     0             27,950,642

Source: Illustrative data only

first find the mean
To calculate the standard deviation we first calculate \( \bar{x} \):

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{961 + 1{,}523 + 4{,}854 + 3{,}344 + 3{,}134 + 1{,}682 + 5{,}768 + 6{,}182}{8} = 3{,}431 \]

In Column 3 we subtract the mean value from the value for each province. In Column 4 we square the deviations and sum these squared deviations, giving a total of 27,950,642.
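The column-by-column calculation in Table 6.3 can be sketched in Python. This is a minimal sketch that assumes the enrolment figures shown in the table; the variable names are illustrative and are not part of the original notes.

```python
# Enrolments by province, as in Table 6.3 (illustrative data only).
enrolments = {
    "Western": 961, "Gulf": 1523, "NCD": 4854, "Central": 3344,
    "Oro": 3134, "SHP": 1682, "EHP": 5768, "Simbu": 6182,
}

n = len(enrolments)
mean = sum(enrolments.values()) / n  # total of the observations divided by n

# Column 3: deviation of each value from the mean.
deviations = {province: x - mean for province, x in enrolments.items()}

# Column 4: each deviation squared, then summed.
squared = {province: d ** 2 for province, d in deviations.items()}
sum_of_squares = sum(squared.values())

for province, x in enrolments.items():
    print(f"{province:<8} {x:>6,} {deviations[province]:>8,.0f} {squared[province]:>12,.0f}")
print(f"mean = {mean:,.0f}, sum of squared deviations = {sum_of_squares:,.0f}")
```

The deviations themselves sum to zero, which is a useful check that the mean has been calculated correctly.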

data from a population
If the above data are considered to be from a population, then to derive the standard deviation we divide the sum of the squared deviations by the number of observations (N = 8) and take the square root. In this case we have:

\[ \text{Population Standard Deviation } (\sigma) = \sqrt{\frac{27{,}950{,}642}{8}} = \sqrt{3{,}493{,}830.25} = 1{,}869.18 \]

data from a sample
However, if the data are considered to be a sample from a population, then to derive the standard deviation we divide the sum of the squared deviations by one less than the number of observations (n - 1 = 7) and take the square root. In this case we have:

\[ \text{Sample Standard Deviation } (s) = \sqrt{\frac{27{,}950{,}642}{7}} = \sqrt{3{,}992{,}948.86} = 1{,}998.24 \]

In this example we would probably consider the data to be sample data, so would divide by 7.

awkward with a large set of numbers
Although this is a fairly simple procedure to calculate the standard deviation of a small set of numbers, it is quite a cumbersome procedure for a large set of numbers. First of all we have to determine the mean of the set, then calculate the deviations of each observation from the mean, square these and add them up. Even with the aid of a calculator the operations take quite a lot of time. It is best to use a computer to perform the calculations.

rearrange the formula
We can, however, make the calculation much easier by rearranging the formula for the variance. Thus, for a sample, we have:

Sample formula

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \]

Population formula

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N} \]

steps for sample variance
Let's run through this formula for the sample variance.

For the sample variance we first square each individual x value: \( x_i^2 \)

We then calculate the sum of those squared numbers: \( \sum_{i=1}^{n} x_i^2 \). Call this total A.

We also calculate the total of the individual x values: \( \sum_{i=1}^{n} x_i \)

We square this total: \( \left(\sum_{i=1}^{n} x_i\right)^2 \)

and divide by n (the number in the sample): \( \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \). Call this total B.

We then take A - B and divide by (n - 1):

\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \]

not as complicated as it looks
Although the second formula looks more complicated, it is in fact much easier to use when we are using a calculator. For example, let us consider the following sample values, which are the same observations that we had considered in Table 6.3.

example
961   1,523   4,854   3,344   3,134   1,682   5,768   6,182

total and the mean of the observations
a. Calculating the variance of the sample the first way would entail firstly obtaining the total and the mean of the observations. We have:

\( \sum x_i = 27{,}448 \), n = 8, \( \bar{x} = 3{,}431 \)

The deviations from the mean are:

-2,470   -1,908   1,423   -87   -297   -1,749   2,337   2,751

The sum of the squares of the deviations is 27,950,642. Thus, the variance is:

\( s^2 = 27{,}950{,}642 / 7 = 3{,}992{,}948.86 \)

second method
b. Calculating the variance using the second method or formula we need:

\( \sum x_i = 27{,}448 \) and \( \sum x_i^2 = 122{,}124{,}730 \)

\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{122{,}124{,}730 - \frac{(27{,}448)^2}{8}}{7} = \frac{122{,}124{,}730 - 94{,}174{,}088}{7} = 3{,}992{,}948.86 \]

second method is easier and faster
Thus we see that if we use the memory function in a calculator, the second calculation can be done without having to write any intermediate results. You will also note that the variance derived using either of the two methods is the same (3,992,948.86), except that the second method is easier and faster.
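The agreement between the two methods can be checked with a short Python sketch. It assumes the eight sample values from Table 6.3; the names `A` and `B` follow the totals named in the steps above, and the other variable names are illustrative.

```python
from statistics import variance

values = [961, 1523, 4854, 3344, 3134, 1682, 5768, 6182]
n = len(values)

# First method: deviations from the mean, squared and summed.
mean = sum(values) / n
s2_first = sum((x - mean) ** 2 for x in values) / (n - 1)

# Second method: total A (sum of squares) and total B (squared total over n).
A = sum(x * x for x in values)
B = sum(values) ** 2 / n
s2_second = (A - B) / (n - 1)

print(f"A = {A:,}  B = {B:,.0f}")
print(f"first method:  {s2_first:,.2f}")
print(f"second method: {s2_second:,.2f}")
print(f"statistics.variance: {variance(values):,.2f}")
```

One caution when moving from a calculator to a computer: with floating-point data whose mean is very large relative to its spread, subtracting B from A can lose precision, so library routines such as `statistics.variance` are usually the safer choice.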

Properties of the standard deviation

remember
When using the standard deviation it is important to remember the following points:
- the standard deviation is used only to measure the spread about the mean;
- the standard deviation is never negative;
- the standard deviation is sensitive to extreme values (called outliers). A single outlier can raise the standard deviation a great deal, distorting the picture of spread; and
- the greater the spread, the greater the standard deviation.

Coefficient of variation

the mean adds meaning to the standard deviation
The standard deviation by itself is not very meaningful unless it is considered along with the arithmetic mean. For example, a standard deviation of $100 when the mean income is $10,000 implies a much greater relative variation than a standard deviation of $100 for a mean GDP figure of $10,000,000. Also, comparing the variability of two populations with different units of measurement (for example, income levels in Papua New Guinea (Kina) and Vanuatu (Vatu)) can be very difficult.

interested in variation from the mean
Hence, the variability in a set of observations can usefully be measured relative to a central measure such as the arithmetic mean. Such a measure is provided by the coefficient of variation, which is the ratio of the standard deviation to the arithmetic mean, usually expressed as a percentage, and is given by the formula:

formula

\[ \text{Coefficient of Variation (C.V.)} = \frac{\sigma}{\bar{x}} \times 100 \]

(The 100 converts the number to a percentage.)

can compare data
To compare the variability of two sets of figures would therefore involve comparing their respective coefficients of variation. The coefficient of variation allows for comparisons when:
- the means of the distributions being compared are far apart, or
- the data are in different units.

percentage
The units are converted to a common denominator (a percent).

example
If we look at the data in Table 6.3, we can calculate the coefficient of variation as:

C.V. = (\( \sigma / \bar{x} \)) * 100 = 1,869.18 / 3,431 * 100 = 54.48%

illustrative example
Let's use some made-up data to illustrate the coefficient of variation. The mean income of homeowners in Australia is $40,000 with a standard deviation of $4,000. In Kiribati, the mean income of homeowners is $12,000 with a standard deviation of $1,200. (Note that the means are far apart and the standard deviations are different.) Compare and interpret the relative dispersion in the two groups of incomes.

solution
The first impulse is to say that there is more dispersion in the incomes in Australia because the standard deviation is greater. However, when we convert the two measurements to relative terms using the coefficient of variation, we find that the relative dispersion is the same.

Australia: CV(Australia) = (\( \sigma / \bar{x} \)) * 100 = $4,000 / $40,000 * 100 = 10%
Kiribati: CV(Kiribati) = (\( \sigma / \bar{x} \)) * 100 = $1,200 / $12,000 * 100 = 10%

similar CV
In summary, the incomes for both Australia and Kiribati have a similar amount of relative variation.

example
We could also compare two different types of data: incomes and ages of homeowners. We could compare the spread of incomes of homeowners in Australia with, say, the spread of the ages of homeowners. The mean age of homeowners is 40 years with a standard deviation of 10 years.

age
CV(age) = (\( \sigma / \bar{x} \)) * 100 = (10 / 40) * 100 = 25%
CV(income) = 10%

We can see that there is greater relative dispersion in the ages of the homeowners than in their incomes.

Normal distribution

used extensively
A particular distribution that is used extensively in statistical theory is the normal distribution.

properties
The normal distribution has several key properties:
- it is symmetrical;
- it is bell shaped;
- the mean of the distribution is the peak; and
- the area under the curve is always 1.

always have the four
Normal distributions can have different means and standard deviations, but they always have these four key properties.

everyday examples
Many phenomena in everyday life can be described by the normal curve, for example people's height. A small number of people in the population are very short, a small number are very tall, and the majority of the population fall in some middle range. Many other phenomena are also normally distributed, for example test scores and weights of people. We could discuss the normal distribution extensively, but for now that is all you need to know.

Reference Ranges for a Standard Deviation

analysis of data normally distributed
When analysing normally distributed data, the standard deviation is used with the mean to calculate where the data lie within certain reference ranges. The most important thing to understand about reference ranges is that for any set of normally distributed data:

reference ranges
- about 68% of the data lie in the interval \( \bar{x} - s < x < \bar{x} + s \) (that is, 68% of the data lie in the range from the mean minus the standard deviation to the mean plus the standard deviation);
- about 95% of the data lie in the interval \( \bar{x} - 2s < x < \bar{x} + 2s \);
- about 99% (more precisely, 99.7%) of the data lie in the interval \( \bar{x} - 3s < x < \bar{x} + 3s \),

where \( \bar{x} \) = the mean and s = the standard deviation.

68% reference range
If we look at the data in Table 6.3, we can calculate the 68% reference range for the data as:

68% Reference range: (\( \bar{x} - s \), \( \bar{x} + s \))
(3,431 - 1,998.24, 3,431 + 1,998.24)
(1,432.76, 5,429.24)

That is, about 68% of the data lie in the range 1,432.76 to 5,429.24.

95% reference range
We can calculate the 95% reference range as:

95% Reference range: (\( \bar{x} - 2s \), \( \bar{x} + 2s \))
(3,431 - 2(1,998.24), 3,431 + 2(1,998.24))
(3,431 - 3,996.48, 3,431 + 3,996.48)
(-565.48, 7,427.48)

That is, about 95% of the data lie in the range -565.48 to 7,427.48.

Summary of the measures of variability

RANGE
- is easily calculated, except for frequency distributions, and is well understood;
- is based on the two extreme observations and is thus very unstable;
- is difficult to manipulate mathematically;
- provides no information about the general behaviour of the distribution;
- should only be used as a rough guide to the level of variability.

VARIANCE/STANDARD DEVIATION
- is a measure of variability using information from every observation;
- with some manipulation, the calculations are reasonably straightforward;
- has a central role in mathematical and statistical theory and is very widely used;
- can be affected by extreme values;
- is the most commonly used measure of variability.

COEFFICIENT OF VARIATION
- is independent of the units of observations. Therefore, it is useful in comparing distributions where the units of observations are different;
- a disadvantage of the coefficient of variation is that it is unstable when the arithmetic mean is close to zero.
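The coefficient of variation and reference range calculations worked through above can be sketched in Python. This is a minimal sketch under stated assumptions: the helper functions `coefficient_of_variation` and `reference_range` are hypothetical names introduced for illustration, and the data are the Table 6.3 enrolments and the made-up income and age figures from the examples.

```python
from statistics import mean, pstdev, stdev


def coefficient_of_variation(sd, centre):
    """Standard deviation as a percentage of the mean."""
    return sd / centre * 100


def reference_range(centre, sd, k=1):
    """Interval from k standard deviations below the mean to k above it."""
    return (centre - k * sd, centre + k * sd)


values = [961, 1523, 4854, 3344, 3134, 1682, 5768, 6182]
x_bar = mean(values)
sigma = pstdev(values)  # population standard deviation (divide by N)
s = stdev(values)       # sample standard deviation (divide by n - 1)

cv_table = coefficient_of_variation(sigma, x_bar)
cv_australia = coefficient_of_variation(4_000, 40_000)
cv_kiribati = coefficient_of_variation(1_200, 12_000)
cv_age = coefficient_of_variation(10, 40)

range_68 = reference_range(x_bar, s, k=1)
range_95 = reference_range(x_bar, s, k=2)

print(f"CV (Table 6.3) = {cv_table:.2f}%")
print(f"CV Australia = {cv_australia:.0f}%, Kiribati = {cv_kiribati:.0f}%, age = {cv_age:.0f}%")
print(f"68% range: {range_68[0]:,.2f} to {range_68[1]:,.2f}")
print(f"95% range: {range_95[0]:,.2f} to {range_95[1]:,.2f}")
```

Because the sketch does not round s before multiplying, the 95% limits may differ from the hand calculation by about 0.01.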

One final characteristic of a distribution

Understand the underlying structure
The objective of summarising a set of data is to make it possible to comprehend the underlying structure and pattern of the distribution of the values of the variable under consideration. In summarising the data we try to reduce them to a few measures which give an indication of the central values, the variation of the values, the concentration of the frequencies and the shape of the distribution. The frequency distribution describes the population we are considering, and the measures of location and variation help us to characterise the distribution by simple measures.

Skewed distributions are asymmetrical
Another way of characterising a distribution is to study its skewness (that is, whether the distribution is symmetrical and, if not, whether the observations are concentrated in the low or the high values). Examples of skewed distributions are income, land holding size and household size. For such distributions, one is interested in finding out the type of skewness: whether there are more units with low values than units with high values, or more units with high values than units with low values.

Right tail
A distribution is said to be positively skewed if the large frequencies are concentrated at the left of the distribution and the small frequencies lie to the right (that is, the distribution has a right tail and has more low values than high values).

Left tail
A distribution is said to be negatively skewed if the large frequencies are concentrated at the right of the distribution and the small frequencies lie to the left (that is, the distribution has a left tail and has more high values than low values).

Three main features
A distribution can be considered to have three main features which are of interest in studying a population. These features are:
1 its central values;
2 its variation from the central values;
3 whether the distribution is symmetric about the central values and, if it is not symmetric, whether it is leaning to the left or right.

Topic 6-138 Secretariat of the Pacific Community
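The direction of skewness described above can also be checked numerically. The sketch below is an illustration only: these notes do not define a skewness coefficient, so the moment coefficient of skewness used here is an assumption, and the function name `skewness_direction` and the data are hypothetical.

```python
import statistics


def skewness_direction(values):
    """Return (coefficient, label) using the moment coefficient of skewness.

    This coefficient is an assumed choice; it is positive for a right tail
    and negative for a left tail.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD (divisor n)
    if sd == 0:
        return 0.0, "symmetric"  # all values equal, so there is no tail
    coefficient = sum((x - mean) ** 3 for x in values) / (n * sd ** 3)
    if coefficient > 0:
        return coefficient, "positively skewed (right tail)"
    if coefficient < 0:
        return coefficient, "negatively skewed (left tail)"
    return coefficient, "symmetric"


# Hypothetical household sizes: many small values and one large one.
household_sizes = [1, 1, 2, 2, 3, 10]
print(skewness_direction(household_sizes))
```

For data with many low values and a few high ones, such as incomes or household sizes, the coefficient comes out positive, matching the "right tail" description.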

Exercises

1. The local bus company employs 10 people. The length of service, in completed years, for each employee is as follows:

8 8 1 4 1 8 8 7 3

(a) Calculate the range.
(b) Calculate the standard deviation (assume the values are sample values).
(c) Calculate the coefficient of variation.
(d) Calculate the reference range which contains approximately 68% of observations.

2. Customs files reveal the ages of persons leaving the country. A sample of ages is: 16, 41, 5, 1, 30, 17, 9, 50, 30 and 39.

(a) Calculate the range.
(b) Assume the values are sample values and calculate the sample variance using the second method of calculating the variance.
(c) Calculate the coefficient of variation.
(d) Calculate the reference range that contains approximately 95% of observations.

3. The local market reported the following number of people buying vegetables for the past 9 days:

81 65 58 47 30 51 9 85 4

(a) Calculate the range.
(b) Calculate the standard deviation (assume the values are sample values).
(c) Calculate the coefficient of variation.
(d) Calculate the reference range that contains approximately 95% of observations.

Self-Review

1. The following data represent the amount spent (in dollars) by a random sample of 14 households on basic food items for one month:

57 34 7 41 5 18 39 33 37 39 38 47 31 4

(a) Calculate the range.
(b) Calculate the sample standard deviation.
(c) Calculate the coefficient of variation.
(d) Calculate the reference range that contains approximately 99% of observations.


Excel functions

More statistical functions
In Topic 5, you were shown how to use the functions related to Measures of Location. In this section, those relevant to Measures of Variation are illustrated.

You don't have to use the functions; instead you can set up a worksheet with three columns (observation, deviation from the mean and deviation squared). See the computer notes for Topic 7 to set up the worksheet to calculate the variance, standard deviation and standard error from sample data. You have to be careful because the way your sample was selected determines how the standard error is calculated. If you have any doubts about the correct formula to use, contact the SPC Statistics Programme for help.

When calculating the variance or standard deviation, it might be more useful to use the worksheet method rather than the Excel function. If you have the columns set up in your worksheet you can see the different components of the equation (x etc.), and it is easier to find out why you had a larger or smaller than expected deviation in your data. You also have to be aware that Excel uses its average function, which includes 0 values in the count of observations (n); this might not be appropriate in all circumstances.

The range
You don't really need to use a function to calculate the range: use the sort buttons on the Standard toolbar. You can sort from smallest to largest with the Sort Ascending button, and from largest to smallest with the Sort Descending button. The range is then the difference between the last and first values in the sorted column. Be careful when you sort data: either select ALL your data, or click with the mouse in the column you want to sort by. It is very easy to corrupt your data with the sort buttons (you don't get a warning like you do with the Sort option on the Data menu).

Population variance
Excel calculates the variance for a POPULATION using a formula that is a different way of writing the one used in your notes.

Format: =varp(cell range)
Example: =varp(a1:a333) will calculate the variance for the POPULATION in cells A1 to A333.
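The worksheet method and the population variance can be mirrored outside Excel. The following Python sketch is an illustration under stated assumptions: the observations are hypothetical, the function name `variance_worksheet` is invented for this example, and the computational form n*sum(x^2) - (sum(x))^2 over n^2 is the textbook rearrangement of the population variance, assumed here to correspond to the "different way of writing" the formula.

```python
import statistics


def variance_worksheet(values):
    """Build the three worksheet columns: observation, deviation, deviation squared."""
    mean = sum(values) / len(values)
    rows = [(x, x - mean, (x - mean) ** 2) for x in values]
    return mean, rows


# Hypothetical observations; note that the 0 is counted as an observation.
observations = [4, 6, 2, 0, 3, 5, 8]
n = len(observations)

mean, rows = variance_worksheet(observations)
for observation, deviation, squared in rows:
    print(observation, deviation, squared)

# Population variance from the worksheet columns (divisor n).
population_variance = sum(squared for _, _, squared in rows) / n

# The same quantity in computational form (assumed equivalent rearrangement).
sum_x = sum(observations)
sum_x_squared = sum(x * x for x in observations)
computational_form = (n * sum_x_squared - sum_x ** 2) / n ** 2

print(population_variance, computational_form, statistics.pvariance(observations))
```

Printing the rows makes it easy to spot which observation contributes an unexpectedly large squared deviation, which is the advantage of the worksheet method described above.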
Sample variance
Excel calculates the variance for a SAMPLE using a formula that again is a different way of writing the one used in your notes.

Format: =var(cell range)
Example: =var(a1:a333) will calculate the variance for the SAMPLE in cells A1 to A333.
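The only difference between the sample and population versions is the divisor. The short Python sketch below illustrates this with hypothetical observations; the computational form with divisor n(n - 1) is the textbook rearrangement of the sample variance and is an assumption about the formula referred to above.

```python
import math
import statistics

# Hypothetical observations, used only for illustration.
observations = [4, 6, 2, 0, 3, 5, 8]
n = len(observations)
mean = sum(observations) / n
sum_of_squared_deviations = sum((x - mean) ** 2 for x in observations)

# Sample variance divides by n - 1; population variance divides by n.
sample_variance = sum_of_squared_deviations / (n - 1)
population_variance = sum_of_squared_deviations / n

# Computational form of the sample variance (assumed equivalent rearrangement).
sum_x = sum(observations)
sum_x_squared = sum(x * x for x in observations)
computational_form = (n * sum_x_squared - sum_x ** 2) / (n * (n - 1))

# The standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_variance)

print(sample_variance, computational_form, statistics.variance(observations))
print(sample_sd, statistics.stdev(observations))
```

The sample variance is always a little larger than the population variance for the same values, and the gap shrinks as n grows.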

Population standard deviation
Excel calculates the standard deviation for a POPULATION using a formula that is a different way of writing the one used in your notes.

Format: =stdevp(cell range)
Example: =stdevp(a1:a333) will calculate the standard deviation for the POPULATION in cells A1 to A333.

Sample standard deviation
Excel calculates the standard deviation for a SAMPLE using a formula that again is a different way of writing the one used in your notes.

Format: =stdev(cell range)
Example: =stdev(a1:a333) will calculate the standard deviation for the SAMPLE in cells A1 to A333.

Confidence interval
You can use Excel to calculate the confidence interval for a mean. You have to type in the standard deviation, so the function is not that user friendly.

Format: =confidence(alpha,standard_dev,size)

where:
alpha is the significance level used to compute the confidence level. The confidence level equals 100*(1 - alpha)%; in other words, an alpha of 0.05 indicates a 95 percent confidence level.
standard_dev is the population standard deviation for the data range and is assumed to be known.
size is the sample size.

Example: Suppose we observe that, in a sample of 50 commuters, the average length of travel to work is 30 minutes with a population standard deviation of 2.5. We can calculate with 95% confidence that the population mean is in the interval:

=CONFIDENCE(0.05,2.5,50) equals 0.692951.

In other words, the average length of travel to work equals 30 ± 0.692951 minutes, or 29.3 to 30.7 minutes.
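The value returned by the confidence function is the half-width of the interval: the normal critical value for the chosen alpha multiplied by the standard deviation divided by the square root of the sample size. The Python sketch below reproduces the commuter example (sample of 50, mean of 30 minutes, population standard deviation of 2.5) under that assumption; the function name `confidence_margin` is hypothetical and is not an Excel or library function.

```python
import math
from statistics import NormalDist


def confidence_margin(alpha, standard_dev, size):
    """Half-width of a normal-based confidence interval for a mean.

    Assumes the population standard deviation is known, as the Excel
    function described in the text does.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha is 0.05
    return z * standard_dev / math.sqrt(size)


sample_mean = 30  # average travel time in minutes
margin = confidence_margin(0.05, 2.5, 50)
lower, upper = sample_mean - margin, sample_mean + margin

print(round(margin, 4), round(lower, 1), round(upper, 1))
```

The result agrees with the Excel example to about five decimal places; the last digit can differ slightly depending on how precisely the critical value is computed.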