Measures of Central Tendency and their dispersion and applications. Acknowledgement: Dr Muslima Ejaz

LEARNING OBJECTIVES
- Compute and distinguish between the uses of the measures of central tendency: mean, median and mode.
- Compute and list some uses for the measures of variation or dispersion: range, variance and standard deviation.
- Understand the distinction between the population mean and the sample mean.
- Learn the empirical rule and its application.

REFERENCES
- Kuzma, Jan W. and Bohnenblust, Stephen E. Basic Statistics for the Health Sciences. Mayfield Publishing Company, 2001.
- Ott, Lyman. An Introduction to Statistical Methods and Data Analysis. PWS-Kent Publishing Company, 1988.

9/24/2013

- The average speed of a car crossing midtown Manhattan during the day is 5.3 miles/hr.
- The average American father of a 4-year-old spends 42 minutes alone with his child each day.
- The average American man is 5 feet 9 inches tall and the average woman is 5 feet 3.6 inches tall.
- The average American man is sick in bed seven days a year, missing 5 days of work.

Measures of Central Tendency (center of the distribution)
- Find a single score that is most typical or most representative of the entire group.
- Helpful in comparing groups.
- No single measure is representative in every situation; there are three ways of determining central tendency:
  - Mean
  - Median
  - Mode

Mean
- Also called the arithmetic mean or average.
- The sum of all scores divided by the number of scores:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Sample Mean
- Add up all the observations given in the data, then divide by the sample size (n).
- The sample size n is the number of observations.

Example: Mean
Systolic blood pressures (mmHg), n = 5:
X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

Example: Mean

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Mean systolic blood pressure: $\bar{X} = \frac{495}{5} = 99$ mmHg
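The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the names `pressures` and `sample_mean` are illustrative and do not come from the slides.

```python
# Systolic blood pressures (mmHg) from the example above.
pressures = [120, 80, 90, 110, 95]


def sample_mean(values):
    """Return the sum of the observations divided by how many there are."""
    return sum(values) / len(values)


mean_bp = sample_mean(pressures)
print(f"sum = {sum(pressures)}, n = {len(pressures)}, mean = {mean_bp}")
```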

Pros and Cons of the Mean
Pros
- Mathematical center of a distribution: just as far from the scores above it as it is from the scores below it.
- Does not ignore any information.
Cons
- Influenced by extreme scores and skewed distributions.
- One data point can make a great change in the sample mean.

Example
Systolic blood pressures (mmHg), n = 5:
X1 = 120, X2 = 180, X3 = 90, X4 = 110, X5 = 95
Mean systolic blood pressure: $\bar{X} = \frac{595}{5} = 119$ mmHg
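To see how much one extreme score moves the mean, the two blood pressure data sets can be compared side by side. The sketch below assumes Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean

original = [120, 80, 90, 110, 95]       # first example
with_extreme = [120, 180, 90, 110, 95]  # same data with X2 changed from 80 to 180

mean_original = mean(original)
mean_extreme = mean(with_extreme)

# A single changed observation shifts the mean by (180 - 80) / 5.
shift = mean_extreme - mean_original
print(mean_original, mean_extreme, shift)
```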

Population Versus Sample Mean
Population
- The entire group you want information about.
- For example: the blood pressure of all 18-year-old male medical college students at AKU.

Population Versus Sample Mean (cont.)
Sample
- A part of the population from which we actually collect information and draw conclusions about the whole population.
- For example: a sample of the blood pressures of n = 5 18-year-old male college students at AKU.

Mean
Population mean (mu):

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

- $\sum$ (sigma): the sum of X; add up all scores.
- N: the total number of scores in the population.

Sample mean (X bar):

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

- $\sum$ (sigma): the sum of X; add up all scores.
- n: the total number of scores in the sample.

The Median
- The score that divides the distribution exactly in half when the observations are ordered.
- The 50th percentile (50%).
- Goal: determine the exact midpoint.
- With the scores arranged in rank order (highest to lowest, or lowest to highest), the middle score is at position (n + 1)/2.

Example: Median
Unordered: 110, 90, 80, 95, 120
Ordered: 80, 90, 95, 110, 120
The median is the middle value when the observations are ordered. To find the middle, count in (n + 1)/2 scores from the lowest.
Position of the median: (5 + 1)/2 = 3, so the median systolic BP is the 3rd ordered value, 95.

Finding the median with an even number of scores
With an even number of scores, the median is the average of the middle two observations when the observations are ordered.
80, 90, 95, 110, 120, 125
Median = (95 + 110)/2 = 102.5

Example: Median
80, 90, 95, 110, 220
The median is still the middle value, 95, even though the largest observation is now 220.
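The three median examples (odd count, even count, and an extreme largest score) can be reproduced with a short function. This is a sketch under the usual textbook definition; `median_of` is a hypothetical helper name, not something from the slides.

```python
def median_of(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the value at position (n + 1) / 2, counting from 1.
        return ordered[middle]
    # Even n: average the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median_of([110, 90, 80, 95, 120])
even_case = median_of([80, 90, 95, 110, 120, 125])

extreme = [80, 90, 95, 110, 220]
extreme_median = median_of(extreme)
extreme_mean = sum(extreme) / len(extreme)

print(odd_case, even_case, extreme_median, extreme_mean)
```

In the last data set the median stays at the middle value while the mean is pulled up by the score of 220.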

Pros and Cons of the Median
Pros
- Not influenced by extreme scores or skewed distributions.
- Easier to compute than the mean.
Cons
- Doesn't take the actual values into account. Because its value is determined solely by rank, it provides no information about any of the other values within the distribution.

The Mode
- The most frequently occurring score (the score with the highest frequency).
- Applicable to both qualitative and quantitative data.
- A distribution can be bimodal or multimodal.

Central Tendency Example: Mode
75, 76, 90, 90, 95, 99, 100, 120, 120, 135, 135, 155, 170, 186, 196, 205, 220
Mode: the most frequent observation.
Mode(s) for blood pressure: 90, 120, 135
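A frequency count makes it easy to confirm that this data set has three modes. The sketch below uses `collections.Counter` from the standard library; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


blood_pressures = [75, 76, 90, 90, 95, 99, 100, 120, 120,
                   135, 135, 155, 170, 186, 196, 205, 220]

bp_modes = modes(blood_pressures)
print(bp_modes)
```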

Pros and Cons of the Mode
Pros
- Easiest to compute and understand.
- The score comes from the data set.
Cons
- Ignores most of the information in a distribution.
- Small samples may not have a mode.

Using different measures of central tendency
Two factors are important in deciding which measure of central tendency should be used:
- Scale of measurement (ordinal or numerical).
- Shape of the distribution of observations. A distribution can be symmetric, skewed to the right (positively skewed) or skewed to the left (negatively skewed).

Using different measures of central tendency
In a normal distribution, the mean, median, and mode are the same.
[Figure: normal curve f(x) against x, with the mean, median and mode all located at µ]

The effect of skew on the average
In a skewed distribution, the mean is pulled toward the tail.

Using different measures of central tendency
The following guidelines help the researcher decide which measure is best with a given set of data:
- The mean is used for numerical data and for symmetric distributions.
[Figure: histogram of a symmetric distribution; horizontal axis: Values, vertical axis: Frequency]

Using different measures of central tendency
The following guidelines help the researcher decide which measure is best with a given set of data:
- The median is used for ordinal data or for numerical data whose distribution is skewed.

Using different measures of central tendency
The following guidelines help the researcher decide which measure is best with a given set of data:
- The mode is used primarily for nominal or ordinal data, or for numerical data with a bimodal distribution.
[Figure: histogram of a bimodal distribution; horizontal axis: Stress Rating, vertical axis: Frequency]

Measures of Variation or Measures of Dispersion

Measures of Variability
- A single summary figure that describes the spread of observations within a distribution.
- Distributions can be centrally located at the same value on the horizontal axis and still have substantially different amounts of variability.

Measures of Variability
Consider the following two data sets on the ages of all patients suffering from bladder cancer (BC) and prostatic cancer (PC).

BC    PC
47    70
38    33
35    18
40    52
36    27
45    39

The mean age in both groups is about 40 years. If we do not know the ages of the individual patients and are told only that the mean age of the patients in the two groups is the same, we may assume that the patients in the two groups have a similar age distribution. In fact, the variation in the patients' ages in the two groups is very different: the ages of the prostatic cancer patients vary much more than the ages of the bladder cancer patients.

Measures of Variability
Measure the spread in the data. Some important measures:
- Range
- Mean deviation
- Variance
- Standard deviation
- Coefficient of variation

Variability
The purpose of the majority of medical, behavioural and social science research is to explain or account for variance or differences among individuals or groups.
Examples
1. What factors account for the variance (or difference) in IQ among individuals?
2. What factors account for the variance in treatment compliance among different groups of patients?

Range
- The range tells us the span over which the data are distributed, and is only a very rough measure of variability.
- Range: the difference between the maximum and minimum scores.
80, 90, 95, 110, 120
Range = 120 - 80 = 40

Range
- The range is the simplest measure of dispersion.
- It depends entirely on the extreme scores and doesn't take into consideration the bulk of the observations.
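Because the range uses only the two extreme scores, changing a single extreme observation changes it completely. A minimal sketch follows; `value_range` is a hypothetical helper name, and the second data set reuses the scores 80, 90, 95, 110, 220 from the median example.

```python
def value_range(values):
    """Difference between the maximum and minimum scores."""
    return max(values) - min(values)


scores = [80, 90, 95, 110, 120]
with_extreme = [80, 90, 95, 110, 220]  # only the largest score differs

range_scores = value_range(scores)
range_extreme = value_range(with_extreme)
print(range_scores, range_extreme)
```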

Variation

X    X - X̄
5    0.00
5    0.00
5    0.00
5    0.00
5    0.00

ΣX = 25, n = 5, X̄ = 5
This is an example of data with no (i.e. zero) variability.

Variation

X    X - X̄
6    +1.00
4    -1.00
6    +1.00
5     0.00
4    -1.00

ΣX = 25, n = 5, X̄ = 5
This is an example of data with low variability.

Variation

X    X - X̄
8    +3.00
1    -4.00
9    +4.00
5     0.00
2    -3.00

ΣX = 25, n = 5, X̄ = 5
This is an example of data with high variability.

Mean deviation
The best measures of dispersion should:
- take into account all the scores in the distribution, and
- describe the average deviation of all observations from the mean.
Normally, to find the average we would want to sum all the deviations from the mean and then divide by n, i.e.

$$\frac{\sum (X - \bar{X})}{n}$$

Because the signed deviations from the mean always sum to zero, the mean deviation is computed from their absolute values, $|X - \bar{X}|$.

Mean Deviation
n = 6; ΣX = 33; X̄ = ΣX/n = 33/6 = 5.50

X    |X - X̄|
3    |3 - 5.50| = 2.50
5    |5 - 5.50| = 0.50
9    |9 - 5.50| = 3.50
2    |2 - 5.50| = 3.50
8    |8 - 5.50| = 2.50
6    |6 - 5.50| = 0.50

Σ|X - X̄| = 13
Mean deviation = 13/6 = 2.167
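The mean deviation table can be reproduced in a few lines. The sketch below also prints the sum of the signed deviations, which is zero and is the reason absolute values are used. Variable names are illustrative.

```python
scores = [3, 5, 9, 2, 8, 6]

mean = sum(scores) / len(scores)  # 33 / 6

# Signed deviations cancel out around the mean.
signed_sum = sum(x - mean for x in scores)

# Absolute deviations measure distance from the mean regardless of direction.
absolute_deviations = [abs(x - mean) for x in scores]
mean_deviation = sum(absolute_deviations) / len(scores)

print(mean, signed_sum, sum(absolute_deviations), round(mean_deviation, 3))
```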

Variance & Standard Deviation
- The deviations from the mean always sum to zero. However, if we square each of the deviations from the mean, we obtain a sum that is not equal to zero.
- This is the basis for the variance and the standard deviation, the two most common measures of variability (or dispersion) of data.

Variance & Standard Deviation (cont.)

X    X - X̄    (X - X̄)²
8    +3.00     9.00
1    -4.00    16.00
9    +4.00    16.00
5     0.00     0.00
2    -3.00     9.00

ΣX = 25, Σ(X - X̄) = 0.00, Σ(X - X̄)² = 50.00
Note: Σ(X - X̄)² is called the Sum of Squares.

Steps to calculate the variance
1. Compute the mean.
2. Subtract the mean from each observation.
3. Square each of the deviations.
4. Find the sum of the squares.
5. Divide the sum by N to get the population variance (for a sample, divide by n - 1).
6. Take the square root of the variance to get the standard deviation.
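The steps can be followed one at a time in code. The sketch below uses the five scores 8, 1, 9, 5, 2 from the sum-of-squares table and treats them as a complete population, so the divisor is N; that choice is an assumption made for illustration.

```python
import math

scores = [8, 1, 9, 5, 2]
N = len(scores)

# Step 1: compute the mean.
mean = sum(scores) / N

# Step 2: subtract the mean from each observation.
deviations = [x - mean for x in scores]

# Step 3: square each of the deviations.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: find the sum of the squares.
sum_of_squares = sum(squared_deviations)

# Step 5: divide the sum by N to get the (population) variance.
variance = sum_of_squares / N

# Step 6: take the square root to get the standard deviation.
standard_deviation = math.sqrt(variance)

print(mean, sum_of_squares, variance, round(standard_deviation, 3))
```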

A Few Facts
- The square root of the variance gives the standard deviation (SD); conversely, squaring the SD gives the variance.
- The variance is the average of the squared distances of each value from the mean.
- Why squared distances and not the actual ones? The sum of the signed distances from the mean is always zero; squaring each distance eliminates the negative signs.
- Why take the square root? Since the distances were squared, the units of the variance are the squares of the units of the original raw data. Taking the square root of the variance puts the SD in the same units as the raw data, i.e. the standard deviation expresses variability in the same units as the data.

Sample Variance
The sum of squared deviations from the mean divided by n - 1 (an estimate of the population variance):

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

Variance of a Population
The sum of squared deviations from the mean divided by the number of scores (sigma squared):

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Standard Deviation Formulas
Population standard deviation:

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

Sample standard deviation:

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

A standard deviation computed from a sample with n in the denominator usually underestimates the population standard deviation. Using n - 1 in the denominator corrects for this and gives a better estimate of the population standard deviation.
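The effect of the two denominators can be seen by applying both formulas to the same numbers. The sketch below assumes Python's standard `statistics` module, where `pvariance` and `pstdev` divide by N and `variance` and `stdev` divide by n - 1; the five scores are reused from the sum-of-squares table purely for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [8, 1, 9, 5, 2]  # sum of squares = 50

# Treating the scores as the whole population: divide by N = 5.
population_variance = pvariance(scores)
population_sd = pstdev(scores)

# Treating the scores as a sample: divide by n - 1 = 4.
sample_variance = variance(scores)
sample_sd = stdev(scores)

print(population_variance, sample_variance)
print(round(population_sd, 3), round(sample_sd, 3))
```

The sample versions are always the larger of the two, which is how the n - 1 denominator offsets the tendency to underestimate.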

Sometimes it is of interest to compare the degree of variability in the distribution of a factor in two different populations, or of two different variables in the same population. For example: SBP (one factor) among children and adults (two different populations), or, among adults, whether the distribution of SBP has more spread than that of DBP.

Coefficient of variation
- Expresses the SD as a percentage of the mean.
- It is a dimensionless measure of relative variation.
- Constructed by dividing the standard deviation by the mean and multiplying by 100: CV = (SD / mean) × 100
- It depicts the size of the standard deviation relative to its mean.
- Used to compare the variability in one data set with that in another when a direct comparison of standard deviations is not appropriate.

Coefficient of variation
The formula is: $CV = (s / \bar{X}) \times 100$
Suppose two samples of human males yield the following results:

            Mean age   Mean wt   SD       CV
Adults      25 yrs     145 lbs   10 lbs   6.9%
Children    11 yrs     80 lbs    10 lbs   12.5%
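The two CV values in the table follow directly from the formula. A minimal sketch, with `coefficient_of_variation` as a hypothetical helper name:

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Mean weight and SD (both in lbs) from the table above.
cv_adults = coefficient_of_variation(sd=10, mean=145)
cv_children = coefficient_of_variation(sd=10, mean=80)

print(round(cv_adults, 1), round(cv_children, 1))
```

Although both groups have the same SD of 10 lbs, the children's weights vary more relative to their mean.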

Using different measures of dispersion
The following guidelines help investigators decide which measure of dispersion is most appropriate for a given set of data:
- The standard deviation is used when the mean is used, i.e. with symmetric distributions of numerical data.
- The range is used with numerical data when the purpose is to emphasize extreme values.
- The coefficient of variation is used when the intent is to compare two numerical distributions measured on different scales.

Empirical Rule
- Specifies the proportion of the spread in terms of the standard deviation.
- It applies to the normal (symmetric, bell-shaped) distribution.
- Approximately 68% of the data values will fall within 1 SD of the mean.
- Approximately 95% of the data values will fall within 2 SD of the mean.
- Approximately 99.7% of the data values will fall within 3 SD of the mean.

Empirical Rule
[Figure: normal curve showing the approximate percentage of area within 1, 2 and 3 standard deviations of the mean: 68%, 95% and 99.7%]
Assumes the distribution of the underlying variable is symmetric and bell shaped (normal).

Example
Scores on a National Achievement Exam have a mean of 480 and an SD of 90. If these scores are normally distributed, then:
- approximately 68% will fall between 390 and 570
- approximately 95% will fall between 300 and 660
- approximately 99.7% will fall between 210 and 750
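Each interval in this example is simply the mean plus or minus 1, 2 or 3 standard deviations. The sketch below computes them; `empirical_intervals` is a hypothetical helper name, and the percentages are the approximate figures from the empirical rule.

```python
def empirical_intervals(mean, sd):
    """Map each empirical-rule percentage to the interval mean ± k·SD."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * sd, mean + k * sd) for k, label in coverage.items()}


exam_intervals = empirical_intervals(mean=480, sd=90)
for label, (low, high) in exam_intervals.items():
    print(f"approximately {label} of scores fall between {low} and {high}")
```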

Application of the Empirical Rule
Women participating in a three-day experimental diet regime have been shown to have normally distributed weight loss with a mean of 600 g and a standard deviation of 200 g.
a) What percentage of these women will have a weight loss between 400 and 800 g?
b) What percentage of the women will lose weight too quickly on the diet (where too much weight loss is defined as > 1000 g)?

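a) X is normally distributed with mean 600 g and SD 200 g. The interval from 400 to 800 g is the mean ± 1 SD, so approximately 68% of the women fall within it.
[Figure: normal curve over a weight-loss axis from 0 to 1200 g]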

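b) X is normally distributed with mean 600 g and SD 200 g. A loss above 1000 g is more than 2 SD above the mean. About 95% of values lie within 2 SD, so about 5% lie outside, half of them (roughly 2.5%) in the upper tail; the exact area under the normal curve is about 2.3%.
[Figure: normal curve over a weight-loss axis from 0 to 1200 g]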
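The answers to parts a) and b) can be checked against the exact normal curve rather than the rounded empirical-rule percentages. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Weight loss in grams: normal with mean 600 and standard deviation 200.
weight_loss = NormalDist(mu=600, sigma=200)

# a) Probability of a loss between 400 g and 800 g (within 1 SD of the mean).
p_between = weight_loss.cdf(800) - weight_loss.cdf(400)

# b) Probability of a loss greater than 1000 g (more than 2 SD above the mean).
p_above = 1 - weight_loss.cdf(1000)

print(f"a) {p_between:.1%}   b) {p_above:.1%}")
```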