MEASURES OF LOCATION AND SPREAD

Biostatistics for medical students: written by Dr. Nasih Othman, Sulaimani Polytechnic University 2012

Frequency distributions and the other methods of data summarization and presentation explained in the previous lectures provide a fairly detailed description of the data and how they are distributed in the sample. For categorical variables this is usually enough, but for quantitative variables we have more methods to summarize and present the data. Since quantitative variables are numbers (whether discrete or continuous), we can order them and summarize them in terms of how they are clustered and spread out in the sample. Quantitative variables can be summarized in terms of the location of their values (measures of location, or measures of central tendency) and how they are spread in the sample (measures of spread, or variation).

MEASURES OF LOCATION (Measures of Central Tendency)

Measures of location tell us where the values of the variable are located when the data are ordered. There are three measures of location: the median, the mode and the mean. Each of these measures has its own advantages and disadvantages, which depend on the type of data being summarized.

Median

When we order the values in ascending or descending order, the median is the value that divides the distribution into two equal parts, so that there is the same number of observations above and below the median.

For example, the ages of 15 women in a survey were as follows:

17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18

To calculate the median, we rearrange the values in ascending order (Table 1). Observation number 8 (27 years) is the middle observation, i.e. there are 7 observations on either side of 27, so the median age is 27 years.

When there is an even number of data values, there is no single middle value. In this case the median is the average of the central pair of values, i.e. we add up the two central values and divide the result by 2. For example, Table 2 has 16 observations, so there is no single middle value. The median for these data is calculated from the two values in the middle of the ordered data, i.e. observations 8 and 9:

Median age = (27 + 28)/2 = 55/2 = 27.5 years

Table 1 (15 observations)      Table 2 (16 observations)
ID    Age                      ID    Age
 1    17                        1    17
 2    18                        2    18
 3    19                        3    19
 4    22                        4    22
 5    22                        5    22
 6    23                        6    23
 7    25                        7    25
 8    27                        8    27
 9    28                        9    28
10    30                       10    30
11    33                       11    33
12    36                       12    36
13    39                       13    39
14    42                       14    42
15    44                       15    44
                               16    46

Median for Frequency Distributions

The median for a frequency distribution is the value at which the cumulative relative frequency reaches 50%.
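The odd and even cases of the median can be checked with a short Python sketch. The function name `median_by_hand` and the variable names are illustrative assumptions, not part of the original notes; the data are the ages from Table 1 and Table 2.

```python
from statistics import median

# Ages from Table 1 (15 observations) and Table 2 (the same ages plus 46).
ages_15 = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]
ages_16 = ages_15 + [46]


def median_by_hand(values):
    """Median as described in the text: order the values, then take the
    middle one (odd n) or the average of the central pair (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the central pair


print(median_by_hand(ages_15))  # 27
print(median_by_hand(ages_16))  # 27.5

# The standard library gives the same answers.
print(median(ages_15), median(ages_16))
```

With 16 values the central pair sits at positions 8 and 9 of the ordered list (27 and 28), which is why the result is 27.5.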

Mode

The mode of a distribution is the value that occurs most frequently. A distribution may have more than one mode. In the example above, 22 occurs twice and every other age occurs only once, so 22 is the mode.

Mean

The mean is the average of all values. It is calculated as the sum of all values divided by the number of observations. If each of the n observations (n is the sample size) has a value X_i, then the mean is:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Example: the ages of 15 women in a survey were as follows:

17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18

Mean age of the women = sum of all ages/n
= (17 + 25 + 36 + 23 + 44 + 39 + 19 + 22 + 30 + 33 + 42 + 28 + 27 + 22 + 18)/15
= 425/15 = 28.3 years

The mean age of the sample is 28.3 years.

Mean for Frequency Distributions

If we have grouped data from a frequency table and we don't have the individual values, we can still calculate the mean from the grouped data: calculate the total for each interval (frequency x midpoint), add up the totals for all intervals, and divide the result by the sample size. If f is the frequency of each interval and m is its midpoint, the mean is calculated in the following way:

$$\bar{X} = \frac{\sum f \, m}{n}$$

Table 3 displays grouped data for the Hb of 50 women. To calculate the sum for each interval, we first calculate the midpoint of the interval (column 3) and multiply it by the frequency (column 2) to obtain the sum of the values in that interval (column 4).

Mean Hb = [(4*8.5) + (7*9.5) + (18*10.5) + (13*11.5) + (3*12.5) + (4*13.5) + (1*14.5)]/50
Mean Hb = 545/50 = 10.9 gm

Therefore the mean Hb of the 50 women is 10.9 gm.

Table 3. Calculation of mean Hb of 50 women from a frequency distribution table

Hb             Frequency    Mid-point    Sum of interval
8-8.9              4           8.5            34
9-9.9              7           9.5            66.5
10-10.9           18          10.5           189
11-11.9           13          11.5           149.5
12-12.9            3          12.5            37.5
13-13.9            4          13.5            54
14 and over        1          14.5            14.5
Total             50                         545
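The mode, the mean, and the grouped mean described above can be reproduced with a minimal Python sketch. The variable names (`hb_table`, `grouped_mean`, and so on) are illustrative assumptions; the numbers are the ages and the Hb frequency table from the text.

```python
from statistics import multimode

ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# Mode: the most frequent value(s). multimode returns a list because a
# distribution may have more than one mode.
modes = multimode(ages)

# Mean: sum of all values divided by the number of observations.
mean_age = sum(ages) / len(ages)

# Grouped data: (frequency, midpoint) for each Hb interval in the table.
hb_table = [
    (4, 8.5), (7, 9.5), (18, 10.5), (13, 11.5),
    (3, 12.5), (4, 13.5), (1, 14.5),
]
n = sum(f for f, _ in hb_table)                 # sample size: 50
interval_sums = [f * m for f, m in hb_table]    # column 4 of the table
grouped_mean = sum(interval_sums) / n

print(modes)                   # [22]
print(round(mean_age, 1))      # 28.3
print(round(grouped_mean, 1))  # 10.9
```

The grouped mean is an approximation: it treats every observation in an interval as if it sat at the interval's midpoint, so it can differ slightly from the mean of the raw values.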

Properties of the Mean, Median & Mode

1. The mean, mode and median will be similar if the data are symmetrically distributed around the mean, as in a normal distribution. If the distribution is skewed, the three measures will differ.

2. The mean is sensitive to outliers; the others are much less so. An outlier is an extreme value, i.e. a value which is far from the rest of the values. If there are outliers in the data, the mean will be pulled towards them. The mode and the median are largely unaffected by outliers.

3. The mode may be affected by small changes in the data, whereas the mean and the median change very little when the data change slightly.

Which measure should we use?

Generally, if the data distribution is not symmetrical (it is skewed or there are outliers), the median is a better measure of location than the mean. When we want to perform statistical analysis for inference, the mean is more flexible and useful. But if the data are far from symmetrically distributed, the median is generally the safer choice even for statistical inference.
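The sensitivity of the mean to outliers (property 2) can be illustrated with a small Python sketch. The extra value of 95 years is an invented outlier added purely for illustration; it does not come from the survey data, and the variable names are assumptions.

```python
from statistics import mean, median, multimode

ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# Hypothetical outlier, not part of the original data set.
ages_with_outlier = ages + [95]

for label, data in [("original", ages), ("with outlier", ages_with_outlier)]:
    print(label,
          "mean =", round(mean(data), 1),
          "median =", median(data),
          "mode =", multimode(data))
```

In this sketch the single extreme value moves the mean by several years, while the median shifts by only half a year and the mode does not change at all.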

MEASURES OF SPREAD

If we look at a set of quantitative data displayed as a frequency distribution or a graph, we can say whether the observations are widely spread out from the mean or clustered around it. But this is not enough; it is usually necessary to describe this variability of the observations as a numerical value. Such a value is called a measure of spread. A measure of spread reported along with the mean provides a more informative summary of a data set.

There are three main ways to summarize the variability of a set of data (three measures of spread):

1. Range: gives the range of all values.
2. Percentiles: report the values below which certain percentages of the data lie.
3. Standard deviation: a single numerical measure of the spread around the mean.

Each measure has its own advantages, but the standard deviation is the most useful in statistical calculations.

Range

The simplest way to describe the spread of a set of observations is to report the range from the minimum value to the maximum. The range therefore tells us the lowest value and the highest value, and hence the difference between them. The problem is that it reports the most extreme values, which may not represent the majority of the data. The actual distribution of the values between these two extremes is not summarized in any way.

Example: the ages of 15 women in a survey were as follows:

17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18

To calculate the range, we first order the values from minimum to maximum, then identify the smallest and the largest values and report them.

17, 18, 19, 22, 22, 23, 25, 27, 28, 30, 33, 36, 39, 42, 44

The range is 17-44 years. This means that the ages of the women are spread out from 17 to 44 years, including 44. Sometimes when we report the range we also report the interval (the difference between the maximum and the minimum). For example, the difference between 44 and 17 (44 - 17) is 27 years. Then we say the range was 27 years (17-44).

Percentiles

A percentile (or centile) is the value below which a given percentage of the data lies. For example, in the graph below of the height of a group of people, the 5th percentile is 145 cm, meaning that 5% of the group had a height below 145 cm. The 95th percentile is 165 cm, which means that 95% of the group had a height below 165 cm. By specifying these two percentiles we give a range in which 90% of the data lies.

[Figure: distribution of height in a group of people. Horizontal axis: Height in cm, from 140 to 170.]
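The range and percentile ideas can be sketched in Python using the ages of the 15 women, since the height data behind the graph are not available in these notes. The variable names are illustrative assumptions. Different software packages interpolate percentiles in slightly different ways; `method="inclusive"` is one reasonable choice here, not the only one.

```python
from statistics import quantiles

ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# Range: report the minimum and maximum, and optionally their difference.
lowest, highest = min(ages), max(ages)
interval = highest - lowest
print(f"Range: {lowest}-{highest} years (interval {interval} years)")

# Percentiles: quantiles(..., n=100) returns the 99 cut points that split
# the ordered data into 100 equal parts, so index k-1 is the kth percentile.
cuts = quantiles(ages, n=100, method="inclusive")
p5, p50, p95 = cuts[4], cuts[49], cuts[94]
print("5th percentile:", p5)
print("50th percentile (the median):", p50)
print("95th percentile:", p95)
```

The 50th percentile coincides with the median, which matches the earlier description of the median as the value where the cumulative relative frequency reaches 50%.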

Standard Deviation

The most common way of quantifying the variability of a distribution is to calculate its standard deviation. This method uses all the observations, by accounting for all deviations from the mean. By deviations we mean the differences between each observation and the mean. The standard deviation is a sort of average of all the deviations.

Mathematically, if each observation has a value X_i (where i = 1 to n), then its distance from the mean value, \bar{X}, is (X_i - \bar{X}). With n observations we have n such distances. We could calculate the average of these distances by summing all the observed deviations and dividing by n:

$$\text{Average deviation} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})}{n}$$

However, simply calculating the average deviation is not sufficient. In fact this equation will always give an average deviation of zero, because the positive deviations from the mean always exactly balance the negative deviations. What we are interested in is the magnitude of the deviations. If we square the deviations before summing them, we will always get a positive quantity. Dividing this sum by n - 1 then gives a measure of the average squared deviation from the mean, known as the variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Note: in this equation we use n - 1, not n, as the denominator, because we are estimating the population variance from a sample.

The problem with the variance is that it is squared, so it is not in the same unit as the original data. For example, the variance of the height of individuals will be in square cm, which is a unit of area, not height. If we take the square root of the variance we get a measure of variability in the same units as the raw data. This quantity is called the standard deviation, and it tells us roughly how far a typical observation in the data set lies from the mean:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

Example: calculate the variance and standard deviation for the following data on the weight of 10 people in kg.

61, 75, 65, 58, 78, 82, 70, 72, 91, 77

To calculate the variance, first calculate the mean weight:

$$\bar{X} = \frac{\sum X_i}{n} = \frac{61 + 75 + 65 + 58 + 78 + 82 + 70 + 72 + 91 + 77}{10} = 72.9 \text{ kg}$$

Then calculate the variance using the formula:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

$$S^2 = \frac{(58-72.9)^2 + (61-72.9)^2 + (65-72.9)^2 + (70-72.9)^2 + (72-72.9)^2 + (75-72.9)^2 + (77-72.9)^2 + (78-72.9)^2 + (82-72.9)^2 + (91-72.9)^2}{9} = \frac{892.9}{9} = 99.2$$

Then calculate the standard deviation by taking the square root of the variance:

$$S = \sqrt{S^2} = \sqrt{99.2} = 9.96$$

What does this mean? The standard deviation for the data was about 10 kg, meaning that a typical observation was roughly 10 kg away from the mean (either above or below it).

How are normally distributed data spread out in relation to the standard deviation?

For data that are normally distributed:

About 68% of the data lies within 1 standard deviation of the mean
About 95% of the data lies within 2 standard deviations of the mean
About 99.7% of the data lies within 3 standard deviations of the mean

These proportions apply to all normal distributions, regardless of the total number of data values or the width of the distribution. The standard deviation therefore helps to summarize the distribution of the data, and it plays an important role in statistical data analysis.
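The worked example for the variance and standard deviation can be followed step by step in a short Python sketch. The variable names are illustrative assumptions; the data are the 10 weights from the example. The last step counts how many observations fall within 1 standard deviation of the mean, only as a rough illustration, since a sample of 10 values cannot be expected to follow the normal proportions closely.

```python
from math import sqrt
from statistics import stdev, variance

weights = [61, 75, 65, 58, 78, 82, 70, 72, 91, 77]  # kg
n = len(weights)

mean_w = sum(weights) / n                      # 72.9 kg
deviations = [x - mean_w for x in weights]     # X_i minus the mean

# The plain deviations cancel out (up to floating-point rounding),
# which is why they are squared before being summed.
sum_of_deviations = sum(deviations)

sum_of_squares = sum(d ** 2 for d in deviations)
var = sum_of_squares / (n - 1)                 # sample variance, n - 1 denominator
sd = sqrt(var)                                 # back in the original unit (kg)

print(round(mean_w, 1), round(var, 1), round(sd, 2))  # 72.9 99.2 9.96

# The standard library uses the same n - 1 definition.
print(round(variance(weights), 1), round(stdev(weights), 2))

# Number of observations within 1 standard deviation of the mean.
within_1sd = sum(1 for x in weights if abs(x - mean_w) <= sd)
print(within_1sd, "of", n)
```

Dividing by n instead of n - 1 would give a slightly smaller value; the n - 1 form is the one used in the notes because the data are a sample.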