ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

ST2001 2. Presenting & Summarising Data Descriptive Statistics Frequency Distribution, Histogram & Bar Chart

Summary of Previous Lecture
• A study often involves taking a sample from a population that contains all subjects of interest
• With random sampling each subject in the population has the same chance of being in the sample
• Continuous variables take any value in a given interval
• Discrete variables take values from a finite or countably infinite set
• Ordinal variables consist of ranked categories
• Nominal variables have no assumptions about relations between values

Aim & Objectives

Aim
• Discuss a set of statistical procedures known as descriptive statistics, which encompass tabular, graphical and numerical methods

Objectives
• Construct a frequency distribution
• Draw and interpret a histogram
• Distinguish different distribution shapes
• Display categorical data using tables and bar charts
• Summarise numerical data using measures of centrality and variability
• Compute quartiles and percentiles
• Interpret summary statistics
• Choose appropriate summary statistics
• Draw and interpret a boxplot

2.1 Motivating Exercises

Motivating Exercise 1: Summer 2009 Q1b
Scenario
The ages of 90 people seen in the emergency room of a Dublin hospital on a Friday night were recorded. The results are summarised in the frequency table below:

Age (years)   Frequency
15-20         9
20-25         13
25-30         14
30-40         48
40-50         6
50-80         6

Motivating Exercise 1: Summer 2009 Q1b
Questions to Explore
i. Prepare a histogram for the ages and comment on its shape.
ii. Based on the histogram in (i), suggest suitable measures of centrality and spread. Include an explanation for your choice. (Note: you do not need to calculate these measures.)
Thinking Ahead
We will show in this section how to summarise numerical data in tabular and graphical formats

Motivating Exercise 2: Summer 2011 Q2
Scenario
A study examined the moisture content of fields in West Cork. The moisture contents, measured as percentages, for a random sample of 30 fields are given below.
1.50 3.15 3.6 4.1 5.0 7.3 10.0 10.1 12.0 13.4
15.6 17.0 28.5 30.0 31.0 31.0 31.5 32.0 33.0 33.5
34.0 35.5 36.0 40.0 41.0 42.0 46.0 48.0 54.0 78.5

Motivating Exercise 2: Summer 2011 Q2
Questions to Explore
i. Calculate the mean and median. [8 marks]
ii. Calculate the quartiles and interpret these values. [12 marks]
iii. Construct a box-plot showing all steps in your calculations. [16 marks]
iv. Based on the box-plot in (iii), provide suitable measures of centrality and spread. Include an explanation for your choice. (Note: you do not need to do any calculations.) [8 marks]
v. Suppose low moisture content was defined as moisture content less than 10%. What percentage of fields would be classed as having low moisture content? [2 marks]
vi. If you took a random sample of 200 fields in West Cork, how many low moisture content fields would you expect to find? [2 marks]
vii. What other method of presentation would be appropriate for this data? (Note: You do not need to prepare this.) [2 marks]
Thinking Ahead
In this section we study ways to describe the centre of quantitative data and the spread of quantitative data

2.2 Presenting Data 2.2.1 Frequency Distribution

Frequency Distribution
• Data should be organised and summarised in a form that allows interpretation and analysis
• Methods of presentation should give an overall feel or impression of the data at a glance
• One such method is the frequency distribution

Frequency Distribution: Example Summer 2007 Q1
• A plant scientist wants to analyse the effect of thiamine hydrochloride (vitamin B1) on vegetable transplants. A sample of 50 tomato plants treated with thiamine hydrochloride is randomly selected and observations on the height of plants 14 days after treatment are recorded. The results (in cm) are
21.5 21.6 21.8 21.8 21.8 21.9 21.9 22.0 22.1 22.1
22.1 22.2 22.2 22.3 22.4 22.5 22.5 22.5 22.5 22.6
22.6 22.6 22.7 22.7 22.8 22.8 22.8 22.9 22.9 22.9
22.9 23.0 23.0 23.0 23.1 23.2 23.2 23.2 23.2 23.3
23.3 23.4 23.5 23.5 23.6 23.7 23.8 23.9 24.0 24.2

Frequency Distribution
The raw list of 50 plant heights does not inform us of
• Where the data is concentrated (how high most plants are)
• How spread out the data is (how much variation there is in plant height)
• The extremes (whether any plants are unusually tall or short)

Frequency Distribution
• Consists of a list of class intervals and frequencies
• Class intervals must be mutually exclusive: every piece of data can be placed in only one class
• Class intervals must be all inclusive: the classes together contain all the data
• The number of intervals is generally between 6 and 12
• For the plant height data: first class interval is 21.5 to <21.9

Frequency Distribution
Choose class intervals
• First class interval 21.5 to <21.9
• Class width = 0.4

Height (cm)
Class
21.5 - <21.9
21.9 - <22.3
22.3 - <22.7
22.7 - <23.1
23.1 - <23.5
23.5 - <23.9
23.9 - <24.3

Frequency Distribution
Count the number of observations in each class interval

Height (cm)
Class           Frequency
21.5 - <21.9    5
21.9 - <22.3    8
22.3 - <22.7    9
22.7 - <23.1    12
23.1 - <23.5    8
23.5 - <23.9    5
23.9 - <24.3    3

Frequency Distribution
Height (cm)     Frequency
21.5 - <21.9    5
21.9 - <22.3    8
22.3 - <22.7    9
22.7 - <23.1    12
23.1 - <23.5    8
23.5 - <23.9    5
23.9 - <24.3    3
• Plant heights are most concentrated between 22.7cm and 23.1cm (the class with the highest frequency)
• Plants vary in height between 21.5cm and 24.3cm, with few towards these extremes
• Minimum height is 21.5cm
• Maximum height is 24.2cm (found by looking at the original data)
• No unusually tall or short plants
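The counting step above can be sketched in Python. This is a minimal illustration, not part of the original lecture: the names `class_index`, `frequencies` and the constants are hypothetical, and heights are converted to whole tenths of a centimetre (an assumption made here) so that values on a class boundary such as 21.9 are not misplaced by floating-point rounding.

```python
from collections import Counter

# Heights (cm) of the 50 tomato plants from the example
heights = [
    21.5, 21.6, 21.8, 21.8, 21.8, 21.9, 21.9, 22.0, 22.1, 22.1,
    22.1, 22.2, 22.2, 22.3, 22.4, 22.5, 22.5, 22.5, 22.5, 22.6,
    22.6, 22.6, 22.7, 22.7, 22.8, 22.8, 22.8, 22.9, 22.9, 22.9,
    22.9, 23.0, 23.0, 23.0, 23.1, 23.2, 23.2, 23.2, 23.2, 23.3,
    23.3, 23.4, 23.5, 23.5, 23.6, 23.7, 23.8, 23.9, 24.0, 24.2,
]

FIRST_LOWER_TENTHS = 215  # lower limit of the first class, 21.5 cm, in tenths
WIDTH_TENTHS = 4          # class width 0.4 cm, in tenths
N_CLASSES = 7


def class_index(height_cm):
    """Return the 0-based class a height falls in (lower limit inclusive)."""
    tenths = round(height_cm * 10)
    return (tenths - FIRST_LOWER_TENTHS) // WIDTH_TENTHS


counts = Counter(class_index(h) for h in heights)
frequencies = [counts[i] for i in range(N_CLASSES)]

for i, freq in enumerate(frequencies):
    lower = (FIRST_LOWER_TENTHS + i * WIDTH_TENTHS) / 10
    upper = (FIRST_LOWER_TENTHS + (i + 1) * WIDTH_TENTHS) / 10
    print(f"{lower:.1f} - <{upper:.1f}  {freq}")
```

Because every height maps to exactly one index, the classes are mutually exclusive, and the frequencies summing to 50 confirms they are all inclusive.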

Frequency Distribution: In-class Exercise Autumn 2006 Q1(c)
The data below are the weights (kg) of a random sample of 5 year-old children.
23 20 21 20 19 16 24 23 22 23 19 23 29 21 20 16 17 21
21 24 24 20 23 20 21 19 18 23 20 21 31 21 26 23 24
Prepare a frequency distribution of weights. Use classes of width 2; the first class should be 16 to <18.

2.2 Presenting Data 2.2.2 Histogram

Histogram
• Graphical display of a frequency distribution
• Collection of bars, one for each class interval
• Base of the histogram represents the class intervals
• Area of each bar is proportional to the frequency of the class
• Classes of equal width: bar heights are proportional to class frequency

Histogram: Example
A microbiologist has carried out an experiment investigating the genome size of 118 common viruses. The data is to be presented using a histogram from the following frequency distribution of the genome sizes (x1000 nucleotide pairs) of viruses:

Genome Sizes (x1000 nucleotide pairs)   Frequency
0-20                                    12
20-40                                   16
40-50                                   20
50-60                                   24
60-70                                   18
70-90                                   20
90+                                     8

Histogram: Example
• The smallest class width is 10
• Let this be the standard
• First two classes are both of width 20, twice the standard
• Heights will be their frequencies divided by 2

Histogram: Example
• Last class is open-ended
• Need an upper limit
• Usually assume that the class is the same width as the adjacent one
• In this case we would assign a limit of 110

Histogram: Example
• Nature of data might indicate a different limit
• For example, if it were known that the maximum value possible was 105, that value would be used
• If we knew that the data were percentages, we would assign a limit of 100
• Open-ended classes can also occur at the lower end of the frequency distribution

Histogram: Example
Genome Sizes (x1000 nucleotide pairs)
Multiple = Class width / 10; Height = Frequency / Multiple

Class     Frequency   Width   Multiple   Height
0-20      12          20      2          6
20-40     16          20      2          8
40-50     20          10      1          20
50-60     24          10      1          24
60-70     18          10      1          18
70-90     20          20      2          10
90-110    8           20      2          4
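The height calculation in this table can be reproduced with a short Python sketch. The variable names are illustrative only, and the upper limit of 110 for the open-ended class is the assumption made in the example above, not a property of the data.

```python
# (lower limit, upper limit, frequency) for each genome-size class
classes = [
    (0, 20, 12),
    (20, 40, 16),
    (40, 50, 20),
    (50, 60, 24),
    (60, 70, 18),
    (70, 90, 20),
    (90, 110, 8),  # open-ended class closed at 110 (assumed limit)
]

# The smallest class width is taken as the standard width
standard_width = min(upper - lower for lower, upper, _ in classes)

# Height = frequency / (class width / standard width), so bar AREA
# stays proportional to frequency even when widths differ
bar_heights = [
    freq / ((upper - lower) / standard_width)
    for lower, upper, freq in classes
]

for (lower, upper, freq), height in zip(classes, bar_heights):
    print(f"{lower}-{upper}: frequency {freq}, height {height:g}")
```

Dividing by the width multiple is what keeps a wide class from looking artificially tall next to a narrow one.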

Histogram: Example

Histogram: Example
[Figure: histogram of Genome Size (x1000 nucleotide pairs), 0 to 110, against Frequency, 0 to 24]
• Histograms quickly provide an idea of where the distribution of values is centred
• Example: centred between 40 and 70
• Histograms also give an idea about how spread out (variable) the distribution is

Histogram: Distribution Shape
• When we have a large number of observations, the classes may be made narrower
• Having more classes will give a much smoother appearance to the histogram
• Histogram then becomes a frequency curve
• Shape of the distribution can then be assessed

Histogram: Distribution Shape Frequency Skewed to the right (Positive skew)

Histogram: Distribution Shape Frequency Skewed to the left (Negative skew)

Histogram: Distribution Shape Symmetric Frequency

Histogram: Distribution Shape In-class Exercise (a)

Histogram: Distribution Shape In-class Exercise (b)

Histogram: Distribution Shape In-class Exercise (c)

Histogram: In-class Exercise Summer 2008 Q1(a)
The concentration of nicotine in milligrams was measured for 98 brands of cigarettes. The results are summarised in the frequency table below:

Nicotine (mg)   Frequency
0-100           2
100-150         10
150-200         15
200-250         39
250-350         14
350-500         12
500-800         6

Prepare a histogram for nicotine concentration and comment on its shape.

Histogram: In-class Exercise Summer 2008 Q1(a)
[Figure: histogram axes; x axis: Nicotine concentration (mg), 0 to 800; y axis: Frequency, 0 to 40]
• Label the x and y axes
• Each bar represents a class
• Length of bar represents height

Histogram: Distribution Shape In-class Exercise Summer 2008 Q1(a)
[Figure: histogram of Nicotine concentration (mg), 0 to 800, against Frequency, 0 to 40]

2.2 Presenting Data 2.2.3 Bar Chart

Bar Chart
• Bar chart is useful for illustrating a frequency distribution for a categorical variable
• Each category is represented by a bar
• Widths of bars are equal
• Length (or height) of the bar is proportional to the frequency within the category

Bar Chart: Example
• Data identifies the satisfaction rating given by 440 customers:
  ◦ 105 very satisfied
  ◦ 134 satisfied
  ◦ 30 dissatisfied
  ◦ 171 very dissatisfied

Bar Chart: Example
Frequency distribution

Satisfaction Level   Frequency   Percentage (%)
Very satisfied       105         23.9
Satisfied            134         30.5
Dissatisfied         30          6.8
Very dissatisfied    171         38.9
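The percentage column can be checked with a few lines of Python. This is a minimal sketch; the dictionary name `ratings` and the text-based bars are illustrative choices, not part of the lecture.

```python
# Satisfaction ratings from the 440 customers
ratings = {
    "Very satisfied": 105,
    "Satisfied": 134,
    "Dissatisfied": 30,
    "Very dissatisfied": 171,
}

total = sum(ratings.values())

# Percentage = 100 x frequency / total, rounded to one decimal place
percentages = {
    level: round(100 * count / total, 1) for level, count in ratings.items()
}

# Crude text bar chart: equal-width rows, bar length proportional to percentage
for level, pct in percentages.items():
    print(f"{level:<18} {pct:>5}%  {'#' * round(pct)}")
```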

Bar Chart: Example
[Figure: bar chart of Percent (0 to 40) by Satisfaction Level: Very satisfied, Satisfied, Dissatisfied, Very dissatisfied]

Summary
• Developed a frequency distribution
• Constructed and interpreted a histogram
• Distinguished different distribution shapes
• Displayed categorical data using tables and bar charts

What next?
• Measures of centrality
• Variability

2. Presenting & Summarising Data Descriptive Statistics Measures of Centrality & Variability

Summary of Previous Lecture
• For quantitative (numerical) variables (discrete and continuous variables in which numbers are recorded), a frequency table is developed and the data is displayed using a histogram
• The histogram shows the distribution shape of the data, such as whether the distribution is bell shaped, skewed to the right (longer tail pointing to the right) or skewed to the left (longer tail pointing to the left)
• For categorical variables (ordinal and nominal variables in which categories are recorded), data are summarised using a frequency table and displayed using bar charts

Aim & Objectives

2.3 Measures of Centrality

Measures of Centrality
• A measure of centrality (or location) is used to indicate where the central tendency or the typical value of a sample (or population) lies
• Two commonly used measures of centrality
  ◦ Mean
  ◦ Median

Mean
• Mean (or arithmetic mean) is the most familiar and most useful average
• It is calculated by summing all observations and dividing by the number of observations:
Population mean: $\mu = \frac{\sum x}{N}$
Sample mean: $\bar{x} = \frac{\sum x}{n}$

Mean: Example
• A microbiologist is investigating the size of a certain type of cell. The following are the diameters (in µm) of a sample of 10 cells: 1.2, 2.3, 2.9, 3.4, 3.5, 3.5, 4.0, 4.1, 4.9, 5.0. What is the typical diameter of such cells?

Mean: Example
Sample of 10 cells: 1.2, 2.3, 2.9, 3.4, 3.5, 3.5, 4.0, 4.1, 4.9, 5.0
Solution: Compute the mean
$\bar{x} = \frac{\sum x}{n} = \frac{1.2 + 2.3 + 2.9 + 3.4 + 3.5 + 3.5 + 4.0 + 4.1 + 4.9 + 5.0}{10} = \frac{34.8}{10} = 3.48$ µm
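As a quick check of this arithmetic, the sample mean can be computed directly in Python. The variable names are illustrative.

```python
# Diameters (µm) of the sample of 10 cells
diameters = [1.2, 2.3, 2.9, 3.4, 3.5, 3.5, 4.0, 4.1, 4.9, 5.0]

# Sample mean: sum of the observations divided by the number of observations
mean_diameter = sum(diameters) / len(diameters)

print(f"Mean diameter: {mean_diameter:.2f} µm")
```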

Median
• Median is the middle observation in a list of observations in increasing order
• Median (Med) is the value in position (n+1)/2

Median: Example
• A veterinary pharmaceutical company has devised a formulation against canine ticks. To determine whether the formulation works, the company needs to first know how many ticks would be found in a dog's coat before treatment. These are the numbers of ticks counted in the coats of a sample of 9 dogs: 2, 3, 3, 4, 7, 9, 10, 10, 217

Median: Example
• Numbers of ticks counted in the coats of a sample of 9 dogs: 2, 3, 3, 4, 7, 9, 10, 10, 217
• Position of median in ordered list: (n+1)/2 = (9+1)/2 = 10/2 = 5
• Middle value in this ordered list of 9 observations is the 5th value
• Med = 7

Median
• When the list contains an even number of observations, there is no single middle value
• In this case the median is taken to be mid-way between the 2 middle values

Median: Example
• Consider a different sample, this time of 8 dogs: 3, 3, 4, 6, 9, 10, 10, 196
• Calculate the median

Median: Example
• There are 8 values in the list 3, 3, 4, 6, 9, 10, 10, 196
• There is no single middle value
• Two middle values in positions 4 (n/2) and 5 ((n/2)+1)
• Values are 6 and 9
• Med = 6 + ½ (9-6) = 7.5
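The position rule used in both median examples can be written as a small Python function. This is a sketch under the lecture's (n+1)/2 rule; the function name `median_by_position` is hypothetical.

```python
def median_by_position(values):
    """Median as the value in position (n+1)/2 of the ordered list.

    When (n+1)/2 is not a whole number, the median lies mid-way
    between the two middle values.
    """
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2              # 1-based position in the ordered list
    lower = ordered[int(position) - 1]  # value at the whole-number position
    if position.is_integer():
        return lower
    upper = ordered[int(position)]      # next value up
    return lower + 0.5 * (upper - lower)


nine_dogs = [2, 3, 3, 4, 7, 9, 10, 10, 217]
eight_dogs = [3, 3, 4, 6, 9, 10, 10, 196]

print(median_by_position(nine_dogs))   # 5th value
print(median_by_position(eight_dogs))  # mid-way between 4th and 5th values
```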

Relationship between Mean and Median
• Example 1, sample of 9 dogs: 2, 3, 3, 4, 7, 9, 10, 10, 217
• Example 2, sample of 8 dogs: 3, 3, 4, 6, 9, 10, 10, 196
• Each example contains one very high value
• These values had no effect on the median
  ◦ Example 1: Med = 7
  ◦ Example 2: Med = 7.5
• Median is robust to extreme observations

Relationship between Mean and Median
• When extreme values occur in a set of observations, the median is the more appropriate measure of central tendency
• Mean will be strongly affected by extreme values
• Extreme values will drag the mean towards them
• Example 2, sample of 8 dogs: 3, 3, 4, 6, 9, 10, 10, 196
• Mean is 30.13 but the median is 7.5

Relationship between Mean and Median
• Example 2, sample of 8 dogs: 3, 3, 4, 6, 9, 10, 10, 196
• Mean is 30.13 but the median is 7.5
• Which of these gives a more reasonable measure of the typical number of ticks in a dog's coat?
Answer:
  ◦ Median of 7.5, as it is unaffected by the extreme value
• Biological data frequently contain one or two extreme observations (usually large rather than small)

Relationship between Mean and Median
• Similarity of the mean and median depends on the shape of the distribution
[Figure: frequency curve skewed to the right (positive skew); the median lies to the left of the mean]
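The pull of the extreme value on the mean can be seen numerically. The sketch below uses Python's standard `statistics` module; dropping the value 196 is done purely to illustrate sensitivity and is not a recommendation to discard observations.

```python
from statistics import mean, median

ticks = [3, 3, 4, 6, 9, 10, 10, 196]  # Example 2, sample of 8 dogs
without_extreme = ticks[:-1]           # same sample with 196 left out

mean_all = mean(ticks)
median_all = median(ticks)
mean_trimmed = mean(without_extreme)
median_trimmed = median(without_extreme)

print(f"With 196:    mean = {mean_all:.2f}, median = {median_all}")
print(f"Without 196: mean = {mean_trimmed:.2f}, median = {median_trimmed}")
```

The mean changes by a large amount when one observation is left out, while the median moves only from 7.5 to 6.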

Relationship between Mean and Median Frequency Skewed to the left (Negative skew) Mean Median

Relationship between Mean and Median Frequency Symmetric Mean=Median

Relationship between Mean and Median: In-class Exercise

Relationship between Mean and Median: In-class Exercise Summer 2008 Q1(a)
[Figure: histogram of Nicotine concentration (mg), 0 to 800, against Frequency, 0 to 40]
• Would the median be greater than the mean?
• Which would be the best measure of centrality?
• Why?

2.4 Variability

Variability Same means Different variability

Variability
Also known as
• Spread
• Dispersion
• Variation
Most commonly used measures of spread
• Range
• Variance
• Standard Deviation

Range
• Range is the difference between the largest and smallest observations
• Range = maximum - minimum
• Only uses two values
  ◦ Extreme values
• May not indicate true variability
• Influenced by outliers

Range: Example
• A microbiologist is interested in the cell division rates of a strain of bacteria. Calculate the range for the following sample of times (hours) for bacteria to double in size: 1, 2, 3, 3, 5, 9, 10, 41
• Range = maximum - minimum = 41 - 1 = 40 hours

Range
• Advantage of using the range is that it is simple to calculate
• Only uses the two most extreme values
• No information is used from the other observations

Range
• Consider the previous example of a sample of times (hours) for bacteria to double in size: 1, 2, 3, 3, 5, 9, 10, 41
Range = 41 - 1 = 40
• Sensitive to values that are extreme relative to adjacent values (outliers)
• If there are outliers, the range can give a distorted measure of dispersion
• Range is not robust
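A short Python sketch makes the lack of robustness concrete. The comparison without the value 41 is added here only as an illustration; it is not part of the original example.

```python
# Times (hours) for bacteria to double in size
times = [1, 2, 3, 3, 5, 9, 10, 41]

# Range = maximum - minimum
data_range = max(times) - min(times)

# Same calculation with the single extreme value (41) left out
times_without_41 = [t for t in times if t != 41]
range_without_41 = max(times_without_41) - min(times_without_41)

print(f"Range with 41:    {data_range} hours")
print(f"Range without 41: {range_without_41} hours")
```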

Variance
• Variance uses every value in its calculation
• To calculate the variance, the deviation from the mean of each observation is computed
• For a distribution with little dispersion, most values will be close to the mean
  ◦ Most deviations will be small
• For a distribution with greater dispersion, many values will be far from the mean
  ◦ Many deviations will be large

Variance
• Deviation from the mean of each observation is calculated
• Deviations are squared, summed and then divided by n-1 if the observations are from a sample, or N if the observations are from a population
Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• Sample variance $s^2$ is an estimate of the population variance $\sigma^2$

Variance: Example
• Calculate the variance for the following sample of butterfly wing lengths (mm): 23, 25, 29, 35, 41, 47, 52
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Variance: Example
Data: 23, 25, 29, 35, 41, 47, 52
$\bar{x} = \frac{\sum x}{n} = \frac{252}{7} = 36$
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum (x - 36)^2}{6}$
$= \frac{(23-36)^2 + (25-36)^2 + (29-36)^2 + (35-36)^2 + (41-36)^2 + (47-36)^2 + (52-36)^2}{6}$
$= \frac{169 + 121 + 49 + 1 + 25 + 121 + 256}{6} = \frac{742}{6} = 123.67 \text{ mm}^2$
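The same steps (deviations, squares, sum, divide by n-1) can be followed in Python. This is a minimal sketch with illustrative variable names.

```python
# Butterfly wing lengths (mm)
wing_lengths = [23, 25, 29, 35, 41, 47, 52]

n = len(wing_lengths)
sample_mean = sum(wing_lengths) / n

# Squared deviation of each observation from the sample mean
squared_deviations = [(x - sample_mean) ** 2 for x in wing_lengths]

# Sample variance: divide the sum of squared deviations by n - 1
sample_variance = sum(squared_deviations) / (n - 1)

print(f"Mean = {sample_mean:g} mm")
print(f"Sum of squared deviations = {sum(squared_deviations):g}")
print(f"Sample variance = {sample_variance:.2f} mm^2")
```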

Variance: In-class Exercise For the following data calculate the variance 3, 5, 6, 8, 6, 7, 7

Standard Deviation
• Standard deviation is closely related to the variance
• Standard deviation is the square root of the variance
• Standard deviation is usually preferred to the variance because its units are the same as those of the data
  ◦ Butterfly wing lengths: unit is mm; unit of standard deviation is also mm
• Sample standard deviation s is an estimate of the population standard deviation σ

Standard Deviation
Warning:
• Calculators provide two versions of the standard deviation
  ◦ Sample and population standard deviations
  ◦ Formula for the population standard deviation is slightly different (the divisor is N, rather than n-1)
• These keys give the sample standard deviation
  ◦ S
  ◦ σn-1
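The same two versions exist in software. As an illustration, Python's standard `statistics` module provides `stdev` (divisor n-1) and `pstdev` (divisor N); the sketch below applies both to the butterfly wing lengths from the variance example.

```python
from statistics import pstdev, stdev

# Butterfly wing lengths (mm) from the variance example
wing_lengths = [23, 25, 29, 35, 41, 47, 52]

sample_sd = stdev(wing_lengths)       # divisor n - 1: square root of 742/6
population_sd = pstdev(wing_lengths)  # divisor N: square root of 742/7

print(f"Sample standard deviation:     {sample_sd:.2f} mm")
print(f"Population standard deviation: {population_sd:.2f} mm")
```

The sample version is always the larger of the two, because it divides the same sum of squared deviations by a smaller number.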

In-class Exercise Summer 2010 Q2i,v
A study examined the distance (in km) between students' accommodation and their university. The results for 29 full-time students are given below. The study was restricted to full-time students.
0.15 0.25 0.30 0.40 0.45 1.00 1.24 1.30 1.50 1.70
1.75 2.00 2.10 2.30 2.35 2.40 3.00 3.40 3.55 3.60
3.70 4.00 4.20 4.25 5.60 6.10 6.90 7.80 8.50
• Calculate the mean and median.
• Suppose an additional student who lived 20km from his university was added to the dataset. What effect would this have on the mean and median?

Summary
• Calculated mean
• Computed median
• Examined the relationship between mean and median
• Determined how to choose between mean and median
• Described range
• Calculated variance & standard deviation

What next?
• Compute quartiles and percentiles
• Draw and interpret a boxplot

2. Presenting & Summarising Data Descriptive Statistics Quartiles & Boxplots

Summary of Previous Lecture
• For numerical variables, two measures of centrality were described: the mean, which is the sum of the observations divided by the number of observations, and the median, which divides the ordered data into two parts with equal numbers of observations
• The median is a more representative summary than the mean when the data are highly skewed
• The range is the difference between the largest and smallest observations. It uses only the two extreme values (minimum and maximum values)
• The standard deviation describes the typical deviation from the mean

Aim & Objectives

2.5 Quartiles and Percentiles

Recall the median
• Median splits a list of ordered values or the distribution into 2 halves
  ◦ 50% of values lie below Q2
  ◦ Q2 is the median

Quartiles
• Median splits a list of ordered values or the distribution into 2 halves
• Quartiles split the distribution into 4 quarters
  ◦ Each quarter has the same number of observations
• Quartiles are denoted by Q1, Q2 and Q3
  ◦ 25% of values lie below Q1
  ◦ 50% of values lie below Q2
  ◦ 75% of values lie below Q3
  ◦ Q2 is the median

Quartiles
• In an ordered (increasing) list of data
  ◦ Q1 is in the ¼(n+1)th position
  ◦ Q3 is in the ¾(n+1)th position
• Unless n+1 is divisible by 4, the quartiles cannot be calculated directly
• As with the median, we may need to interpolate between values

Quartiles: Example
• Due to increased shipping traffic in the Strait of Gibraltar, it is feared that sightings of bottle-nosed dolphins may decrease. A baseline study of the species has been carried out. The following are the numbers of sightings of different pods of bottle-nosed dolphins per day over a 2-week period: 4, 6, 7, 8, 8, 9, 12, 13, 14, 16, 16, 19, 20, 22
Calculate the quartiles

Quartiles: Example
• Data: 4, 6, 7, 8, 8, 9, 12, 13, 14, 16, 16, 19, 20, 22
Find Q1
• n = 14
• Q1 is in the ¼(14+1) = 3.75th position
• This position lies 0.75 times the distance between the 3rd position (value 7) and the 4th position (value 8)
• Q1 = 7 + 0.75 (8-7) = 7.75

Quartiles: Example
• Data: 4, 6, 7, 8, 8, 9, 12, 13, 14, 16, 16, 19, 20, 22
Find Q2
• n = 14
• Q2 is in the ½(14+1) = 7.5th position
• This position lies 0.5 times the distance between the 7th position (value 12) and the 8th position (value 13)
• Q2 = 12 + 0.5 (13-12) = 12.5

Quartiles: Example
• Data: 4, 6, 7, 8, 8, 9, 12, 13, 14, 16, 16, 19, 20, 22
Find Q3
• n = 14
• Q3 is in the ¾(14+1) = 11.25th position
• This position lies 0.25 times the distance between the 11th position (value 16) and the 12th position (value 19)
• Q3 = 16 + 0.25 (19-16) = 16.75

Quartiles: Example
• Q1 = 7.75
• Q2 = 12.5
• Q3 = 16.75
• What is the interpretation of the quartiles?

Quartiles: Example
• Q1 = 7.75: 25% of days had 7.75 or fewer sightings of different pods of bottle-nosed dolphins
• Q2 = 12.5: 50% of days had 12.5 or fewer sightings of different pods of bottle-nosed dolphins
• Q3 = 16.75: 75% of days had 16.75 or fewer sightings of different pods of bottle-nosed dolphins
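The interpolation used for the dolphin data can be sketched in Python. The helper names `value_at_position` and `quartiles_by_position` are hypothetical, and the guard for fewer than 3 observations is an assumption added so that every position falls inside the list. For comparison, the standard library's `statistics.quantiles` with its default "exclusive" method uses the same (n+1)-based positions.

```python
from statistics import quantiles


def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in an ordered list."""
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    # Move 'fraction' of the way from this value towards the next one
    return lower + fraction * (ordered[whole] - lower)


def quartiles_by_position(values):
    """Q1, Q2, Q3 taken from positions 1/4, 1/2 and 3/4 of (n+1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 observations (assumed guard)")
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


sightings = [4, 6, 7, 8, 8, 9, 12, 13, 14, 16, 16, 19, 20, 22]
q1, q2, q3 = quartiles_by_position(sightings)
print(q1, q2, q3)

# Cross-check against the standard library (default method is "exclusive")
library_quartiles = quantiles(sightings, n=4)
print(library_quartiles)
```

Other software may use different position rules, so quartiles reported elsewhere can differ slightly from these hand calculations.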

Quartiles: In-class Exercise Summer 2010 Q2ii
A study examined the distance (in km) between students' accommodation and their university. The results for 29 full-time students are given below. The study was restricted to full-time students.
0.15 0.25 0.30 0.40 0.45 1.00 1.24 1.30 1.50 1.70
1.75 2.00 2.10 2.30 2.35 2.40 3.00 3.40 3.55 3.60
3.70 4.00 4.20 4.25 5.60 6.10 6.90 7.80 8.50
Calculate the quartiles and interpret the values.

Quartiles
• Symmetric distribution
• Q2 is equidistant from Q1 and Q3
• Q2 - Q1 = Q3 - Q2

Interquartile Range
• Interquartile range (IQR = Q3 - Q1) is an alternative measure of variability to the range
• It is a modified range that is robust to extreme values
• Quartiles are special cases of percentiles

Percentiles
• Percentiles split a distribution into 100 parts
• Percentile Px is the value below which lies x% of the distribution
• Px lies in the (x(n+1)/100)th position in an ordered list
• It may be necessary to interpolate between values to calculate the percentiles
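The position rule for percentiles generalises the quartile calculation. The Python sketch below is illustrative only: the function name `percentile` is hypothetical, and clamping the position to the range 1..n (for very small or very large x) is an assumption that the lecture does not discuss.

```python
def percentile(values, x):
    """Percentile P_x from position x(n+1)/100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = x * (n + 1) / 100
    # Assumption: positions outside the list are clamped to its ends
    position = min(max(position, 1), n)
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)


# Dolphin pod sightings per day over a 2-week period
sightings = [4, 6, 7, 8, 8, 9, 12, 13, 14, 16, 16, 19, 20, 22]

p25 = percentile(sightings, 25)  # same as Q1
p50 = percentile(sightings, 50)  # same as the median, Q2
p90 = percentile(sightings, 90)  # position 13.5, mid-way between 20 and 22

print(p25, p50, p90)
```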

2.6 Boxplots

Boxplots
• Boxplots are useful for presenting and summarising data
• Provide an impression of the location and dispersion
• Used to identify outliers (extreme values that are incompatible with the rest of the values)
• Construction of a boxplot involves computing quartiles and the IQR

Boxplots
• A box extending from Q1 to Q3
• A line through the box at Q2
• Lines (whiskers) extending from the box to the most extreme observations lying within 1.5 x IQR of the box (known as adjacent values)
• An identifier for each observation beyond these lines (outliers)

Boxplots
[Diagram: box from Q1 to Q3 (length IQR) with a line at Q2; whiskers extend to the adjacent values]
• Lower adjacent value: smallest observation > Q1 - 1.5 x IQR
• Upper adjacent value: largest observation < Q3 + 1.5 x IQR

Boxplots: Example
• Construct a boxplot for the cell diameters (nm) of a sample of a type of virus
12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32

Boxplots: Example
[Diagram: box from Q1 to Q3 (length IQR) with a line at Q2; lower adjacent value is an observation > Q1 - 1.5 x IQR, upper adjacent value is an observation < Q3 + 1.5 x IQR]

Boxplots: Example
12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32
• Find Q1
• n = 23
• Q1 is in the ¼(23+1)th position = 6th position
• Q1 = 16

Boxplots: Example
12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32
• Find Q2
• n = 23
• Q2 is in the ½(23+1)th position = 12th position
• Q2 = 18

Boxplots: Example
12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32
• Find Q3
• n = 23
• Q3 is in the ¾(23+1)th position = 18th position
• Q3 = 21

Boxplots: Example
12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32
• Q1 = 16, Q2 = 18, Q3 = 21
• IQR = Q3 - Q1 = 21 - 16 = 5
• 1.5 x IQR = 1.5 x 5 = 7.5
• Q1 - 1.5 x IQR = 16 - 7.5 = 8.5
• Q3 + 1.5 x IQR = 21 + 7.5 = 28.5

Boxplots: Example
12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32
• Q1 = 16, Q2 = 18, Q3 = 21, 1.5 x IQR = 7.5
• Q1 - 1.5 x IQR = 16 - 7.5 = 8.5
• Lower adjacent value = smallest observation > Q1 - 1.5 x IQR
• Lower adjacent value = smallest observation > 8.5 = 12

Boxplots: Example
12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32
• Q1 = 16, Q2 = 18, Q3 = 21, 1.5 x IQR = 7.5
• Q3 + 1.5 x IQR = 21 + 7.5 = 28.5
• Upper adjacent value = largest observation < Q3 + 1.5 x IQR
• Upper adjacent value = largest observation < 28.5 = 25

Boxplots: Example
12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32
• Q1 = 16, Q2 = 18, Q3 = 21
• Lower adjacent value = 12
• Upper adjacent value = 25
• Observations > 25 or < 12 are outliers
• One outlier at 32; identify this observation with a symbol such as an asterisk (*) or circle

Boxplots: Example
[Figure: Box-plot of the Diameter (nm) of a virus; vertical axis: Diameter (nm), 10.0 to 35.0]
• Distribution is centred at 18nm
• Distribution is not symmetric, but slightly skewed to the right
• 50% of the distribution lies between 16 and 21nm
• One outlier, a value of 32nm

Boxplots: Construction
• Compute Q1, Q2, Q3
• Calculate the interquartile range (IQR) = Q3 - Q1
• Locate lower and upper adjacent values
  ◦ Lower adjacent value = smallest observation > Q1 - 1.5 x IQR
  ◦ Upper adjacent value = largest observation < Q3 + 1.5 x IQR
• Identify outliers
  ◦ Values > upper adjacent value
  ◦ Values < lower adjacent value
• Form box between Q1 and Q3; draw a line at Q2
• Extend whiskers to lower and upper adjacent values
• Mark outliers

Boxplots: Construction
[Diagram: box from Q1 to Q3 (length IQR) with a line at Q2; whiskers extend to the lower and upper adjacent values]
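The construction steps can be followed numerically for the virus diameter example. This Python sketch computes the values needed to draw the boxplot but does not draw it; the helper `quartile` is hypothetical and uses the (n+1)-based position rule from the quartiles section.

```python
def quartile(ordered, k):
    """k-th quartile (k = 1, 2, 3) from position k(n+1)/4, interpolating."""
    position = k * (len(ordered) + 1) / 4
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)


# Cell diameters (nm) of the virus sample
diameters = [12, 13, 14, 14, 15, 16, 17, 17, 18, 18, 18, 18,
             19, 19, 19, 20, 21, 21, 22, 23, 23, 25, 32]
ordered = sorted(diameters)

# Steps 1-2: quartiles and interquartile range
q1, q2, q3 = (quartile(ordered, k) for k in (1, 2, 3))
iqr = q3 - q1

# Step 3: adjacent values are the most extreme observations
# strictly inside Q1 - 1.5 x IQR and Q3 + 1.5 x IQR
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
lower_adjacent = min(x for x in ordered if x > lower_limit)
upper_adjacent = max(x for x in ordered if x < upper_limit)

# Step 4: outliers lie beyond the adjacent values
outliers = [x for x in ordered if x < lower_adjacent or x > upper_adjacent]

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"Whiskers from {lower_adjacent} to {upper_adjacent}")
print(f"Outliers: {outliers}")
```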

Boxplots
• Boxplots are useful for comparing two or more distributions
• To compare the distributions of two samples, two boxplots could be prepared on the same axes
• What can be compared between distributions
  ◦ Locations of the distributions
  ◦ Minimum and maximum values
  ◦ Dispersions
  ◦ Shapes of the distributions

Boxplots: Example Cell diameters (nm) of samples of 2 types of virus. Type 1: 12 13 14 14 15 16 17 17 18 18 18 18 19 19 19 20 21 21 22 23 23 25 32 Type 2: 15 15 16 17 18 19 20 22 23 24 24 25 26 27 28 30 30 31 33 34 36 37 38 39 40

Boxplots: Example
[Figure: Box-plot of Diameters (nm) of 2 types of virus; vertical axis: Diameter (nm), 10.0 to 45.0; one boxplot each for Type 1 and Type 2]

Boxplots In-class Exercise Summer 2010 Q2iii,iv,vi
A study examined the distance (in km) between students' accommodation and their university. The results for 29 full-time students are given below. The study was restricted to full-time students.
0.15 0.25 0.30 0.40 0.45 1.00 1.24 1.30 1.50 1.70
1.75 2.00 2.10 2.30 2.35 2.40 3.00 3.40 3.55 3.60
3.70 4.00 4.20 4.25 5.60 6.10 6.90 7.80 8.50
• Construct a box-plot showing all calculations.
• Based on the box-plot, provide suitable measures of centrality and spread. Include an explanation for your choice. (Note: you do not need to do any calculations.)
• What other method of presentation would be appropriate for this data? (Note: You do not need to prepare this.)

Summary
• Computed quartiles
  ◦ Q1
  ◦ Q2
  ◦ Q3
• Identified outliers
• Constructed boxplots
• Examined distribution shape

What next?
• Motivating Exercises

2. Presenting & Summarising Data Descriptive Statistics Motivating Exercises

Summary of Previous Lecture
• The interquartile range (IQR) is the distance from the lower quartile to the upper quartile and spans the middle half of the data.
• IQR is a more resistant measure of spread than the range, as it is unaffected by extreme observations.
• When data are highly skewed, the standard deviation has no meaning.
• The five number summary of a dataset consists of the minimum value, first quartile, median, third quartile and maximum value, and forms the basis of the boxplot.
• The boxplot provides information about centrality (by the median), spread (by the interquartile range, first quartile to third quartile) and outliers (values more than 1.5 x IQR below the first quartile or above the third quartile).
• An outlier is an extreme value falling far below or above the bulk of the data.

2.1 Motivating Exercises

What next?
• Probability