SAMPLE. Describing the distribution of a single variable

Nicholas Barber
5 years ago

CHAPTER 22
Describing the distribution of a single variable

Objectives
To introduce the two main types of data: categorical and numerical
To use bar charts to display frequency distributions of categorical data
To use histograms and frequency polygons to display frequency distributions of numerical data
To use cumulative frequency polygons and cumulative relative frequency polygons to display cumulative frequency distributions
To use the stem-and-leaf plot to display numerical data
To use the histogram to display numerical data
To use these plots to describe the distribution of a numerical variable in terms of symmetry, centre, spread and outliers
To define and calculate the summary statistics mean, median, range, interquartile range, variance and standard deviation
To understand the properties of these summary statistics and when each is appropriate
To construct and interpret boxplots, and use them to compare data sets

22.1 Types of variables

A characteristic about which information is recorded is called a variable, because its value is not always the same. Several types of variable can be identified. Consider the following situations.

Students answer a question by selecting yes, no or don't know.
Students say how they feel about a particular statement by ticking one of strongly agree, agree, no opinion, disagree or strongly disagree.
Students write down the size shoe that they take.
Students write down their height.

These situations give rise to two different types of data. The data arising from the first two situations are called categorical data, because the data can only be classified by the name of the category from which they come; there is no quantity associated with each category.

The data arising from the third and fourth situations are called numerical data. These examples differ slightly from each other in the type of numerical data they each generate. Shoe sizes are of the form ..., 6, 6.5, 7, 7.5, ... These are called discrete data, because the data can only take particular values. Discrete data often arise in situations where counting is involved.

The other type of numerical data is continuous data, where the variable may take any value (sometimes within a specified interval). Such data arise when students measure height. In fact, continuous data often arise when measuring is involved.

Exercise 22A

1 Classify the data which arise from the following situations as categorical or numerical.
a Kindergarten pupils bring along their favourite toy, and they are grouped together under the headings: dolls, soft toys, games, cars, and other.
b The number of students on each of twenty school buses is counted.
c A group of people each write down their favourite colour.
d Each student in a class is weighed in kilograms.
e Each student in a class is weighed and then classified as light, average or heavy.
f People rate their enthusiasm for a certain rock group as low, medium, or high.

2 Classify the data which arise from the following situations as categorical or numerical.
a The intelligence quotient (IQ) of a group of students is measured using a test.
b A group of people are asked to indicate their attitude to capital punishment by selecting a number from 1 to 5, where 1 = strongly disagree, 2 = disagree, 3 = undecided, 4 = agree, and 5 = strongly agree.

3 Classify the following numerical data as either discrete or continuous.
a The number of pages in a book.
b The price paid to fill the tank of a car with petrol.
c The volume of petrol used to fill the tank of a car.
d The time between the arrival of successive customers at an autobank teller.
e The number of tosses of a die required before a six is thrown.

Essential Advanced General Mathematics

22.2 Displaying categorical data: the bar chart

Suppose a group of 130 students were asked to nominate their favourite kind of music under the categories hard rock, oldies, classical, rap, country or other. The table shows the data for the first few students.

Student's name    Favourite music
Daniel            hard rock
Karina            classical
John              country
Jodie             hard rock

The table gives data for individual students. To consider the group as a whole, the data should be collected into a table called a frequency distribution by counting how many times each of the different values of the variable has been observed. Counting the number of students who responded to the question on favourite kinds of music gave the following results in each category.

[Table: number of students in each category (hard rock, other, oldies, classical, rap, country); the counts are not reproduced here.]

While a clear indication of the group's preferences can be seen from the table, a visual display may be constructed to illustrate this. When the data are categorical, the appropriate display is a bar chart. The categories are indicated on the horizontal axis and the corresponding numbers in each category are shown on the vertical axis.

[Figure: bar chart of number of students (vertical axis) against type of music (horizontal axis), with bars for hard rock, other, oldies, classical, rap and country.]

The order in which the categories are listed on the horizontal axis is not important, as no order is inherent in the category labels. In this particular bar chart, the categories are listed in decreasing order by number.

From the bar chart the music preferences for the group of students may be easily compared. The value which occurs most frequently is called the mode of the variable. Here it can be seen that the mode is hard rock.
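The counting step described above can be sketched in a few lines of Python. This is only an illustration: it uses the four responses listed in the first table (the full set of 130 responses is not available here), and the variable names are assumptions made for the example.

```python
from collections import Counter

# The four individual responses shown in the table above.
# The remaining responses from the group of 130 are not reproduced.
favourite_music = {
    "Daniel": "hard rock",
    "Karina": "classical",
    "John": "country",
    "Jodie": "hard rock",
}

# A frequency distribution counts how often each category occurs.
frequency = Counter(favourite_music.values())

# most_common() lists categories in decreasing order of frequency,
# which is the ordering used in the bar chart described above.
for category, count in frequency.most_common():
    print(f"{category:<10} {'#' * count} ({count})")

# The mode is the category that occurs most frequently.
mode, mode_count = frequency.most_common(1)[0]
print("mode:", mode)
```

With only four responses the text bars are very short, but the same code applies unchanged to a full list of responses.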

Exercise 22B

1 A group of students were asked to select their favourite type of fast food, with the following results.

Food type         Number of students
hamburgers        23
chicken           7
fish and chips    6
Chinese           7
pizza             18
other             8

a Draw a bar chart for these data.
b Which is the most popular food type?

2 The following responses were received to a question regarding the return of capital punishment.

strongly agree       21
agree                11
don't know           42
disagree             53
strongly disagree    129

a Draw a bar chart for these data.
b How many respondents either agree or strongly agree?

3 A video shop proprietor took note of the type of films borrowed during a particular day, with the following results.

comedy    53
drama     89
horror    42
music     15
other     33

a Construct a bar chart to illustrate these data.
b Which is the least popular film type?

4 A survey of secondary school students' preferred ways of spending their leisure time at home gave the following results.

watch TV           42%
read               13%
listen to music    23%
watch a video      12%
phone friends      4%
other              6%

a Construct a bar chart to illustrate these data.
b What is the most common leisure activity?

22.3 Displaying numerical data: the histogram

In previous studies you have been introduced to various ways of summarising and displaying numerical data, including dotplots, stem-and-leaf plots, histograms and boxplots. Constructing a histogram for discrete numerical data is demonstrated in Example 1.

Example 1
The numbers of siblings reported by each student in Year 11 at a local school are as follows:

[data not reproduced]

Construct a frequency distribution of the number of siblings.

Solution
To construct the frequency distribution, count the number of students corresponding to each number of siblings, as shown.

[Table: frequency against number of siblings; values not reproduced]

A histogram looks similar to a bar chart, but because the data are numeric there is a natural order to the plot which may not occur with a bar chart. Usually for discrete data the actual data values are located at the middle of the appropriate column, as shown.

[Figure: histogram of frequency against number of siblings]

An alternative display for a frequency distribution is a frequency polygon. It is formed by plotting the values in the frequency histogram as points, which are then joined by straight lines. A frequency polygon for the data in Example 1 is shown by the red line in this diagram.

[Figure: frequency polygon drawn over the histogram of frequency against number of siblings]

When the range of responses is large, it is usual to gather the data together into sub-groups or class intervals. The number of data values corresponding to each class interval is called the class frequency.

Class intervals should be chosen according to the following principles:
Every data value should be in an interval.
The intervals should not overlap.
There should be no gaps between the intervals.

The choice of intervals can vary, but generally a division which results in about 5 to 15 groups is preferred. It is also usual to choose an interval width which is easy for the reader to interpret, such as 10 units, 100 units or 1000 units (depending on the data). By convention, the beginning of the interval is given the appropriate exact value, rather than the end. For example, intervals of 0-49, 50-99 and so on would be preferred over intervals of 1-50 and so on.

Example 2
A researcher asked a group of people to record how many cups of coffee they drank in a particular week. Here are her results.

[data not reproduced]

Construct a frequency distribution and hence a histogram of these data.

Solution
Because there are so many different results and they are spread over a wide range, the data are summarised into class intervals. As the minimum value is 0 and the maximum is 34, intervals of width 5 would be appropriate, giving the frequency distribution shown in the table.

[Table: frequency against number of cups of coffee; values not reproduced]

The corresponding histogram may then be drawn.

[Figure: histogram of frequency against number of cups of coffee]

Example 2 was concerned with a discrete numerical variable. When constructing a frequency distribution of continuous data, the data are again grouped, as shown in Example 3.
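The grouping of discrete data into class intervals can be sketched as follows. The coffee data from Example 2 are not available here, so the list `cups` below is hypothetical; it was chosen only so that its minimum is 0 and its maximum is 34, as in the example. The function name `class_frequencies` is likewise an assumption made for this illustration.

```python
from collections import Counter


def class_frequencies(values, width, start=0):
    """Group discrete values into classes start..start+width-1, and so on.

    Returns a list of (lower, upper, frequency) tuples with no gaps and no
    overlaps, so every data value falls in exactly one interval.
    """
    if any(v < start for v in values):
        raise ValueError("every data value must fall in an interval")
    # Integer division assigns each value the index of its class.
    counts = Counter((v - start) // width for v in values)
    n_classes = max(counts) + 1
    return [
        (start + k * width, start + (k + 1) * width - 1, counts.get(k, 0))
        for k in range(n_classes)
    ]


# Hypothetical data (NOT the values from Example 2): minimum 0, maximum 34.
cups = [0, 2, 3, 5, 5, 7, 8, 9, 10, 12, 14, 15, 17, 21, 23, 28, 34]

table = class_frequencies(cups, width=5)
for low, high, count in table:
    print(f"{low:>2}-{high:<2}  {count}")
```

A width of 5 gives seven classes (0-4, 5-9, ..., 30-34) for these values, which sits inside the suggested range of about 5 to 15 groups.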

Example 3
The following are the heights of the players in a basketball club, measured to the nearest millimetre.

[data not reproduced]

Construct a frequency distribution and hence a histogram of these data.

Solution
From the data it seems that intervals of width 5 will be suitable. All values of the variable which are 170 or more, but less than 175, have been included in the first interval. The second interval includes values from 175 to less than 180, and so on for the rest of the table.

[Table: frequency against player heights]

The histogram of these data is shown here.

[Figure: histogram of frequency against player heights]

The interval in a frequency distribution which has the highest class frequency is called the modal class. Here the modal class is 180 to less than 185.

Using the TI-Nspire
The calculator can be used to construct a histogram for numerical data. This will be illustrated using the basketball player height data from Example 3.

The data are most easily entered in a Lists & Spreadsheet application (3).
First, use the up/down arrows to move to the top of the first column and name it height. Then enter each of the 41 numbers as shown.
Open a Data & Statistics application (5) to graph the data. At first the data displays as shown.
Specify the x variable by selecting Add X Variable from the Plot Properties menu (menu 2 4) and selecting height. The data now displays as shown. (Note: It is also possible to use the NavPad to move down below the x-axis and click to add the x variable.)
Select Histogram from the Plot Type menu (menu 1 3). The data now displays as shown.
Select Bin Settings from the Histogram Properties submenu of the Plot Properties menu (menu 2 2 2). Let Width = 5 and Alignment = 170.
Finally, select Zoom-Data from the Window/Zoom menu (menu 5 2) to display the data as shown.

Using the Casio ClassPad
The calculator can be used to construct a histogram for numerical data. This will be illustrated using the basketball player height data from Example 3.
In the Statistics application, enter the data into list1, tapping EXE to enter each value and move down the column.
Tap SetGraph, Setting... and the tab for Graph 1, enter the settings shown and tap SET.
Tap SetGraph, StatGraph1 and then tap the box to tick and select the graph.
Tap the graph icon to produce the graph, selecting HStart = 170 (the left bound of the histogram) and HStep = 5 (the desired interval width) when prompted, in keeping with the intervals used in Example 3. The histogram is produced as shown.
With the graph window selected (bold border), tap 6 to adjust the viewing window for the graph.
Tap Analysis, Trace and use the navigator key to move from column to column and display the count for that column.

Relative and percentage frequencies

When frequencies are expressed as a proportion of the total number they are called relative frequencies. By expressing the frequencies as relative frequencies, more information is obtained about the data set. Multiplying the relative frequencies by 100 readily converts them to percentage frequencies, which are easier to interpret. An example of the calculation of relative and percentage frequencies is shown in Example 4.

Example 4
Construct a relative frequency distribution and a percentage frequency distribution for the player height data.

Solution

Player heights (cm)    Frequency    Relative frequency    Percentage frequency
170 to <175            4            4/41 = 0.10           10%
175 to <180            5            5/41 = 0.12           12%
180 to <185            13           13/41 = 0.32          32%
185 to <190            9            9/41 = 0.22           22%
190 to <195            7            7/41 = 0.17           17%
195 to <200            1            1/41 = 0.02           2%
200 to <205            2            2/41 = 0.05           5%

From this table it can be seen, for example, that nine out of forty-one, or 22% of players, have heights from 185 cm to less than 190 cm.

Both the relative frequency histogram and the percentage frequency histogram are identical in shape to the frequency histogram; only the vertical scale is changed. To construct either of these histograms from a list of data, use a graphics calculator to construct the frequency histogram, and then convert the individual frequencies to either relative frequencies or percentage frequencies one by one as required.

Cumulative frequency distribution

To answer questions concerning the number or proportion of the data values which are less than a given value, a cumulative frequency distribution or a cumulative relative frequency distribution can be constructed. In both a cumulative frequency distribution and a cumulative relative frequency distribution, the numbers of observations in each class are accumulated from low to high values of the variable.
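The conversions described above amount to a division by the total and a running sum, which can be sketched as follows. One assumption needs flagging: the class frequencies 5, 13, 9, 7, 1 and 2 are legible in the extracted table, while the first frequency (4) is inferred here from the stated total of 41 players. The variable names are illustrative only.

```python
from itertools import accumulate

# Lower bound of each class of width 5 (heights in cm).
lower_bounds = [170, 175, 180, 185, 190, 195, 200]

# Class frequencies for the player height data.
# ASSUMPTION: the first value (4) is inferred from the total of 41 players.
frequencies = [4, 5, 13, 9, 7, 1, 2]

total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

# Percentage frequency = relative frequency * 100.
percentage = [100 * r for r in relative]

# Cumulative frequency = running total of the class frequencies.
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative]

print("height  freq  rel   pct  cum  cum rel")
for low, f, r, p, c, cr in zip(lower_bounds, frequencies, relative,
                               percentage, cumulative, cumulative_relative):
    print(f"{low}-  {f:>4}  {r:.2f}  {p:>3.0f}%  {c:>3}  {cr:.2f}")
```

The fourth row reproduces the 22% quoted for heights from 185 cm to less than 190 cm, and the third cumulative row gives the proportion of players shorter than 185 cm.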

Example 5
Construct a cumulative frequency distribution and a cumulative relative frequency distribution for the data in Example 4.

Solution

Player heights (cm)    Cumulative frequency    Cumulative relative frequency
<170                   0                       0.00
<175                   4                       0.10
<180                   9                       0.22
<185                   22                      0.54
<190                   31                      0.76
<195                   38                      0.93
<200                   39                      0.95
<205                   41                      1.00

Each cumulative frequency was obtained by adding preceding values of the frequency. In the same way the cumulative relative frequencies were obtained by adding preceding relative frequencies. Thus it can be said that a proportion of 0.54, or 54%, of players are less than 185 cm tall.

A graphical representation of a cumulative frequency distribution is called a cumulative frequency polygon and has a distinctive appearance, as it always starts at zero and is non-decreasing. This graph shows, on the vertical axis, the number of players shorter than any height given on the horizontal axis.

[Figure: cumulative frequency polygon of cumulative frequency (0 to 40) against player heights]

The cumulative relative frequency distribution could also be plotted as a cumulative relative frequency polygon, which would differ from the cumulative frequency polygon only in the scale on the vertical axis, which would run from 0 to 1.

Exercise 22C

1 The number of pets reported by each student in a class is given in the following table:

[data not reproduced]

Construct a frequency distribution of the numbers of pets reported by each student.

2 The number of children in the family for each student in a class is shown in this histogram.

[Figure: histogram of number of students against size of family]

a How many students are the only child in a family?
b What is the most common number of children in the family?
c How many students come from families with six or more children?
d How many students are there in the class?

3 The following histogram gives the scores on a general knowledge quiz for a class of Year 11 students.

[Figure: histogram of number of students against marks]

a How many students scored from [range not reproduced] marks?
b How many students attempted the quiz?
c What is the modal class?
d If a mark of 50 or more is designated as a pass, how many students passed the quiz?

4 The maximum temperatures for several capital cities around the world on a particular day, in degrees Celsius, were:

[data not reproduced]

a Use a class interval of 5 to construct a frequency distribution for these data.
b Construct the corresponding relative frequency distribution.
c Draw a histogram from the frequency distribution.
d What percentage of cities had a maximum temperature of less than 25 °C?

5 A student purchases 21 new textbooks from a school book supplier, with the following prices (in dollars):

[data not reproduced]

a Draw a histogram of these data using appropriate class intervals.
b What is the modal class?
c Construct a cumulative frequency distribution for these data and draw the cumulative frequency polygon.

6 A group of students were asked to draw a line which they estimated to be the same length as a 30 cm ruler. The lines were then measured (in cm), with the following results:

[data not reproduced]

a Construct a histogram of the frequency distribution.
b Construct a cumulative frequency distribution for these data and draw the cumulative frequency polygon.
c Write a sentence to describe the students' performance on this task.

7 The following are the marks obtained by a group of Year 11 Chemistry students on the end-of-year exam:

[data not reproduced]

a Using a graphics calculator, or otherwise, construct a histogram of the frequency distribution.
b Construct a cumulative frequency distribution for these data and draw the cumulative frequency polygon.
c Write a sentence to describe the students' performance on this exam.

8 The following 50 values are the lengths (in metres) of some par 4 golf holes from Melbourne golf courses:

[data not reproduced]

a Construct a histogram of the frequency distribution.
b Construct a cumulative frequency distribution for these data and draw the cumulative frequency polygon.

c Use the cumulative frequency polygon to estimate:
i the proportion of par 4 holes below 300 m in length
ii the proportion of par 4 holes 360 m or more in length
iii the length which is exceeded by 90% of the par 4 holes

22.4 Characteristics of distributions of numerical variables

Distributions of numerical variables are characterised by their shapes and by special features such as centre and spread.

Two distributions are said to differ in centre if the values of the variable in one distribution are generally larger than the values of the variable in the other distribution. Consider, for example, the following histograms shown on the same scale.

[Figure: histograms a and b]

It can be seen that plot b is identical to plot a but moved horizontally several units to the right, indicating that these distributions differ in the location of their centres.

The next pair of histograms also differ, but not in the same way. While both histograms are centred at about the same place, histogram d is more spread out. Two distributions are said to differ in spread if the values of the variable in one distribution tend to be more spread out than the values of the variable in the other distribution.

[Figure: histograms c and d]

A distribution is said to be symmetric if it forms a mirror image of itself when folded in the middle along a vertical axis; otherwise it is said to be skewed. Histogram e is perfectly symmetrical, while f shows a distribution which is approximately symmetric.

[Figure: histograms e and f]

If a histogram has a short tail to the left and a long tail pointing to the right, it is said to be positively skewed (because the long tail extends towards the positive end of the distribution), as shown in histogram g.

If a histogram has a short tail to the right and a long tail pointing to the left, it is said to be negatively skewed (because the long tail extends towards the negative end of the distribution), as shown in histogram h.

[Figure: histogram g, positively skewed; histogram h, negatively skewed]

Knowing whether a distribution is skewed or symmetric is important, as this guides the choice of appropriate summary statistics, as will be seen in the next section.

Exercise 22D

1 Do the following pairs of distributions differ in centre, spread, both or neither?

[Figure: three pairs of histograms labelled a, b and c]

2 Describe the shape of each of the following histograms.

[Figure: three histograms labelled a, b and c]

3 What is the shape of the histogram drawn in Question 6 of Exercise 22C?
4 What is the shape of the histogram drawn in Question 7 of Exercise 22C?
5 What is the shape of the histogram drawn in Question 8 of Exercise 22C?

22.5 Stem-and-leaf plots

An informative data display for a small (fewer than 50 values) numerical data set is the stem-and-leaf plot. The construction of the stem-and-leaf plot is illustrated in Example 6.

Example 6
By the end of 2004 the number of test matches played, as captain, by each of the Australian cricket captains was:

[data not reproduced]

Construct a stem-and-leaf plot of these data.

Solution
To make a stem-and-leaf plot, find the smallest and the largest data values. From the table above, the smallest value is 1, which is given a 0 in the tens column, and the largest is 93, which has a 9 in the tens column. This means that the stems are chosen to be from 0 to 9. These are written in a column with a vertical line to their right, as shown.

[Figure: column of stems 0 to 9 with a vertical line to the right]

The units for each data point are then entered to the right of the dividing line. They are entered initially in the order in which they appear in the data. When all data points are entered in the table, the stem-and-leaf plot looks like this.

[Figure: unordered stem-and-leaf plot]

To complete the plot the leaves are ordered, and a key is added to specify the place value of the stem and the leaves.

[Figure: ordered stem-and-leaf plot. Key: 3 | 9 indicates 39 matches]

It can be seen from this plot that one captain has led Australia in many more test matches than any other (Allan Border, who captained Australia in 93 test matches). When a value sits away from the main body of the data it is called an outlier.
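The construction steps above (choose stems, attach leaves, order the leaves) can be sketched in Python. The captaincy data are not available here, so the list `matches` is hypothetical apart from the three values the text mentions (1, 39 and 93); the function name `stem_and_leaf` is also an assumption made for this illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot (tens as stems, units as leaves)."""
    leaves = defaultdict(list)
    # Sorting first means the leaves on each stem are already ordered.
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Include every stem from smallest to largest so that gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows


# HYPOTHETICAL data: only 1, 39 and 93 are taken from the text above.
matches = [1, 3, 5, 8, 10, 11, 16, 21, 24, 25, 28, 30, 39, 48, 50, 93]

plot = stem_and_leaf(matches)
print("\n".join(plot))
print("Key: 3 | 9 indicates 39 matches")
```

Printing the empty stems 6, 7 and 8 is a deliberate choice: the gap between the main body of the data and the value 93 is what makes the outlier stand out.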

Stem-and-leaf plots have the advantage of retaining all the information in the data set while achieving a display not unlike that of a histogram (turned on its side). In addition, a stem-and-leaf plot clearly shows:
the range of values
where the values are concentrated
the shape of the data set
whether there are any gaps in which no values are observed
any unusual values (outliers).

Grouping the leaves in tens is simplest; other convenient groupings are in fives or twos, as shown in Example 7.

Example 7
The birth weights, in kilograms, of the first 30 babies born at a hospital in a selected month are as follows.

[data not reproduced]

Construct a stem-and-leaf plot of these data.

Solution
A stem-and-leaf plot of the birth weights, with the stem representing units and the leaves representing one-tenth of a unit, may be constructed.

[Figure: stem-and-leaf plot with one row per stem. Key: 3 | 0 indicates 3.0 kilograms]

The plot, which allows one row for each different stem, appears to be too compact. These data may be better displayed by constructing a stem-and-leaf plot with two rows for each stem. These rows correspond to the digits {0, 1, 2, 3, 4} in the first row and {5, 6, 7, 8, 9} in the second row.

[Figure: stem-and-leaf plot with two rows per stem. Key: 3 | 0 indicates 3.0 kilograms]

The only other possibility for a stem-and-leaf plot is one which has five rows per stem. These rows correspond to the digits {0, 1}, {2, 3}, {4, 5}, {6, 7} and {8, 9}.

[Figure: stem-and-leaf plot with five rows per stem. Key: 3 | 0 indicates 3.0 kilograms]

None of the stem-and-leaf displays shown is correct or incorrect. A stem-and-leaf plot is used to explore data, and more than one may need to be constructed before the most informative one is obtained. Again, from 5 to 15 rows is generally the most helpful, but this may vary in individual cases.

When the data have too many digits for a convenient stem-and-leaf plot they should be rounded or truncated. Truncating a number means simply dropping off the unwanted digits. So, for example, any value from 149.5 up to (but not including) 150 would become 149 if truncated to three digits, but 150 if rounded to three digits. Since the object of a stem-and-leaf display is to give a feeling for the shape and patterns in the data set, the decision on whether to round or truncate is not very important; however, when constructing a stem-and-leaf display the data are generally truncated, as this is what commonly used data analysis computer packages will do.

Some of the most interesting investigations in statistics involve comparing two or more data sets. Stem-and-leaf plots are useful displays for the comparison of two data sets, as shown in the following example.

Example 8
The following table gives the number of disposals by members of the Port Adelaide and Brisbane football teams in the 2004 AFL Grand Final.

[Table: disposals for Port Adelaide and Brisbane players; values not reproduced]

Construct back-to-back stem-and-leaf plots of these data.

Solution
To compare the two groups, the stem-and-leaf plots are drawn back to back, using two rows per stem.

[Figure: back-to-back stem-and-leaf plot with Port Adelaide leaves on the left of the stem and Brisbane leaves on the right. Key: 0 | 2 represents 20 disposals (left side); 2 | 0 represents 20 disposals (right side)]

The leaves on the left of the stem are centred slightly higher than the leaves on the right, which suggests that, overall, Port Adelaide recorded more disposals. The spread of disposals for Port Adelaide appears narrower than that of the Brisbane players.

Exercise 22E

1 The monthly rainfall for Melbourne, in a particular year, is given in the following table (in millimetres).

[Table: rainfall (mm) for each month J F M A M J J A S O N D; values not reproduced]

a Construct a stem-and-leaf plot of the rainfall, using the following stems.

[stems not reproduced]

b In how many months is the rainfall 60 mm or more?

2 An investigator recorded the amount of time 24 similar batteries lasted in a toy. Her results in hours were:

[data not reproduced]

a Make a stem-and-leaf plot of these times with two rows per stem.
b How many of the batteries lasted for more than 30 hours?

3 The amount of time (in minutes) that a class of students spent on homework on one particular night was:

[data not reproduced]

a Make a stem-and-leaf plot of these times.
b How many students spent more than 60 minutes on homework?
c What is the shape of the distribution?

4 The costs of various brands of track shoes at a retail outlet are as follows (four of the twenty prices are not reproduced).

$49.99 $75.49 $68.99 $75.99 $39.99 $84.99 $36.98 $95.49 $28.99 $25.49 $78.99 $45.99 $46.99 $76.99 $82.99 $79.99

a Construct a stem-and-leaf plot of these data.
b What is the shape of the distribution?

5 The students in a class were asked to write down the ages of their mothers and fathers.

[Table: mother's age and father's age; values not reproduced]

a Construct a back-to-back stem-and-leaf plot of these data sets.
b How do the ages of the students' mothers and fathers compare in terms of shape, centre and spread?

6 The results of a mathematics test for two different classes of students are given in the table.

[Table: results for Class A and Class B; values not reproduced]

a Construct a back-to-back stem-and-leaf plot to compare the data sets.
b How many students in each class scored less than 50%?
c Which class do you think performed better overall on the test? Give reasons for your answer.

22.6 Summarising data

A statistic is a number that can be computed from data. Certain special statistics are called summary statistics, because they numerically summarise special features of the data set under consideration. Of course, whenever any set of numbers is summarised into just one or two figures much information is lost, but if the summary statistics are well chosen they will also help to reveal the message which may be hidden in the data set.

Summary statistics are generally either measures of centre or measures of spread. There are many different examples of each of these measures, and there are situations when one of the measures is more appropriate than another.

Measures of centre

Mean
The most commonly used measure of centre of a distribution of a numerical variable is the mean. This is calculated by summing all the data values and dividing by the number of values in the data set.

Example 9
The following data set shows the number of premierships won by each of the current AFL teams, up until the end of [year not reproduced]. Find the mean of the number of premiership wins.

Team             Premierships
Carlton          16
Essendon         16
Collingwood      14
Melbourne        12
Fitzroy/Lions    11
Richmond         10
Hawthorn         9
Geelong          6
Kangaroos        4
Sydney           3
West Coast       2
Adelaide         2
Port Adelaide    1
W Bulldogs       1
St Kilda         1
Fremantle        0

Solution
\[ \text{mean} = \frac{16 + 16 + 14 + 12 + 11 + 10 + 9 + 6 + 4 + 3 + 2 + 2 + 1 + 1 + 1 + 0}{16} = \frac{108}{16} \approx 6.8 \]

The mean of a sample is always denoted by the symbol $\bar{x}$, which is called x bar. In general, if n observations are denoted by $x_1, x_2, \dots, x_n$, the mean is
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \]
or, in a more compact version,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
where the symbol $\Sigma$ is the upper case Greek sigma, which in mathematics means the sum of the terms.

Note: The subscripts on the x's are used to identify all of the n different values of x. They do not mean that the x's have to be written in any special order. The values of x in the example are in order only because they were listed in that way in the table.

Median
Another useful measure of the centre of a distribution of a numerical variable is the middle value, or median. To find the value of the median, all the observations are listed in order and the middle one is the median. For example, if 11 observations are listed in order and the sixth of them is 6, then the median is 6, as there are five observations on either side of this value.

Example 10
Find the median number of premierships in the AFL ladder using the data in Example 9.

Solution
As the data are already given in order, it only remains to decide which is the middle observation. Since there are 16 entries in the table there is no actual middle observation, so the median is chosen as the value halfway between the two middle observations, in this case the eighth and ninth (6 and 4). Thus the median is equal to $\frac{1}{2}(6 + 4) = 5$. The interpretation here is that of the teams currently playing in the AFL, half (or 50%) have won the premiership 5 or more times and half (or 50%) have won the premiership 5 or fewer times.

In general, to compute the median of a distribution:
Arrange all the observations in ascending order according to size.
If n, the number of observations, is odd, then the median is the $\left(\frac{n+1}{2}\right)$th observation from the end of the list.
If n, the number of observations, is even, then the median is found by averaging the two middle observations in the list. That is, to find the median the $\frac{n}{2}$th and the $\left(\frac{n}{2} + 1\right)$th observations are added together and divided by 2.

The median value is easily determined from a stem-and-leaf plot by counting to the required observation or observations from either end.
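As a check on Examples 9 and 10, the two definitions can be written out directly in Python. This is a minimal sketch that follows the rules stated above; the premiership counts are those listed in the table for Example 9, and the function names are chosen only for this illustration.

```python
# Premierships from the table in Example 9, in the order listed.
premierships = [16, 16, 14, 12, 11, 10, 9, 6, 4, 3, 2, 2, 1, 1, 1, 0]


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Middle observation, or the average of the two middle observations."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation (index middle, counting from 0).
        return ordered[middle]
    # Even n: average the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


print("mean   =", mean(premierships))    # 108 / 16
print("median =", median(premierships))  # halfway between 4 and 6
```

The exact mean is 6.75, which the text rounds to 6.8; the median of 5 agrees with Example 10.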

From Examples 9 and 10, the mean number of premierships won (6.8) and the median number of premierships won (5) have already been determined. These values are different, and the interesting question is: why are they different, and which is the better measure of centre for this example? To help answer this question, consider a stem-and-leaf plot of these data.

[Figure: stem-and-leaf plot of the number of premierships]

From the stem-and-leaf plot it can be seen that the distribution is positively skewed. This example illustrates a property of the mean. When the distribution is skewed, or if there are one or two very extreme values, the value of the mean may be quite significantly affected. The median is not so affected by unusual observations. In such cases the median is generally preferred as a measure of centre, as it will give a better typical value of the variable under consideration.

Mode
The mode is the observation which occurs most often. It is a useful summary statistic, particularly for categorical data, which do not lend themselves to some of the other numerical summary methods. Many texts state that the mode is a third option for a measure of centre, but this is generally not true. Sometimes data sets do not have a mode, or they have several modes, or they have a mode which is at one or other end of the range of values.

Measures of spread

Range
A measure of spread is calculated in order to judge the variability of a data set. That is, are most of the values clustered together, or are they rather spread out? The simplest measure of spread can be determined by considering the difference between the smallest and the largest observations. This is called the range.

Example 11
Consider the marks, for two different tasks, awarded to a group of students.

[Table: marks for Task A and Task B; values not reproduced]

Find the range of each of these data sets.

Solution

For Task A, the minimum mark is 2 and the maximum mark is 94.
Range for Task A = 94 − 2 = 92
For Task B, the minimum mark is 11 and the maximum mark is 91.
Range for Task B = 91 − 11 = 80
The range for Task A is greater than the range for Task B. Is the range a useful summary statistic for comparing the spread of the two distributions? To help make this decision, consider the stem-and-leaf plots of the data sets:

[Stem-and-leaf plots of the marks for Task A and Task B]

From the stem-and-leaf plots of the data it appears that the spread of marks for the two tasks is not well described by the range. The marks for Task A are more concentrated than the marks for Task B, except for the two unusual values for Task A. Another measure of spread is needed, one which is not so influenced by these extreme values. For this the interquartile range is used.

Interquartile range

To find the interquartile range of a distribution:
Arrange all observations in order according to size.
Divide the observations into two equal-sized groups. If n, the number of observations, is odd, then the median is omitted from both groups.
Locate $Q_1$, the first quartile, which is the median of the lower half of the observations, and $Q_3$, the third quartile, which is the median of the upper half of the observations.
The interquartile range IQR is defined as the difference between the quartiles. That is, IQR = $Q_3 - Q_1$.
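The procedure for the quartiles can be sketched in Python as follows. The function names and the two small data sets are assumptions made for illustration. Note that library routines (for example `statistics.quantiles`) use slightly different quartile definitions and may give slightly different values from this hand method.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When n is odd, the median itself is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two observations are needed")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(upper)


def interquartile_range(values):
    """IQR = Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1


# Hypothetical data, not taken from the examples in the text.
even_data = [1, 2, 3, 4, 5, 6, 7, 8]
odd_data = [1, 2, 3, 4, 5, 6, 7]
print(quartiles(even_data), interquartile_range(even_data))
print(quartiles(odd_data), interquartile_range(odd_data))
```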

Definitions of the quartiles of a distribution sometimes differ slightly from the one given here. Using different definitions may result in slight differences in the values obtained, but these will be minimal and should not be considered a difficulty.

Example 12

Find the interquartile ranges for Task A and Task B data given in Example 11.

Solution

For Task A, the marks are first listed in order. Since there is an even number of observations, the ordered list divides exactly into a lower half and an upper half.
The median of the lower half is its eighth observation, 22, so $Q_1 = 22$.
The median of the upper half is 47, so $Q_3 = 47$.
Thus, the interquartile range, IQR = 47 − 22 = 25.
Similarly, for Task B data, the lower quartile = 31 and the upper quartile = 73, giving an interquartile range for this data set of 42.
Comparing the two values of interquartile range shows the spread of Task A marks to be much smaller than the spread of Task B marks, which seems consistent with the display.

The interquartile range is a measure of spread of a distribution which describes the range of the middle 50% of the observations. Since the upper 25% and the lower 25% of the observations are discarded, the interquartile range is generally not affected by the presence of outliers in the data set, which makes it a reliable measure of spread.

The median and quartiles of a distribution may also be determined from a cumulative relative frequency polygon. Since the median is the observation which divides the data set in half, this is the data value which corresponds to a cumulative relative frequency of 0.5 or 50%. Similarly, the first quartile corresponds to a cumulative relative frequency of 0.25 or 25%, and the third quartile corresponds to a cumulative relative frequency of 0.75 or 75%.

Example 13

Use the cumulative relative frequency polygon to find the median and the interquartile range for the data set shown in the graph.

[Cumulative relative frequency polygon; vertical axis in %]

Solution

From the plot of the data it can be seen that the median is 10, the first quartile is 8, the third quartile is 12 and hence the interquartile range is 12 − 8 = 4.

Standard deviation

Another extremely useful measure of spread is the standard deviation. It is derived by considering the distance of each observation from the sample mean, $x_i - \bar{x}$. If the average of these distances is used as a measure of spread it will be found that, as some of these distances are positive and some are negative, adding them together results in a total of zero. A more useful measure will result if the distances are squared (which makes them all positive) and are then added together. The variance is defined as a kind of average of these squared distances. When the variance is calculated from a sample, rather than the whole population, the average is calculated by dividing by n − 1, rather than n. For the remainder of this discussion it will be assumed that the data under consideration are from a sample.

Since the variance has been calculated from squared distances, it is sensible to find the square root of the variance, so that the measure reverts to a scale comparable to the original data. This results in a measure of spread which is called the standard deviation. Standard deviation calculated from a sample is denoted s.

Formally the standard deviation may be defined as follows. If a data set consists of n observations denoted $x_1, x_2, \ldots, x_n$, the standard deviation is

$$s = \sqrt{\frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]}$$

or, in more compact notation,

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
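The definition of s translates directly into code. The sketch below is one possible reading of the formula in Python; the function name and the data set are assumptions made for illustration. It also checks the result against `statistics.stdev` from the Python standard library, which likewise divides by n − 1.

```python
import math
import statistics


def sample_standard_deviation(values):
    """Sample standard deviation: square root of the sum of squared
    distances from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    variance = sum_of_squares / (n - 1)   # sample variance
    return math.sqrt(variance)


# Hypothetical data set, not taken from the text.
data = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_standard_deviation(data)
print(round(s, 2))
print(math.isclose(s, statistics.stdev(data)))
```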

Example 14

Calculate the standard deviation of the following data set.

Solution

Construct a table with columns $x_i$, $x_i - \bar{x}$ and $(x_i - \bar{x})^2$. The column totals give $\sum x_i = 100$ and $\sum (x_i - \bar{x})^2$.
From the table, the standard deviation s is:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{9}} = 3.53$$

Interpreting the standard deviation

The standard deviation can be made more meaningful by interpreting it in relation to the data set. The interquartile range gives the spread of the middle 50% of the data. Can similar statements be made about the standard deviation? It can be shown that, for most data sets, about 95% of the observations lie within two standard deviations of the mean.

Example 15

The cost of a lettuce at a number of different shops on a particular day is given in the table:

$3.85 $2.65 $1.90 $2.95 $2.40 $2.42 $2.63
$3.20 $4.20 $2.33 $0.85 $3.81 $1.69 $3.66
$2.60 $2.70 $3.10 $2.80 $1.80 $2.88 $1.40

Calculate the mean cost, the standard deviation and the interval equivalent to two standard deviations above and below the mean.

Solution

The mean cost is $2.66 and the standard deviation is $0.84. The interval equivalent to two standard deviations above and below the mean is:
[2.66 − 2 × 0.84, 2.66 + 2 × 0.84] = [0.98, 4.34].
In this case, 20 of the 21 observations, or 95% of observations, have values within the interval calculated.

Example 16

The prices of forty secondhand motorbikes listed in a newspaper are as follows:

$5442 $5439 $2523 $2358 $2363 $2244 $1963 $2142 $2220 $1356 $738 $656 $715 $1000 $1214 $1788 $3457 $4689 $8218 $ $ $ $8770 $8450 $6469 $7148 $ $ $ $ $ $9878 $5294 $3847 $4219 $4786 $2280 $3019 $7645 $8079

Determine the interval equivalent to two standard deviations above and below the mean.

Solution

The mean price is $5729 and the standard deviation is $4233 (to the nearest whole dollar). The interval equivalent to two standard deviations above and below the mean is:
[5729 − 2 × 4233, 5729 + 2 × 4233] = [−2737, 14195].
The negative value does not give a sensible solution and should be replaced by 0. 38 of the 40 observations, or 95% of observations, have values within the interval.

The exact percentage of observations which lie within two standard deviations of the mean varies from data set to data set, but in general it will be around 95%, particularly for symmetric data sets.

It was noted earlier that even a single outlier can have a very marked effect on the value of the mean of a data set, while leaving the median largely unchanged. The same is true when the effect of an outlier on the standard deviation is considered, in comparison to the interquartile range. The median and interquartile range are called resistant measures, while the mean and standard deviation are not resistant measures. When considering a data set it is necessary to do more than just compute the mean and standard deviation. First it is necessary to examine the data, using a histogram or stem-and-leaf plot, to determine which set of summary statistics is more suitable.
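The calculation in Example 15 can be checked with a short Python sketch using the lettuce prices listed in that example. The variable names are assumptions made for illustration, and the interval is computed from the unrounded mean and standard deviation, so its end points may differ from the hand calculation in the last decimal place.

```python
import statistics

# Lettuce prices (in dollars) from Example 15.
prices = [
    3.85, 2.65, 1.90, 2.95, 2.40, 2.42, 2.63,
    3.20, 4.20, 2.33, 0.85, 3.81, 1.69, 3.66,
    2.60, 2.70, 3.10, 2.80, 1.80, 2.88, 1.40,
]

mean = statistics.mean(prices)
s = statistics.stdev(prices)          # sample standard deviation (divides by n - 1)

# Interval two standard deviations either side of the mean.
lower = mean - 2 * s
upper = mean + 2 * s

inside = [p for p in prices if lower <= p <= upper]
percentage_inside = 100 * len(inside) / len(prices)

print(f"mean = {mean:.2f}, s = {s:.2f}")
print(f"interval = [{lower:.2f}, {upper:.2f}]")
print(f"{len(inside)} of {len(prices)} observations ({percentage_inside:.0f}%) lie inside")
```

Only the lowest price, $0.85, falls outside the interval, which agrees with the count of 20 out of 21 given in the solution.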

Using the TI-Nspire

The calculator can be used to calculate the values of all of the summary statistics in this section. Consider the data from Example 16.

The data are most easily entered in a Lists & Spreadsheet application (3). Firstly, use the up/down arrows to name the first column bike. Then enter each of the 40 numbers as shown.

Open a Calculator application (1) to calculate the summary statistics. Select the One-Variable Statistics command from the Stat Calculations submenu of the Statistics menu (b 6 11), specify in the dialog box that there is only one list, and then complete the final dialog box as shown. Press enter to calculate the values of the summary statistics. Use the up arrow to view the rest of the summary statistics.

The calculator can also be used to determine the summary statistics when the data are given in a frequency table.

[Frequency table with columns x and Frequency]

The data are most easily entered in a Lists & Spreadsheet application (3). Firstly, use the up/down arrows to name the first column x and the second column freq. Then enter the data as shown.

Open a Calculator application (1) to calculate the summary statistics. Select the One-Variable Statistics command from the Stat Calculations submenu of the Statistics menu (b 6 11), specify in the dialog box that there is only one list, and then complete the final dialog box as shown. Press enter to calculate the values of the summary statistics.

Using the Casio ClassPad

Consider the following heights in cm of a group of eight women.

176, 160, 163, 157, 168, 172, 173, 169

Enter the data into list1 in the Statistics module. Tap Calc, One-Variable and when prompted ensure that the XList is set to list1 and the Freq = 1 (since each score is entered individually). The calculator returns the results as shown and all univariate statistics can be viewed by using the scroll bar. Note that the standard deviation is given by $x\sigma_{n-1}$.

Where data are grouped, the scores are entered in list1 and the frequencies in list2. In this case, in Set Calculation use the drop-down arrow to select list2 as the location for the frequencies.
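For readers without a calculator at hand, the same one-variable statistics for the eight heights can be obtained with Python's standard `statistics` module. This is an alternative offered as a sketch, not part of the calculator procedure; the variable names are assumptions made for illustration.

```python
import statistics

# Heights in cm of the group of eight women.
heights = [176, 160, 163, 157, 168, 172, 173, 169]

mean_height = statistics.mean(heights)
median_height = statistics.median(heights)
sample_sd = statistics.stdev(heights)        # divides by n - 1
population_sd = statistics.pstdev(heights)   # divides by n

print(f"n = {len(heights)}")
print(f"mean = {mean_height}")
print(f"median = {median_height}")
print(f"sample standard deviation = {sample_sd:.2f}")
print(f"population standard deviation = {population_sd:.2f}")
```

The sample standard deviation is the one used throughout this section; the population value is printed only for comparison, since calculators typically report both.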

Exercise 22F

1 Find the mean and the median of the following data sets. (Examples 9, 10)
a
b
c
d

2 Find the mean and the median of the following data sets.
a x Frequency
b x Frequency

3 The prices, in dollars, of houses sold in a particular suburb during a one-week period are given in the following list.
$ $ $ $ $ $ $ $ $ $ $ $ $ $
Find the mean and the median of the prices. Which do you think is a better measure of centre of the data set? Explain your answer.

4 Concerned with the level of absence from his classes, a teacher decided to investigate the number of days each student had been absent from the classes for the year to date. These are his results.
No. of days missed
No. of students
Find the mean and the median number of days each student had been absent so far that year. Which is the better measure of centre in this case?

5 Find the range and the interquartile range for each of the following data sets. (Examples 11, 12)
a
b
c
d

6 The serum cholesterol levels for a sample of twenty people are:
a Find the range of the serum cholesterol levels.
b Find the interquartile range of the serum cholesterol levels.

7 Twenty babies were born at a local hospital on one weekend. Their birth weights, in kg, are given in the stem-and-leaf plot below.
[Stem-and-leaf plot of birth weights; the key shows the stem and leaf that represent 3.6 kg]
a Find the range of the birth weights.
b Find the interquartile range of the birth weights.

8 Find the standard deviation for the following data sets. (Example 14)
a
b $2.52 $4.38 $3.60 $2.30 $3.45 $5.40 $4.43 $2.27 $4.50 $4.32 $5.65 $6.89 $1.98 $4.60 $5.12 $3.79 $4.99 $3.02
c
d

9 For each of the following data sets (Example 15)
a calculate the mean and the standard deviation
b determine the percentage of observations falling within two standard deviations of the mean.
i
ii

10 A group of university students was asked to write down their ages with the following results. (Example 13)
a Construct a cumulative relative frequency polygon and use it to find the median and the interquartile range of this data set.
b Find the mean and standard deviation of the ages.
c Find the percentage of students whose ages fall within two standard deviations of the mean.

11 The results of a student's chemistry experiment are as follows.
a i Find the mean and the median of the results.
ii Find the interquartile range and the standard deviation of the results.
b Unfortunately when the student was transcribing his results into his chemistry book he made a small error, and wrote:
i Find the mean and the median of these results.
ii Find the interquartile range and the standard deviation of these results.
c Describe the effect the error had on the summary statistics calculated in parts a and b.

12 A selection of shares traded on the stock exchange had a mean price of $50 with a standard deviation of $3. Determine an interval which would include approximately 95% of the share prices.

13 A store manager determined the store's mean daily receipts as $550, with a standard deviation of $200. On what proportion of days were the daily receipts between $150 and $950?

22.7 The boxplot

Knowing the median and quartiles of a distribution means that quite a lot is known about the central region of the data set. If something is known about the tails of the distribution then a good picture of the whole data set can be obtained. This can be achieved by knowing the maximum and minimum values of the data.

These five important statistics can be derived from a data set: the median, the two quartiles and the two extremes. These values are called the five-figure summary and can be used to provide a succinct pictorial representation of a data set called the box and whisker plot, or boxplot.

For this visual display, a box is drawn with the ends at the first and third quartiles. Lines are drawn which join the ends of the box to the minimum and maximum observations. The median is indicated by a vertical line in the box.

Example 17

Draw a boxplot to show the number of hours spent on a project by individual students in a particular school.

Solution

First arrange the data in order. From this ordered list prepare the five-figure summary.
median, m = 71
first quartile, $Q_1$ = 25.5
third quartile, $Q_3$ = 109.5
minimum = 2
maximum = 264
The boxplot can then be drawn.

[Boxplot with min = 2, $Q_1$ = 25.5, m = 71, $Q_3$ = 109.5 and max = 264 marked]

In general, to draw a boxplot:
Arrange all the observations in order, according to size.
Determine the minimum value, the first quartile, the median, the third quartile, and the maximum value for the data set.
Draw a horizontal box with the ends at the first and third quartiles. The height of the box is not important.
Join the minimum value to the lower end of the box with a horizontal line.
Join the maximum value to the upper end of the box with a horizontal line.
Indicate the location of the median with a vertical line.

Using a graphics calculator

A graphics calculator can be used to construct a boxplot. Consider the data from Example 17. Enter the data into a list named HOURS. To draw the boxplot press 2ND STAT PLOT and select and turn on Plot1, as previously described.

Press the down arrow key and select from the Type menu the boxplot icon as shown, then press ENTER. Use the LIST menu to paste HOURS as the Xlist. Your calculator screen should appear like this.

To bring up the boxplot, press ZOOM and then 9:ZoomStat. Your calculator screen should now look like this.

To find out values for the five-figure summary, select TRACE.

The symmetry of a data set can be determined from a boxplot. If a data set is symmetric, then the median will be located approximately in the centre of the box, and the tails will be of similar length. This is illustrated in the following diagram, which shows the same data set displayed as a histogram and a boxplot.

[Histogram and boxplot of a symmetric data set]

A median placed towards the left of the box, and/or a long tail to the right, indicates a positively skewed distribution, as shown in this plot.

[Histogram and boxplot of a positively skewed data set]

A median placed towards the right of the box, and/or a long tail to the left, indicates a negatively skewed distribution, as illustrated here.

[Histogram and boxplot of a negatively skewed data set]

A more sophisticated version of a boxplot can be drawn with the outliers in the data set identified. This is very informative, as one cannot tell from the previous boxplot if an extremely long tail is caused by many observations in that region or just one.

Before drawing this boxplot the outliers in the data set must be identified. The term outlier is used to indicate an observation which is rather different from other observations. Sometimes it is difficult to decide whether or not an observation should be designated as an outlier. The interquartile range can be used to give a very useful definition of an outlier.

An outlier is any number which is more than 1.5 interquartile ranges above the upper quartile, or more than 1.5 interquartile ranges below the lower quartile.

When drawing a boxplot, any observation identified as an outlier is indicated by an asterisk, and the whiskers are joined to the smallest and largest values which are not outliers.

Example 18

Use the data from Example 17 to draw a boxplot with outliers.

Solution

median = 71
interquartile range = $Q_3 - Q_1$ = 109.5 − 25.5 = 84
An outlier will be any observation which is less than 25.5 − 1.5 × 84 = −100.5, which is impossible, or greater than 109.5 + 1.5 × 84 = 235.5. From the data it can be seen that there is only one observation greater than this, 264, which would be denoted with an asterisk. The upper whisker is now drawn from the edge of the box to the largest observation less than 235.5, which is 226.

[Boxplot of the hours data with the outlier 264 marked by an asterisk]
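The outlier rule used in Example 18 can be written as a small Python sketch. The function names are assumptions made for illustration. The quartile values are those of the hours data: $Q_1 = 25.5$, and $Q_3 = 109.5$, which follows from the interquartile range of 84 and the upper limit of 235.5.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper limits beyond which an observation
    counts as an outlier under the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """True if the value lies more than 1.5 IQRs outside the quartiles."""
    lower, upper = outlier_fences(q1, q3)
    return value < lower or value > upper


# Quartiles for the hours data of Examples 17 and 18.
q1, q3 = 25.5, 109.5
lower, upper = outlier_fences(q1, q3)
print(lower, upper)

# 264 is the maximum of the data set; 226 is the largest value below the upper limit.
print(is_outlier(264, q1, q3))
print(is_outlier(226, q1, q3))
```

Since 226 is not an outlier and 264 is, the upper whisker stops at 226 and 264 is plotted separately.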


Using the TI-Nspire

The calculator can be used to construct a boxplot. Consider the data from Example 17.

The data are most easily entered in a Lists & Spreadsheet application (3). Firstly, use the up/down arrows to name the first column hours. Then enter each of the 33 numbers as shown.

Open a Data & Statistics application (5) to graph the data. At first the data displays as shown.

Specify the x variable by selecting Add X Variable from the Plot Properties (b 2 4) and selecting hours. The data now displays as shown. (Note: It is also possible to use the NavPad to move down below the x-axis and click to add the x variable.)

Select Box Plot from the Plot Type menu (b 12). The data now displays as shown. Notice how the calculator, by default, shows any outlier(s).

