Chapter 4: Displaying and Summarizing Quantitative Data

This chapter discusses methods of displaying quantitative data. The objective is to describe the distribution of the data. The figure below shows three idealized distributions spanning a range of values from 0 to 10. A few key features of a distribution convey nearly all of the information contained in the data. They are:

1. The center and the most common values (modes).
2. The spread of the distribution.
3. The shape. Shapes are symmetric (black), skewed right (red), and skewed left (blue).
4. The number of modes. The second figure below shows a distribution with two modes.
5. Deviations from the overall shape, and outliers. Outliers are values that are far away from the majority of values.

[Figure: three idealized distributions over the range 0 to 10: symmetric (black), skewed right (red), and skewed left (blue).]

[Figure: a distribution with two modes over the range 0 to 10.]

Stemplot: A stemplot is a plot that shows the distribution and includes the numerical values on the plot. The method for drawing a stemplot is illustrated with the following example.

Example: How effective is antibacterial soap? To investigate this question, data were collected on the number of bacterial colonies present on previously sterile media plates two days after placing a hand washed with water (group 1) or with antibacterial soap (group 2) on the plates. The objective is to compare the two distributions of bacterial colonies. Specifically, the aim is to determine whether the antibacterial soap plates have fewer colonies than the water-only plates. More generally, the objective is to characterize the distributions with respect to center, spread, and shape.

The data on number of colonies (listed in increasing order within each group) are

Group 1 (washed with water): 77, 87, 88, 89, 89, 90, 96, 97, 98, 98, 103, 105, 105, 108, 111, 118, 125, 127, 134, 149, 189

Group 2 (washed with soap): 40, 41, 59, 63, 67, 67, 72, 73, 82, 94, 97, 99, 101, 104, 108, 111, 112, 114, 114, 116, 121

To make a stemplot, the rightmost digit of a number becomes a leaf, and the remaining digits become a stem. Each leaf is written to the right of its stem. The stemplot constructed from the water-only plates is shown here.

     7 | 7
     8 | 7 8 9 9
     9 | 0 6 7 8 8
    10 | 3 5 5 8
    11 | 1 8
    12 | 5 7
    13 | 4
    14 | 9
    15 |
    16 |
    17 |
    18 | 9

The distribution of the number of colonies is skewed to the right and is centered around 105 colonies. There is one large outlier (189). Most of the distribution is between 87 and 127 colonies.

Notes:
- Decimal points are omitted from the stemplot.
- The spacing between values remains the same along a row of the stemplot.

The advantage of a stemplot is that it is simple and quick to construct, which makes it useful for hand-drawn distributional displays. Also, the individual data values are retained. Its disadvantage is that it is awkward or impossible to use with large data sets.

A visual comparison of the bacterial counts for the antibacterial soap and the water-only plates is obtained from a back-to-back stemplot. The back-to-back stemplot uses a common stem with leaves branching off in opposite directions, as shown below.

    Water only |    | Antibacterial soap
               |  4 | 0 1
               |  5 | 9
               |  6 | 3 7 7
             7 |  7 | 2 3
       9 9 8 7 |  8 | 2
     8 8 7 6 0 |  9 | 4 7 9
       8 5 5 3 | 10 | 1 4 8
           8 1 | 11 | 1 2 4 4 6
           7 5 | 12 | 1
             4 | 13 |
             9 | 14 |
               | 15 |
               | 16 |
               | 17 |
             9 | 18 |

The back-to-back stemplot reveals that the water-only distribution is shifted towards larger values and is right-skewed. The antibacterial distribution is not skewed.

Another simple method appropriate only for small data sets is the dotplot. It shows individual data values along with an identifier of the observational unit. The dotplot below shows annual precipitation for 70 large U.S. cities.

[Figure: dotplot of annual precipitation for 70 large U.S. cities, from Mobile (wettest) to Phoenix (driest). Horizontal axis: inches per year.]
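The stem-and-leaf rule described above (rightmost digit becomes the leaf, remaining digits become the stem) can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function names `stemplot` and `format_stemplot` are made up here, and the sketch assumes non-negative integer data such as the water-only colony counts.

```python
from collections import defaultdict

# Colony counts for the water-only plates (group 1) from the soap example.
water = [98, 118, 89, 97, 127, 87, 103, 111, 98, 108, 189, 149, 125, 105,
         88, 96, 77, 105, 90, 134, 89]


def stemplot(values):
    """Return stemplot rows as (stem, leaves) pairs, one row per stem.

    Assumes non-negative integers (an assumption of this sketch).
    """
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # rightmost digit is the leaf
        leaves_by_stem[stem].append(leaf)
    lo, hi = min(leaves_by_stem), max(leaves_by_stem)
    # Keep empty stems so that gaps in the distribution remain visible.
    return [(stem, leaves_by_stem.get(stem, [])) for stem in range(lo, hi + 1)]


def format_stemplot(values):
    """Render the rows as plain text, one stem per line."""
    lines = []
    for stem, leaves in stemplot(values):
        lines.append(f"{stem:>3} | {' '.join(map(str, leaves))}".rstrip())
    return "\n".join(lines)


print(format_stemplot(water))
```

Keeping the empty stems (15 through 17 for these data) is a deliberate choice: the gap is what makes the outlier at 189 stand out.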

A histogram is a plot that breaks the range of the data (smallest to largest) into intervals and displays the frequency (or relative frequency) of the observations that fall into each interval. Histograms are the usual approach to displaying the distribution of large data sets.

Example: The annual precipitation data are summarized in a table and a histogram (below). The table and histogram show the number of cities with annual precipitation means falling in each of seven intervals.

    Interval (inches)   Frequency   Relative frequency
    0-10                    4            0.06
    10-20                   9            0.13
    20-30                   5            0.07
    30-40                  25            0.36
    40-50                  21            0.30
    50-60                   5            0.07
    60-70                   1            0.01
    Total                  70            1.000

A histogram is constructed by forming intervals (categories, actually) of 0-10 inches, 10-20 inches, and so on, and counting the number of values that belong to each interval. The choice of intervals is somewhat subjective, and some trial and error may be necessary to produce a good histogram. Statistical software is usually good at choosing appropriate intervals.

The histogram reveals that the distribution of precipitation for the 70 cities is roughly unimodal and centered near 36 inches. Most of the values are between 10 and 60. (The dotplot reveals that 10 to 53 is a better characterization of the spread of the data.)

[Figure: histogram of annual precipitation for the 70 cities. Horizontal axis: inches per year; vertical axis: frequency.]

Remarks:
- The interval widths are equal. This is crucial for obtaining a histogram that accurately reflects the distribution.
- Intervals and bars are contiguous.

- Relative frequency (or percentage or proportion) could be used for the bar heights instead of the frequency.
- The primary advantage of the histogram over the stemplot is that arbitrarily large data sets can be displayed using a histogram.
- The primary disadvantage of the histogram is that it does not retain the actual data values in the plot.
- Every graph (not just histograms) must have a label for the horizontal axis, and almost always for the vertical axis. When examining a graph, the first place you should look is at the axis labels.

Distributional shapes: The important characteristics of shape are

1. Mode: The term mode is synonymous with peak. A distribution with one peak is said to be unimodal; a distribution with two peaks is bimodal; and a distribution with more than two peaks is called multimodal. If a distribution does not have a distinct mode, the distribution is said to be uniform, or approximately uniform.

2. Skewness: Skew describes the relative length of the tails of the distribution. A unimodal distribution will either be symmetric, have a longer tail toward larger values and be called skewed to the right, or have a longer tail toward smaller values and be called skewed to the left. The first figure in this chapter illustrates skew.

3. Unusual features: Unusual features of a distribution often tell something interesting about the data (and the population from which they were sampled). Unusual features are principally outliers [1] (values unusually distant from the bulk of the distribution).

Numerical summaries of quantitative data

There are two primary features of the distribution of a quantitative variable: the center and the spread. Statistics used to describe center, spread, and shape are tabled below.

    Feature   Measures
    Center    Median, mean, midrange, mode
    Spread    Interquartile range (IQR), standard deviation, range
    Shape     5-number summary

[1] A rule for identifying outliers will be introduced later.
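As a small illustration of the remark that relative frequency can replace frequency as the bar height, the sketch below converts the frequencies from the precipitation table into relative frequencies. It also shows one way to count values into equal-width intervals. The helper names and the convention that each interval includes its left endpoint but not its right are assumptions of this sketch; the notes do not say how boundary values are assigned.

```python
# Frequencies from the annual precipitation table (seven 10-inch intervals).
intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [4, 9, 5, 25, 21, 5, 1]


def relative_frequencies(counts):
    """Divide each count by the total so the bar heights sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def bin_counts(values, start, width, n_bins):
    """Count values in equal-width intervals [start + k*width, start + (k+1)*width).

    Left-closed, right-open intervals are an assumption of this sketch.
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Print the table with a crude text bar for each interval.
for (lo, hi), f, r in zip(intervals, frequencies, relative_frequencies(frequencies)):
    print(f"{lo:>2}-{hi:<2} {f:>3} {r:5.2f} {'#' * f}")
```

Because every count is divided by the same total, the relative-frequency histogram has exactly the same shape as the frequency histogram; only the vertical scale changes.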

Measures of center

1. The median (M) of a distribution is a value that divides the ordered data values into two sets equal in number. To find the median:

   (a) Order the data from smallest to largest.

   (b) Determine the number of data values (usually called the sample size and denoted by n). Note whether n is even or odd.

       i. If n is odd, then the median is the value at the middle point in the ordered list; specifically, the median is the value at position (n + 1)/2.

       ii. If n is even, the median is between the two middle values of the ordered list. These are located at positions n/2 and n/2 + 1. The convention is to take the average of these two middle values.

   Example: Return to the annual precipitation data.

   (a) The sample size is n = 70, and so the median is the average of the 35th and 36th smallest values. They are 36.2 and 37.0 (Pittsburg and Kansas City). The average of these values is the median; hence $M = (36.2 + 37.0)/2 = 36.6$ inches.

   (b) Suppose that we include Missoula in the data set. The average annual precipitation for Missoula is 13.7 inches. What is the median of this set of data? [2]

2. The mean of a set of data is the average. Let $y_1, y_2, \ldots, y_n$ denote n data values. The mean is

   $$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}(y_1 + y_2 + \cdots + y_n) = \frac{y_1 + \cdots + y_n}{n}.$$

   The symbol $\bar{y}$ often is referred to as y-bar. The notation $\frac{1}{n}\sum y$ sometimes is used for the mean. The mean of the precipitation data is

   $$\bar{y} = \frac{7.0 + 7.2 + \cdots + 59.8 + 67.0}{70} = 34.9 \text{ inches}.$$

   With Missoula included, the mean is 34.6 inches.

[2] After including Missoula, n = 71, so the (n + 1)/2 = 36th smallest ordered value is the median. Pittsburg was 35th; now it's 36th, and so the new median is 36.2 inches.

3. The mode of a data set is the value that occurs most frequently. If several values tie for most frequent, all of these values are modes of the distribution. The mode generally is not an appropriate measure of center. To illustrate, I've rounded the precipitation data to the nearest integer and used these data to construct the histogram below. The modes are 36 and 43 inches. The characterization of the center by these modes is not particularly informative. The histogram is also uninformative, as the intervals are too narrow to adequately portray the shape of the data.

[Figure: histogram of annual precipitation rounded to the nearest inch. Horizontal axis: annual precipitation (inches); vertical axis: frequency.]

A statistic is resistant if it is not substantially affected by changes in the numerical values of a small proportion of the observations. Outliers and long tails often substantially affect a statistic that is not resistant. A statistic that is not resistant to some distributional feature is sensitive to that feature.

Sensitivity to outliers

The median is resistant to the effects of outliers, whereas the mean is sensitive. To illustrate, consider the effect of accidentally recording 670 inches for Mobile instead of 67.0. The mean would have been computed as 43.5 inches, but the median (36.6 inches) would not be different.

The effect of skew also differs between the mean and the median. Specifically, the mean is shifted toward a long tail compared to the median. (The median resists being shifted toward a long tail.)

    Symmetric:     $M = \bar{y}$
    Skewed right:  $M < \bar{y}$
    Skewed left:   $\bar{y} < M$
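The odd/even rule for the median and the contrast between a resistant and a sensitive statistic can be checked with a short Python sketch. The raw precipitation values are not listed in these notes, so the sketch uses the water-only colony counts from the soap example instead. The function names and the misrecorded value 1890 are hypothetical, introduced only to mimic the Mobile recording error described above.

```python
# Water-only colony counts (group 1) from the soap example.
water = [98, 118, 89, 97, 127, 87, 103, 111, 98, 108, 189, 149, 125, 105,
         88, 96, 77, 105, 90, 134, 89]


def median(values):
    """Median by the ordered-list rule: middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n/2 and n/2 + 1


def mean(values):
    """Arithmetic average: the sum of the values divided by n."""
    return sum(values) / len(values)


# Hypothetical recording error: 189 entered as 1890.
corrupted = [1890 if v == 189 else v for v in water]

print("original :", "mean =", round(mean(water), 1), "median =", median(water))
print("corrupted:", "mean =", round(mean(corrupted), 1), "median =", median(corrupted))
```

Only one of the 21 values changes, yet the mean moves a great deal while the median stays where it was. For the original data the median is also smaller than the mean, which agrees with the right skew seen in the stemplot.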

Percentiles and quartiles

The pth percentile of a distribution is the value such that p% of the data values fall below it. If your SAT math percentile was 80%, then your score was larger than 80% of all scores (and smaller than 20%). The quartiles are the 25th, 50th and 75th percentiles, and so they divide the data set into four sets of equal size. The notation is

    Q1 = 25th percentile = 1st quartile
    M  = 50th percentile = 2nd quartile, or the median
    Q3 = 75th percentile = 3rd quartile

Measures of spread

1. The range of the data is the difference between the maximum and minimum values: Range = Max - Min. The range is too sensitive to outliers to be of much use.

2. The interquartile range is the distance between the 25th and 75th percentiles; hence, IQR = Q3 - Q1.

   Remarks:
   - The IQR measures the spread of the middle 50% of the data.
   - The IQR is a single number, not the pair Q1 and Q3. It is the width of the interval between Q1 and Q3.
   - There are several algorithms for finding the quartiles, all of which find the median, use it to divide the data into lower and upper halves, and find the median of each half; these medians are the quartiles. De Veaux et al. recommend this algorithm: If n is even [3], include the (n/2)th smallest value in the lower half and the (n/2 + 1)th smallest value in the upper half. If n is odd [4], include the median (the ((n + 1)/2)th smallest value) in both the lower and upper halves of the data.

[3] Then the median is the average of the two middle values.
[4] Then the median is the single middle value.

To illustrate, consider the following data on mean temperature, by month, for Missoula (units are degrees Fahrenheit).

    Month   J     F     M     A     M     J     J     A     S     O     N     D
    Temp    22.7  29.2  35.8  44.2  51.8  60.0  66.8  65.8  55.7  44.2  32.4  23.4

The ordered values are

    Order   1     2     3     4     5     6     7     8     9     10    11    12
    Temp    22.7  23.4  29.2  32.4  35.8  44.2  44.2  51.8  55.7  60.0  65.8  66.8

Then,

    Q1 = (29.2 + 32.4)/2 = 30.8 degrees
    Q3 = (55.7 + 60.0)/2 = 57.85 degrees

and

    IQR = 57.85 - 30.8 = 27.05 degrees

The monthly temperature quartiles for San Francisco are Q1 = 53.0, Q2 = 56.85 and Q3 = 61.8. The following table compares Missoula, San Francisco, and one other city.

Table 1: Measures of center and spread for Missoula, San Francisco, and one other city (degrees Fahrenheit).

    City            M      IQR
    Missoula        44.2   27.05
    San Francisco   56.8   8.8
    ?               40.5   29.0

The figure below is a time plot of the mean monthly temperature against month. From this figure, the constancy of temperatures in San Francisco becomes strikingly obvious. It is also apparent that the mystery city is generally colder than Missoula and that the greatest differences in mean monthly temperatures occur in the summer and winter.

[Figure: time plot of mean monthly temperature (degrees F) against month for Missoula, San Francisco, and the mystery city.]
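The split-into-halves rule for quartiles can be written out as a short Python sketch and applied to the Missoula monthly temperatures. The function names are illustrative only. Other quartile algorithms (including those used by default in statistical software) can give slightly different values, so this sketch should be read as one implementation of the rule stated in these notes, not as the definitive method.

```python
# Missoula mean monthly temperatures (degrees F), January through December.
missoula = [22.7, 29.2, 35.8, 44.2, 51.8, 60.0, 66.8, 65.8, 55.7, 44.2, 32.4, 23.4]


def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) as the medians of the lower half, all data, and upper half.

    If n is odd, the median is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 0:
        lower, upper = ordered[:half], ordered[half:]
    else:
        lower, upper = ordered[:half + 1], ordered[half:]
    return median(lower), median(ordered), median(upper)


q1, m, q3 = quartiles(missoula)
print("Q1 =", round(q1, 2), " M =", round(m, 2), " Q3 =", round(q3, 2),
      " IQR =", round(q3 - q1, 2))
```

With n = 12 the lower half holds the six smallest values, so Q1 is the average of the 3rd and 4th ordered values and Q3 is the average of the 9th and 10th.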

3. The standard deviation is the most commonly used numerical summary of distributional spread. It is (roughly) the average distance between the mean $\bar{y}$ and the data values.

   Recall: The values in a data set are denoted $y_1, y_2, \ldots, y_n$ for a sample size of n.

   The standard deviation is computed from the deviations of the observed values from the mean, namely $y_1 - \bar{y}, y_2 - \bar{y}, \ldots, y_n - \bar{y}$. Since the deviations sum to 0, the average deviation is not a measure of spread. To rectify this problem, the standard deviation is computed from the squared deviations (which are all greater than or equal to 0). The squared deviations are summed and divided by n - 1. The final operation computes the square root. A formula for the standard deviation is

   $$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}.$$

   This is roughly the average distance of the data values from the mean, which is a logical measure of spread. The term roughly is used because n - 1 is the denominator rather than n. Taking the square root puts the result back in the original units of measure.

   Squaring the standard deviation gives the variance. The relationship between the two is summarized by

   $$\text{var} = s^2, \qquad s = \sqrt{\text{var}}.$$

   A related, alternative measure is the mean absolute deviation about the median:

   $$\frac{1}{n}\sum |y_i - M|.$$

Example: To illustrate, the standard deviation of the monthly temperature averages from Vostok, Antarctica [5] (elevation: 11,220 feet; latitude: 78°27'S; longitude: 106°52'E) is computed. First, the annual mean temperature is $\bar{y} = -68.6$ degrees F. The three columns of the table show the intermediate steps; the formula below the table shows the last stages of the calculation.

    y_i     y_i - ybar   (y_i - ybar)^2
    -31        37.6         1412.5
    -48        20.6          423.7
    -67         1.6            2.5
    -81       -12.4          154.2
    -82       -13.4          180.0
    -89       -20.4          416.8
    -89       -20.4          416.8
    -95       -26.4          697.8
    -90       -21.4          458.7
    -75        -6.4           41.2
    -48        20.6          423.7
    -28        40.6         1647.0
    Total       0           6274.9

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} = \sqrt{\frac{6274.9}{11}} = \sqrt{570.4} = 23.9 \text{ degrees F}$$

The value s = 23.9 degrees F is interpreted as (roughly) the average distance between the annual mean temperature ($\bar{y} = -68.6$) and the monthly mean temperatures. For comparison, the standard deviation of the Missoula temperatures was s = 16.1 degrees F. There's considerably greater variability in monthly mean temperatures at Vostok compared to Missoula.

The 5-number summary numerically summarizes the shape of a distribution. It is (Min, Q1, M, Q3, Max). To compare the three cities more closely, their 5-number summaries are:

    City            Min    Q1     M      Q3      Max
    ?               16     26.5   40.5   55.5    63
    Missoula        22.7   30.8   44.2   57.85   67
    San Francisco   48.7   53.0   56.8   61.8    67.4

The mystery city does appear colder than Missoula throughout the year.

[5] The lowest recorded temperature in the station's records was -127 degrees F. For comparison, the freezing point of CO2 is -108.4 degrees F.
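The steps of the Vostok calculation (deviations, squared deviations, division by n - 1, square root) can be traced in a short Python sketch. The function name `standard_deviation` is illustrative; `statistics.stdev` from the Python standard library is used only as a cross-check, since it also divides by n - 1.

```python
import math
import statistics

# Vostok monthly mean temperatures (degrees F), as listed in the table.
vostok = [-31, -48, -67, -81, -82, -89, -89, -95, -90, -75, -48, -28]


def standard_deviation(values):
    """Sample standard deviation: sqrt of the summed squared deviations over n - 1."""
    n = len(values)
    ybar = sum(values) / n
    deviations = [y - ybar for y in values]  # these sum to 0 (up to rounding)
    squared = [d ** 2 for d in deviations]   # all non-negative
    variance = sum(squared) / (n - 1)        # variance = s squared
    return math.sqrt(variance)               # back to the original units


s = standard_deviation(vostok)
print("mean     =", round(sum(vostok) / len(vostok), 1))
print("variance =", round(s ** 2, 1))
print("s        =", round(s, 1))

# Cross-check against the standard library's sample standard deviation.
print("library  =", round(statistics.stdev(vostok), 1))
```

Dividing by n instead of n - 1 would give a slightly smaller value; with only 12 observations the difference is visible in the first decimal place.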

Summary of resistant and sensitive measures:

- The mean is sensitive to the effects of outliers, whereas the median is resistant to the effects of outliers.
- The standard deviation is sensitive to the effects of outliers, whereas the IQR is resistant to the effects of outliers. The IQR is resistant because it is the difference between Q3 and Q1. No outliers are used in the calculation of the IQR, since unusually large or small observations have little effect on Q3 and Q1. In contrast, all data values (including outliers) are used in the calculation of the standard deviation.

Summarizing a distribution with a measure of center and spread

Since the mean and standard deviation are not resistant, they are not appropriate for skewed distributions or distributions with outliers. They're most appropriate for symmetric distributions with no outliers.

    Situation                                  Measures to use
    Symmetric distribution with no outliers    Mean and SD, or median and IQR
    Skewed distributions                       Median and IQR
    Symmetric distributions with outliers      Median and IQR
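The difference in resistance between the standard deviation and the IQR can be seen by computing both with and without the large outlier (189) in the water-only colony counts. This is a sketch under stated assumptions: the helper `iqr` is illustrative and follows the split-into-halves quartile rule from these notes, and dropping the outlier is done here purely for demonstration, not as a recommended analysis step.

```python
import statistics

# Water-only colony counts (group 1) from the soap example.
water = [98, 118, 89, 97, 127, 87, 103, 111, 98, 108, 189, 149, 125, 105,
         88, 96, 77, 105, 90, 134, 89]


def iqr(values):
    """IQR = Q3 - Q1, with quartiles taken as medians of the lower and upper halves.

    If n is odd, the median is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half + n % 2]
    upper = ordered[half:]
    return statistics.median(upper) - statistics.median(lower)


# Same data with the single large outlier removed (for demonstration only).
without_outlier = [v for v in water if v != 189]

for label, data in [("with 189", water), ("without 189", without_outlier)]:
    print(f"{label:>12}: SD = {statistics.stdev(data):.1f}, IQR = {iqr(data):.1f}")
```

The outlier enters the standard deviation through its large squared deviation, whereas it influences the IQR only by shifting which ordered values sit at the quartile positions.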