Let's Do It! What Type of Variable?


Ch 2 Online homework list:
Describing Data Sets
Graphical Representation of Data
Summary statistics: Measures of Center
Box Plots, Outliers, and Standard Deviation

Ch 2 Online quizzes list:
Quiz 1: Introduction
Quiz 2: Data tables and graphical representation
Quiz 3: Measures of Center: Calculate and Interpret
Quiz 4: Skewness
Quiz 5: Box-plots
Quiz 6: Measures of Variability

2.1-2.3: Organizing Data

DEFINITIONS: Qualitative variables classify the units into categories. The categories may or may not have a natural ordering. Qualitative variables are also called categorical variables.

Quantitative variables have numerical values that are measurements (length, weight, and so on) or counts (of how many). Arithmetic operations on such numerical values have meaning. We further distinguish quantitative variables based on whether or not the values fall on a continuum: a continuous variable can take any value in an interval, while a discrete variable takes separated values, such as counts.

Let's Do It! What Type of Variable?
Hurricane Charley, in August 2004, has been blamed for at least 16 deaths. Listed below is information on other major storms and hurricanes that occurred from 1994 to 2003.

Storm Name               Date     Category   Estimated Damage/Cost*   Deaths
Tropical Storm Alberto   Jul-94   n/a        $1.2 billion             3
Hurricane Marilyn        Sep-95   2          $2.5 billion             13
Hurricane Opal           Oct-95   3          $3.6 billion             7
Hurricane Fran           Sep-96   3          $5.8 billion             37
Hurricane Bonnie         Aug-98   3          $1.1 billion             3
Hurricane Georges        Sep-98   2          $6.5 billion             16
Hurricane Floyd          Sep-99   2          $6.5 billion             77
Tropical Storm Allison   Jun-01   n/a        $5.1 billion             43
Hurricane Isabel         Sep-03   2          $4.0 billion             47

For each variable, determine whether it is qualitative or quantitative. If the variable is quantitative, state whether it is discrete or continuous.
(a) The name of the storm.
(b) The date the storm occurred.
(c) The category of the storm.
(d) The estimated amount of damage or cost of the storm.
(e) The number of deaths that occurred.

DEFINITIONS: The distribution of a variable provides the possible values that a variable can take on and how often these possible values occur. The distribution of a variable shows the pattern of variation of the variable.

Let's Do It! College Admissions
The following pie chart shows the breakdown of undergraduate enrollment by race at the University of Michigan for the fall term of 1996. The total number of undergraduates enrolled for that term was,604.
(a) What percentage of undergraduates enrolled were of nonwhite race?
(b) How many undergraduates enrolled had no racial category listed?

Let's Do It!
Allied Van Lines surveyed 1000 respondents in May 2004. The question asked was, "Would you move if your mate had to relocate overseas because of work?" The results are summarized in the following pie chart.

[Pie chart: slices labeled Yes, No, and Not Sure; percentages shown are 30%, 68%, and 2%.]

(a) What percentage of the respondents said that they would actually move if their mate relocated overseas?
(b) What questions would you ask about the sample selection before using this information to draw formal conclusions?

Let's Do It! Nothing Really Matters
The bar graph shown here displays the percentage of respondents who think a particular problem is the most important problem facing America for two different years. (SOURCE: The Economist, March 30-April 5, 1996, pg 33.)
(a) In January 1992, which problem category had the highest percentage of responses? Was this the same category which had the highest percentage of responses in 1996?
(b) In January 1992, what percentage of respondents reported crime as the most important problem facing America? In January 1996, what percentage of respondents reported crime as the most important problem facing America?
(c) What is the approximate sum of the percentage of respondents across all of the listed problem categories for January 1992? Is this sum approximately 100%? If not, give a possible reason why not.

Example: A Misleading Bar Graph
Problem: The bar graph that follows presents the total sales figures for three realtors. When the bars are replaced with pictures, often related to the topic of the graph, the graph is called a pictogram.

[Pictogram: Total Sales. Realtor #1: $2.05 million; Realtor #2: $1.41 million; Realtor #3: $0.9 million.]

(a) How does the height of the home for Realtor 1 compare to that for Realtor 3?
(b) How does the area of the home for Realtor 1 compare to that for Realtor 3?

What We've Learned: When you see a pictogram, be careful to interpret the results appropriately, and do not allow the area of the pictures to mislead you.

A frequency distribution is the organization of raw data in table form, using classes and frequencies. Each raw data value is placed into a quantitative or qualitative category called a class. The frequency of a class is the number of data values contained in that class. The two types of frequency distributions used most often are the categorical frequency distribution and the grouped frequency distribution.

Categorical Frequency Distributions: The categorical frequency distribution is used for data that can be placed in specific categories, such as nominal- or ordinal-level data.

Grouped Frequency Distributions: When the range of the data is large, the data must be grouped into classes that are more than one unit in width.

Let's Do It! Categorical Frequency Distributions
A survey was taken on how much trust people place in the information they read on the Internet. A = trust in everything they read, M = trust in most of what they read, H = trust in about half of what they read, S = trust in a small portion of what they read. Construct a categorical frequency distribution for the data.

M M M A H M S M H M
S M M M M A M M A M
M M H M M M H M H M
A M M M H M M M M M
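The tallying step for a categorical frequency distribution can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the variable names are made up, and the percent column is an extra that the exercise does not ask for.

```python
from collections import Counter

# Survey responses copied from the exercise (40 values).
responses = (
    "M M M A H M S M H M S M M M M A M M A M "
    "M M H M M M H M H M A M M M H M M M M M"
).split()

counts = Counter(responses)   # class -> frequency
n = len(responses)            # total number of responses

# One row per class: label, frequency, percent of all responses.
for label in ["A", "M", "H", "S"]:
    freq = counts[label]
    print(f"{label}  {freq:>3}  {100 * freq / n:5.1f}%")
```

The frequencies should add up to the number of responses, which is a quick check that no value was skipped while tallying.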

Histograms and Pie Charts

The Histogram: a graph that displays quantitative data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes.

The Pie Graph: a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution. The angle (in degrees) of each wedge is given by: Angle = relative frequency × 360.

Let's Do It! Distribution of Scores
For 108 randomly selected college applicants, the following frequency distribution for entrance exam scores was obtained.

Class limits   Frequency
90-98          6
99-107         22
108-116        43
117-125        28
126-134        9

a. Construct a relative frequency histogram for the data.
b. Applicants who score above 107 need not enroll in a summer developmental program. In this group, how many students do not have to enroll in the developmental program?

Let's Do It! Matching Shapes to Characteristics

[Figure: four histograms labeled Distribution 1, Distribution 2, Distribution 3, and Distribution 4, each with a blank "Characteristic = " to fill in.]

Characteristics:
1. Distribution of age for the population of the United States in the year 1980. Describe and explain the shape of the distribution.
2. Distribution of miles of coastline for the 50 United States. Describe and explain the shape of the distribution. Which states do you think would be in the last class, furthest to the right?
3. Distribution of the number of miles traveled to work, that is, commuting distance for employed adults in a city. Describe and explain the shape of the distribution.
4. Distribution of age at death for the population of the United States (year 1980). Describe and explain the shape of the distribution.

Pie Graph

Let's Explore It!
The assets of the richest 1% of Americans are distributed as follows. Make a pie chart for the percentages.

Category                     Percent   Degrees
Principal residence          7.8%      28.08
Liquid assets                5.0%      18.0
Pension accounts             6.9%      24.84
Stock, funds, and trusts     31.6%     113.76
Businesses & real estate     46.9%     168.84
Miscellaneous                1.8%      6.48
Total                        100%      360

[Pie chart: one wedge per category in the table, labeled with its percentage.]

Let's Do It!
The population of federal prisons, according to the most serious offenses, consists of the following. Make a pie chart of the population.

Violent offenses          12.6%
Property offenses         8.5%
Drug offenses             60.2%
Public order offenses
  Weapons                 8.2%
  Immigration             4.9%
  Other                   5.6%
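As a rough check on the degrees column, the wedge angles can be recomputed from the rule Angle = relative frequency × 360. The sketch below assumes the six percentages of the assets table (7.8, 5.0, 6.9, 31.6, 46.9, 1.8); the dictionary and variable names are illustrative only.

```python
# Percent of assets in each category, as given in the assets table.
assets = {
    "Principal residence": 7.8,
    "Liquid assets": 5.0,
    "Pension accounts": 6.9,
    "Stock, funds, and trusts": 31.6,
    "Businesses & real estate": 46.9,
    "Miscellaneous": 1.8,
}

# Angle of each wedge: relative frequency (percent / 100) times 360 degrees.
angles = {name: pct / 100 * 360 for name, pct in assets.items()}

for name, angle in angles.items():
    print(f"{name:<26} {angle:7.2f} degrees")

# The wedges should close the circle.
print(f"Total: {sum(angles.values()):.2f} degrees")
```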

2.4 Measures of Central Tendency

Suppose you had to give a single number that would represent the most typical age for the 20 subjects. What number would you choose?

DATA SET 1

Subject #   Gender   Age
1001        M        45
1002        M        41
1003        F        51
1004        F        46
1005        F        47
1006        F        42
1007        M        43
1008        F        50
1009        M        39
1010        M        32
1011        M        41
1012        F        44
1013        F        47
1014        F        49
1015        M        45
1016        F        42
1017        M        41
1018        F        40
1019        M        45
1020        M        37

Measures of center are numerical values that tend to report, in some sense, the middle of a set of data. We will focus on the mean and the median. If the data are a sample, the mean and median are called statistics. If the data form an entire population, these measures of center are called parameters.

Mean
DEFINITION: The mean of a set of n observations is simply the sum of the observations divided by the number of observations, n.

Mean age of the 20 subjects in the medical study: add the 20 ages up and divide by 20:

$$\frac{45 + 41 + 51 + 46 + 47 + \cdots + 45 + 37}{20} = 43.35 \text{ years}$$

Special notation: If $x_1, x_2, \ldots, x_n$ denote a sample of n observations, then the mean of the sample is called "x-bar" and is denoted by:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

The mean of a population is denoted by the Greek letter μ.
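The definition of the mean translates directly into code. The sketch below assumes the 20 ages of the medical study in subject order (1001 through 1020), with the values 42, 32, 42 written in full; the variable names are illustrative.

```python
# Ages of the 20 subjects, in subject order 1001..1020.
ages = [45, 41, 51, 46, 47, 42, 43, 50, 39, 32,
        41, 44, 47, 49, 45, 42, 41, 40, 45, 37]

n = len(ages)           # number of observations
total = sum(ages)       # sum of the observations
mean_age = total / n    # x-bar = (sum of x_i) / n

print(f"n = {n}, sum = {total}, mean = {mean_age:.2f} years")
```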

Let's Do It! Mean Number of Children per Household
Suppose that the number of children in a simple random sample of 10 households is as follows: 2, 3, 0, 2, 1, 0, 3, 0, 1, 4
(a) Calculate the sample mean number of children per household.
(b) Suppose that the observation for the last household in the above list was incorrectly recorded as 40 instead of 4. What would happen to the mean?

Note that 9 of the 10 observations are then less than the mean. The mean is sensitive to extreme observations. Most graphical displays would have detected this outlying observation.

Let's Do It! A Mean Is Not Always Representative
Kim's test scores are 7, 98, 5, 19, and 6. Calculate Kim's mean test score. Explain why the mean does not do a very good job at summarizing Kim's test scores.

Let's Do It! Combining Means
We have seven students. The mean score for three of these students is 54 and the mean score for the four other students is 76. What is the mean score for all seven students?

The mean = the point of equilibrium, the point where the distribution would balance.

[Figure: three balance diagrams. In the first, the largest observation is at 3 and Mean = 2. In the second, the largest observation is at 5 and Mean = 2.5. In the third, the largest observation is at 11 and Mean = 4.]

If the distribution is symmetric, as in the first picture, the mean is exactly at the center of the distribution. As the largest observation is moved further to the right, making this observation somewhat extreme, the mean shifts towards the extreme observation. If a distribution appears to be skewed, we may wish also to report a more resistant measure of center.
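Two ideas from this page lend themselves to a short sketch: how one mis-recorded value pulls the mean, and how group means combine through a weighted average. The code assumes the ten household counts 2, 3, 0, 2, 1, 0, 3, 0, 1, 4; the function names `mean` and `combined_mean` are hypothetical helpers written for this illustration.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def combined_mean(groups):
    """Overall mean from (group size, group mean) pairs.

    Each group mean is weighted by its group size, because
    size * mean recovers that group's total.
    """
    grand_total = sum(size * group_mean for size, group_mean in groups)
    grand_count = sum(size for size, _ in groups)
    return grand_total / grand_count


children = [2, 3, 0, 2, 1, 0, 3, 0, 1, 4]
miscoded = children[:-1] + [40]      # last value recorded as 40 instead of 4

print(mean(children))                # mean of the correct data
print(mean(miscoded))                # mean after the recording error

# Three students with mean 54 and four students with mean 76.
print(combined_mean([(3, 54), (4, 76)]))
```

Note that the combined mean is not the simple average of 54 and 76, since the two groups have different sizes.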

The Mean of Grouped Data / Frequency Tables

The procedure for finding the mean for grouped data uses the midpoints of the classes. This procedure is shown next.

Example: The data represent the number of miles run during one week for a sample of 20 runners.

Solution: The procedure for finding the mean for grouped data is given here.
Step 1: Make a table as shown.
Step 2: Find the midpoints of each class and enter them in column C.
Step 3: For each class, multiply the frequency by the midpoint, as shown, and place the product in column D: $1 \cdot 8 = 8$, $2 \cdot 13 = 26$, etc. The completed table is shown here.
Step 4: Find the sum of column D.
Step 5: Divide the sum by n to get the mean.

Let's Do It!
Eighty randomly selected light bulbs were tested to determine their lifetime in hours. The frequency table of the results is shown below. Find the average lifetime of a light bulb.

Life interval in hours   Frequency
53-63                    6
64-74                    12
75-85                    25
86-96                    18
97-107                   14
108-118                  5

Let's Do It!
The cost per load (in cents) of 35 laundry detergents tested by a consumer organization is given below.

Class limits   Frequency
13-19          2
20-26          7
27-33          12
34-40          5
41-47          6
48-54          1
55-61          0
62-68          2
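The five-step procedure for a grouped mean can be sketched as follows. The example assumes the light-bulb table with frequencies 6, 12, 25, 18, 14, 5 (which total the stated 80 bulbs); the tuple layout and the name `grouped_mean` are illustrative choices, not something taken from the notes.

```python
# (lower class limit, upper class limit, frequency) for each class.
light_bulbs = [
    (53, 63, 6),
    (64, 74, 12),
    (75, 85, 25),
    (86, 96, 18),
    (97, 107, 14),
    (108, 118, 5),
]


def grouped_mean(classes):
    """Approximate the mean from a frequency table using class midpoints."""
    n = sum(freq for _, _, freq in classes)                          # total frequency
    midpoints = [(low + high) / 2 for low, high, _ in classes]        # column C
    products = [freq * mid                                            # column D
                for (_, _, freq), mid in zip(classes, midpoints)]
    return sum(products) / n                                          # sum of D over n


n_bulbs = sum(freq for _, _, freq in light_bulbs)
print(n_bulbs, grouped_mean(light_bulbs))
```

Because every value in a class is treated as if it sat at the class midpoint, the result is an approximation to the mean of the raw lifetimes.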

A measure of center that is more resistant to extreme values is the median.

Median
DEFINITION: The median of a set of n observations, ordered from smallest to largest, is a value such that half of the observations are less than or equal to that value and half the observations are greater than or equal to that value.

If the number of observations is odd, the median is the middle observation. If the number of observations is even, the median is any number between the two middle observations, including either of the two middle observations. To be consistent, we will define the median as the mean or average of the two middle observations.

Location of the median: (n+1)/2, where n is the number of observations.

The ages of the n = 20 subjects: calculating (n+1)/2 we get (20+1)/2 = 10.5. So the two middle observations are the 10th and 11th observations, namely 43 and 44. The median is the mean of these two middle observations, (43+44)/2 = 43.5 years.

32 37 39 40 41 41 41 42 42 43 44 45 45 45 46 47 47 49 50 51
(10th obs = 43, 11th obs = 44, median = 43.5)

Let's Do It! Median Number of Children per Household
Find the median number of children in a household from this sample of 10 households, that is, find the median of
Number of Children: 3 0 1 4 0 3 0 1
(a) Median =
(b) What happens to the median if the fifth observation in the first list was incorrectly recorded as 40 instead of 4?
(c) What happens to the median if the third observation in the first list was incorrectly recorded as -20 instead of 0?

Note: The median is resistant, that is, it does not change, or changes very little, in response to extreme observations.
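The location rule (n+1)/2 and the resistance of the median can both be checked with a short sketch. It assumes the 20 ordered ages 32, 37, ..., 51 of the medical study and uses `statistics.median` from the Python standard library; the mis-recorded value 510 is a made-up illustration, not data from the notes.

```python
from statistics import median

# The 20 ages, ordered from smallest to largest.
ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

n = len(ages)
position = (n + 1) / 2            # 10.5: halfway between the 10th and 11th values
tenth, eleventh = ages[9], ages[10]   # lists are 0-indexed
by_hand = (tenth + eleventh) / 2

print(position, tenth, eleventh, by_hand, median(ages))

# Resistance: turn the largest age into an extreme value (hypothetical typo).
with_typo = ages[:-1] + [510]
print(median(with_typo), sum(with_typo) / n)   # median unchanged, mean moves
```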

Another Measure: The Mode

DEFINITION: The mode of a set of observations is the most frequently occurring value; it is the value having the highest frequency among the observations.

The mode of the values { 0, 0, 0, 0, 1, 1, 2, 2, 3, 4 } is 0.
For { 0, 0, 0, 1, 1, 2, 2, 2, 3, 4 } there are two modes, 0 and 2 (bimodal).
What would be the mode for { 0, 1, 2, 4, 5, 8 }? For { 0, 0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4, 5 }?

The mode is not often used as a measure of center for quantitative data. The mode can be computed for qualitative data.

[Bar chart: Percent (vertical axis, 0 to 80) by Race: American Indian, No Category, Hispanic, African-American, Asian, White.]

The modal race category is white. If the categories were coded as 1=White, 2=Asian, 3=African-American, 4=Hispanic, 5=American Indian, 6=No category listed, then the mode would be the value 1.
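A quick way to explore the mode examples is `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency. The sketch assumes the example sets with their 2s written out; the small list of race labels is a made-up illustration of a qualitative variable, not data from the notes.

```python
from statistics import multimode

one_mode = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
two_modes = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4]
all_distinct = [0, 1, 2, 4, 5, 8]

print(multimode(one_mode))       # a single most frequent value
print(multimode(two_modes))      # two values tie: bimodal
print(multimode(all_distinct))   # every value occurs once, so all of them tie

# The mode also works for qualitative data (hypothetical labels).
race = ["White", "Asian", "White", "Hispanic", "White"]
print(multimode(race))
```

When every value occurs once, all values tie for most frequent, which is one reason the mode is a weak summary for such data.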

Let's Do It! Different Measures Can Give Different Impressions
The famous trio, the mean, the median, and the mode, represent three different methods for finding a so-called center value. These three values may be the same but are more likely to be different. When they are different, they can lead to different interpretations of the data being summarized. Consider the annual incomes of five families in a neighborhood:

$1,000 $1,000 $30,000 $90,000 $100,000

(a) Calculate the average income.
(b) Calculate the median income.
(c) Calculate the modal income.
(d) If you were trying to promote that this is an affluent neighborhood, which measure might you prefer to present?
(e) If you were trying to argue against a tax increase, which measure might you prefer to present?
(f) If you want to represent these values with the income that is in the middle, which measure might you prefer to present?

Which Measure of Center to Use?

[Figure: four distribution shapes. Bell-shaped, symmetric: mean = median = mode. Bimodal: mean = median, two modes. Skewed right and skewed left: the mode, median, and mean are marked at different positions, with 50% of the area on each side of the median.]

Mean, Median, and Mode

The most common measure of center is the mean, which locates the balancing point of the distribution. The mean equals the sum of the observations, divided by how many there are. The mean is affected by extreme observations (outliers and values which are far in the tail of a distribution that is skewed). So the mean tends to be a good choice for locating the center of a distribution that is unimodal and roughly symmetric, with no outliers.

The median is a more robust measure of center, that is, it is influenced very little by extreme values. The median is the middle observation when the data are ordered from smallest to largest. If you have an odd number of values, the median is the one in the middle. If you have an even number of values, the median is the mean of the two middle values, and falls exactly halfway between them. If you have n observations, then (n+1)/2 tells you the location or position of the median. For skewed distributions or distributions with outliers, the median tends to be the better choice for locating the center.

The mode is the value(s) that occurs most often. For a distribution, the mode is the value associated with the highest peak. The most frequent value can be far from the center of the distribution, so the mode is not really a measure of center. However, the mode is the only measure of the three that can be used for qualitative data.

Tips: When you see or hear an average reported, ask which average was really computed, the mean or the median. Think about or examine the distribution of values to assess if the measure of center used is appropriate.

2.5-2.5 MEASURING VARIATION OR SPREAD

Both sets of data have the same mean, median and mode, but the values obviously differ in another respect: the variation or spread of the values. The values in List 1 are much more tightly clustered around the center value of 60. The values in List 2 are much more dispersed or spread out.

List 1: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65
mean = median = mode = 60
[Dot plot of List 1 on a scale from 35 to 85.]

List 2: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85
mean = median = mode = 60
[Dot plot of List 2 on a scale from 35 to 85.]

Range
The range is the simplest measure of variability or spread. Range is just the difference between the largest value and the smallest value. Range can give a distorted picture of the actual pattern of variation.

Two distributions: same range but different patterns of variation. The first distribution has most of its values far from the center, while the second distribution has most of its values closer to the center.

[Two dot plots, each on a scale from 20 to 30.]
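The comparison of the two lists can be reproduced with a short sketch. It assumes List 1 contains 62 and that the second list is List 2; the helper name `value_range` is made up for this illustration.

```python
from statistics import mean, median, mode

list_1 = [55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65]
list_2 = [35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


for name, data in [("List 1", list_1), ("List 2", list_2)]:
    print(name, mean(data), median(data), mode(data), value_range(data))
```

Both lists share the same three measures of center, so only a measure of spread such as the range tells them apart.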

Inter-quartile Range

The inter-quartile range measures the spread of the middle 50% of the data. You first find the median (represented by Q2, the value that divides the data into two halves), and then find the median for each half. The three values that divide the data into four parts are called the quartiles, represented by Q1, Q2, and Q3. The difference between the third quartile and the first quartile is called the inter-quartile range, denoted by IQR = Q3 - Q1.

Example: Quartiles for Age
The ages of the 20 subjects in the medical study are listed below in order.

32, 37, 39, 40, 41, 41, 41, 42, 42, 43, 44, 45, 45, 45, 46, 47, 47, 49, 50, 51

The histogram of the ages is also provided.
(a) Calculate the median age.
(b) Calculate the first quartile Q1 for this age data.
(c) Calculate the third quartile Q3 for this age data.
(d) Calculate the range for this age data.

median = 43.5, Q1 = 41, Q3 = 46.5

[Histogram of age: Count (vertical axis) against age on a scale from 30 to 55.]

We see that the distribution of age is approximately symmetric and that the quartiles are about the same distance from the median. The quartiles are actually the 25th, 50th, and 75th percentiles.

DEFINITION: The pth percentile is the value such that p% of the observations fall at or below that value and (100 - p)% of the observations fall at or above that value.

Finding the Quartiles
1. Find the median of all of the observations.
2. First Quartile = Q1 = median of observations that fall below the median.
3. Third Quartile = Q3 = median of observations that fall above the median.

Notes
When the number of observations is odd, the middle observation is the median. This observation is not included in either of the two halves when computing Q1 and Q3.
Although different books, calculators, and computers may use slightly different ways to compute the quartiles, they are all based on the same idea.
In a left-skewed distribution, the first quartile will be farther from the median than the third quartile is. If the distribution is symmetric, the quartiles should be the same distance from the median.

Five-Number Summary
Five-number summary: Minimum, Q1, Median, Q3, Maximum

Boxplot

[Figure: basic boxplot with Min, Q1, Q2 = Median, Q3, and Max marked.]

To Build a Basic Boxplot
List the data values in order from smallest to largest.
Find the five-number summary: minimum, Q1, median, Q3, and maximum.
Locate the values for Q1, the median and Q3 on the scale. These values determine the box part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.
Draw lines (called whiskers) from the midpoints of the ends of the box out to the minimum and maximum.
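The three-step quartile rule, including the note about leaving out the middle observation when n is odd, can be sketched as below. The function name `five_number_summary` is hypothetical, and the ages are assumed to be the 20 ordered values 32, 37, ..., 51. Library routines such as `statistics.quantiles` use other conventions and may give slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # observations below the median position
    upper = data[half + 1:] if n % 2 else data[half:]    # skip the middle value when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]


ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

minimum, q1, med, q3, maximum = five_number_summary(ages)
print(minimum, q1, med, q3, maximum)
print("IQR =", q3 - q1)
```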

Problem: Consider the (ordered) ages of the 20 subjects in a medical study:

32, 37, 39, 40, 41, 41, 41, 42, 42, 43, 44, 45, 45, 45, 46, 47, 47, 49, 50, 51

The five-number summary for the age data is given by: min = 32, Q1 = 41, median = 43.5, Q3 = 46.5, and max = 51. Draw the modified boxplot.

The distance between the median and the quartiles is roughly the same, supporting the rough symmetry of the distribution as seen previously from the histogram.

Using the 1.5 x IQR Rule to Identify Outliers and Build a Modified Boxplot
List the data values in order from smallest to largest.
Find the five-number summary: minimum, Q1, median, Q3, and maximum.
Locate the values for Q1, the median and Q3 on the scale. These values determine the box part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.
Find the IQR = Q3 - Q1.
Compute the quantity STEP = 1.5 x (IQR).
Find the location of the inner fences by taking 1 step out from each of the quartiles: lower inner fence = Q1 - STEP; upper inner fence = Q3 + STEP.
Draw the lines (whiskers) from the midpoints of the ends of the box out to the smallest and largest values WITHIN the inner fences.
Observations that fall OUTSIDE the inner fences are considered potential outliers. If there are any outliers, plot them individually along the scale using a solid dot.
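The 1.5 x IQR rule can be sketched for the age data as follows. The code assumes the 20 ordered ages 32, 37, ..., 51 and an even number of observations, so each half holds exactly n/2 values; all variable names are illustrative.

```python
from statistics import median

ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

half = len(ages) // 2          # n is even here, so the halves split cleanly
q1 = median(ages[:half])       # median of the lower half
q3 = median(ages[half:])       # median of the upper half

iqr = q3 - q1
step = 1.5 * iqr
lower_fence = q1 - step
upper_fence = q3 + step

# Potential outliers fall outside the inner fences.
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

# Whiskers reach the most extreme observations still within the fences.
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"IQR = {iqr}, STEP = {step}")
print(f"fences: {lower_fence} to {upper_fence}")
print(f"potential outliers: {outliers}")
print(f"whiskers end at {whisker_low} and {whisker_high}")
```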

[Figure: annotated modified boxplot. Five-number summary: min=1 Q1=1 median=3 Q3=66 max=35. Labels: inner fences; outside value (potential outlier); far outside value; farthest observations that are not potential outliers.]

Example: Any Age Outlier?
Let's apply the "rule of thumb" to our age data set to assess if there are any outliers.
(a) Construct the fences for the modified boxplot based on the 1.5 * IQR rule.
(b) Are there any outliers using the 1.5 * IQR rule?
(c) Construct the modified boxplot.

Let's Do It! Five-Number Summary and Outliers

Let's Do It! Cost of Running Shoes
The prices for 1 comparable pairs of running shoes produced the following modified boxplot.

[Modified boxplot of PRICE on a scale from 40 to 120, with one outlier marked by *.]

(a) What was the approximate range of prices for such running shoes? Range =
(b) Twenty-five percent of the shoes cost more than approximately what amount? $

Side-by-side boxplots are helpful for comparing two or more distributions with respect to the five-number summary. Although the median of the first process is closer to the target value of 0.000 cm, the second process produces a less variable distribution.

Let's Do It! Comparing Ages: Antibiotic Study
Variable = age for 23 children randomly assigned to one of two treatment groups.

(a) Give the five-number summary for each of the two treatment groups. Comment on your results.

Amoxicillin Group (n=11): 8 9 9 10 10 11 11 12 14 14 17
Five-number summary:

Cefadroxil Group (n=12): 7 8 9 9 9 10 10 11 12 13 14 18
Five-number summary:

(b) Make side-by-side modified boxplots for the antibiotic study data in part (a).
Amoxicillin: Lower fence = , Upper fence = , outliers:
Cefadroxil: Lower fence = , Upper fence = , outliers:

Standard Deviation
The standard deviation is a measure of the spread of the observations from the mean. Think of the standard deviation as an average (or standard) distance of the observations from the mean.

Example: Standard Deviation, What Is It?

Observation $x$    Deviation $x - \bar{x}$    Squared Deviation $(x - \bar{x})^2$
0                  0 - 4 = -4                 16
5                  5 - 4 = 1                  1
7                  7 - 4 = 3                  9
mean = 4           sum always = 0             sum = 26

Deviations: -4, 1, 3
Squared Deviations: 16, 1, 9

$$\text{sample variance} = \frac{(-4)^2 + 1^2 + 3^2}{3 - 1} = \frac{16 + 1 + 9}{2} = \frac{26}{2} = 13$$

$$\text{sample standard deviation} = \sqrt{13} \approx 3.6$$

Interpretation of the Standard Deviation
Think of the standard deviation as roughly an average distance of the observations from their mean. If all of the observations are the same, then the standard deviation will be 0 (i.e. no spread). Otherwise the standard deviation is positive, and the more spread out the observations are about their mean, the larger the value of the standard deviation.
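The deviation table for the three observations 0, 5, and 7 can be rebuilt step by step. This is a sketch with illustrative variable names; `statistics.variance` and `statistics.stdev` from the standard library use the same n - 1 divisor and serve as a cross-check.

```python
from math import sqrt
from statistics import stdev, variance

observations = [0, 5, 7]
n = len(observations)
x_bar = sum(observations) / n                    # mean = 4

deviations = [x - x_bar for x in observations]   # -4, 1, 3
squared = [d ** 2 for d in deviations]           # 16, 1, 9

sample_variance = sum(squared) / (n - 1)         # divide by n - 1, not n
sample_sd = sqrt(sample_variance)

print(deviations, sum(deviations))               # deviations always sum to 0
print(squared, sum(squared))
print(sample_variance, round(sample_sd, 2))
```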

If $x_1, x_2, \ldots, x_n$ denote a sample of n observations, the sample variance is denoted by:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

The sample standard deviation, denoted by s, is the square root of the variance: $s = \sqrt{s^2}$.

Shortcut formulas for computing the variance and standard deviation

$$s^2 = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1} = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}$$

These are mathematically equivalent to the preceding formulas and do not involve using the mean. They save time when repeated subtracting and squaring occur in the original formulas. They are also more accurate when the mean has been rounded.

Remarks:
The variance is measured in squared units. By taking the square root of the variance we bring this measure of spread back into the original units.
Just as the mean is not a resistant measure of center, the standard deviation, which uses the mean in its definition, is not a resistant measure of spread. It is heavily influenced by extreme values.
There are statistical arguments that support why we divide by n - 1 instead of n in the denominator of the sample standard deviation.
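The equivalence of the definitional and shortcut formulas can be checked numerically. The sketch below assumes the shortcut has the form (Σx² - (Σx)²/n)/(n - 1) and reuses the observations 0, 5, 7 from the earlier example; both function names are hypothetical.

```python
def variance_by_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def variance_by_shortcut(xs):
    """Shortcut form: uses only the sum of x and the sum of x squared."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x_squared = sum(x ** 2 for x in xs)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


observations = [0, 5, 7]
print(variance_by_definition(observations))   # 26 / 2
print(variance_by_shortcut(observations))     # (74 - 144/3) / 2
```

The shortcut needs only two running totals, which is why it suits hand calculation from a table of x and x² columns.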

Example: Consistency of Weight Loss Program

In a recent study of the effect of a certain diet on weight reduction, 11 subjects were put on the diet for two weeks and their weight loss/gain in lbs was measured (positive values indicate weight loss).

1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, -23

What is the standard deviation of the weight loss?

Solution

$\sum x = 1 + 1 + \cdots + 2.5 - 23 = -4.5$, $\sum x^2 = 1^2 + 1^2 + \cdots + 2.5^2 + (-23)^2 = 569.25$

The standard deviation of this sample is

$$s = \sqrt{\frac{569.25 - (-4.5)^2 / 11}{10}} \approx 7.5327$$

Let's Do It! Emergency Room Patients

The following are the ages of a sample of 20 patients seen in the emergency room of a hospital on a Friday night.

35 3 1 43 39 60 36 1 54 45 37 53 45 3 64 10 34 36 55

Find the standard deviation of the ages.
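The weight loss solution can be reproduced with the shortcut formula. The sketch below assumes the eleven observations are 1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5 and -23, a reading that matches the two sums stated in the solution (-4.5 and 569.25); the variable names are illustrative.

```python
import math

# Assumed reading of the eleven weight changes in lbs (positive = weight loss)
weight_loss = [1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, -23]

n = len(weight_loss)
sum_x = sum(weight_loss)                    # sum of x
sum_x2 = sum(x * x for x in weight_loss)    # sum of x squared

# Shortcut formula for the sample standard deviation
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(n, sum_x, sum_x2, round(s, 2))
# prints: 11 -4.5 569.25 7.53
```

A single subject who gained 23 lbs accounts for most of this value, which illustrates how strongly one extreme observation affects the standard deviation.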

IQR and Standard Deviation

The interquartile range, IQR, is the distance between the first and third quartiles (Q3 - Q1), and measures the spread of the middle 50% of the data. When the median is used as a measure of center, the IQR is often used as a measure of spread. For skewed distributions, or distributions with outliers, the IQR tends to be a better measure of spread if your goal is to summarize that distribution. Adding the minimum and maximum values to the median and quartiles results in the five-number summary. A graphical display of the five-number summary is a boxplot, and the length of the box corresponds to the IQR.

The standard deviation is roughly the average distance of the observed values from their mean. The mean and the standard deviation are most useful for approximately symmetric distributions with no outliers. In the next chapter we will discuss an important family of symmetric distributions, called the normal distributions, for which the standard deviation is a very useful summary.

Tip: The numerical summaries presented in this chapter provide information about the center and spread of a distribution, but a graph, such as a histogram or stem-and-leaf plot, provides the best picture of the overall shape of the distribution. Graph your data first!
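To see how the IQR and the standard deviation react to an outlier, here is a hedged Python sketch. It assumes one common textbook convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when n is odd); other conventions give slightly different quartiles. The data set and the function name are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                              # observations below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]  # observations above the median position
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

data = [1, 3, 4, 6, 7, 9, 12, 15, 40]              # hypothetical sample
with_outlier = data[:-1] + [400]                   # same sample, largest value made extreme

for sample in (data, with_outlier):
    mn, q1, med, q3, mx = five_number_summary(sample)
    iqr = q3 - q1
    print((mn, q1, med, q3, mx), "IQR =", iqr, "s =", round(statistics.stdev(sample), 1))
```

In both samples the IQR is 10, while the standard deviation grows sharply once the largest value becomes 400.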

Variance and Standard Deviation for Grouped Data

The procedure for finding the variance and standard deviation for grouped data is similar to that for finding the mean for grouped data, and it uses the midpoint of each class.

Example

The data represent the number of miles that 20 runners ran during one week. Find the variance and the standard deviation for the frequency distribution of the data.

Solution

Step 1 Make a table as shown, and find the midpoint of each class.

Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D.
$1 \cdot 8 = 8$, $2 \cdot 13 = 26$, ..., $2 \cdot 38 = 76$

Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E.
$1 \cdot 8^2 = 64$, $2 \cdot 13^2 = 338$, ..., $2 \cdot 38^2 = 2888$

Step 4 Find the sums of columns B, D, and E. The sum of column B is $n$, the sum of column D is $\sum f_i x_m$, and the sum of column E is $\sum f_i x_m^2$. The completed table is shown.

Step 5 Substitute in the shortcut formula, $s^2 = \frac{n \sum f_i x_m^2 - \left(\sum f_i x_m\right)^2}{n(n - 1)}$, and solve for $s^2$ to get the variance.

Step 6 Take the square root to get the standard deviation.
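Steps 1 to 6 can be sketched in Python as follows. The notes give only the first two and the last class (midpoints 8, 13 and 38 with frequencies 1, 2 and 2). The middle midpoints and frequencies below are assumptions made for illustration, chosen so that the frequencies total 20 runners; the variable names are also illustrative.

```python
import math

# Class midpoints (column C); 18 to 33 are assumed from the spacing of 8, 13, ..., 38
midpoints = [8, 13, 18, 23, 28, 33, 38]
# Frequencies (column B); the middle four values are assumed, not taken from the notes
frequencies = [1, 2, 3, 5, 4, 3, 2]

n = sum(frequencies)                                                 # sum of column B
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))          # sum of column D
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))    # sum of column E

# Shortcut formula applied to the grouped sums
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
std_dev = math.sqrt(variance)

print(n, sum_fx, sum_fx2, round(variance, 1), round(std_dev, 1))
```

Because every observation in a class is replaced by the class midpoint, the result is an approximation to the variance of the raw data.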

Let's Do It!

The data show the distribution of the birth weight (in oz.) of 100 consecutive deliveries. Find the variance and the standard deviation.

Interval            Frequency
29.50-69.45         5
69.50-89.45         10
89.50-99.45         11
99.50-109.45        19
109.50-119.45       17
119.50-129.45       20
129.50-139.45       12
139.50-149.45       6

Chapter Objectives:

Identify types of variables (quantitative / categorical).
Compute percentages and answer questions using charts and histograms.
Identify misleading pictograms.
Construct a frequency table for categorical data.
Construct histograms of frequency tables.
Construct a frequency table from a histogram.
Understand that histograms match the characteristics of the population.
Compute measures of central tendency (the three M's: mean/median/mode) of data sets.
Combining means: compute the overall mean of two groups using their averages (chapter handout, Kim's Scores).
Understand the effect of extreme values on the mean (sensitive) and the median (resistant).
Compute the mean of a frequency table.
Understand the effect of the shape of the distribution on the mean, median, and mode.
Compute the range of a data set.
Find the 5-number summary (Min, Q1, Median, Q3, Max).
Draw a basic box-plot.
Find the 5-number summary from a modified box-plot.
Identify potential outliers using the 1.5*IQR rule.
Draw a modified box-plot.
Remember that the variance and standard deviation of a sample are different (in formula) from the population's.
Compute the variance and standard deviation of data sets.
Compute the variance and standard deviation of a frequency table.