Math 223 Lecture Notes 3/15/04 From The Basic Practice of Statistics, by Moore


Chapter 3 continued

Describing distributions with numbers

Measuring spread of data: Quartiles

Definition 1: The interquartile range (IQR) of a set of measurements is defined to be the difference between the upper and lower quartiles, i.e. $\mathrm{IQR} = Q_3 - Q_1$.

As we have seen from box-and-whisker plots, the interquartile range is especially useful when comparing the spreads of two distributions. The IQR can also be used to detect outliers:

Example 1: The 1.5 × IQR criterion. A common criterion for detecting suspected outliers in a data set is as follows: call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile. The data on the volume of acorns (in cubic centimeters) from 39 species of oaks are given in today's Minitab worksheet. Use a stem-and-leaf plot to find the outliers. Then see whether these satisfy the 1.5 × IQR criterion.

Measuring spread of data: variance and standard deviation

Recall that $\bar{x}$ denotes the mean of a set $x_1, \dots, x_n$ of observations.

Definition 2: Deviations. The deviations of the data set $x_1, \dots, x_n$ are the numbers
$$x_1 - \bar{x},\; x_2 - \bar{x},\; \dots,\; x_n - \bar{x}.$$

Definition 3: Variance. The variance $s^2$ of the data set $x_1, \dots, x_n$ is
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2.$$
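The 1.5 × IQR criterion from Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quartiles are taken as the medians of the lower and upper halves of the sorted data (Minitab and other software may interpolate differently), the function names are made up for this sketch, and the sample values are hypothetical rather than the acorn data from the worksheet.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: when the number of observations is odd, the overall
    median is left out of both halves. Other conventions exist.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    half = len(data) // 2
    lower = data[:half]    # observations below the median position
    upper = data[-half:]   # observations above the median position
    return median(lower), median(upper)


def suspected_outliers(values, k=1.5):
    """Return observations more than k * IQR outside the quartiles."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in values if x < low_fence or x > high_fence]


# Hypothetical measurements, NOT the acorn volumes from the worksheet.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
q1, q3 = quartiles(sample)
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
print("Suspected outliers:", suspected_outliers(sample))
```

A value that looks extreme in a stem-and-leaf plot is not always flagged by this criterion, which is why Example 1 asks for both checks.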

Definition 4: Standard deviation. The standard deviation $s$ of the data set $x_1, \dots, x_n$ is the nonnegative square root of the variance, i.e.
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}.$$

Why we divide by $n - 1$ when computing $s$ and $s^2$

We denote by $\sigma^2$ the variance of measurements for a whole population, while $s^2$ is used to denote the variance of the measurements from a sample of the population. Suppose that we wanted to estimate the variance $\sigma^2$ of the heights of all the adults in the world. Obviously we can't compute $\sigma^2$ exactly, but we can compute the variance $s^2$ of a random sample of the population. We hope that $s^2$ will be close to $\sigma^2$. In fact, let's suppose that we select many random samples, and compute the variances $s_1^2, s_2^2, \dots$ for each sample using the formula in Definition 3. Then the average of $s_1^2, s_2^2, \dots$ would be close to $\sigma^2$. For this reason, $s^2$ is called an unbiased estimator for $\sigma^2$. On the other hand, suppose that we computed $s^2$ by dividing by $n$ instead of $n - 1$. Then the average of $s_1^2, s_2^2, \dots$ would underestimate $\sigma^2$.

Properties of the standard deviation

• $s$ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
• $s = 0$ only if there is no spread, which happens only when all the observations have the same value. As the observations become more spread out about their mean, $s$ gets larger.
• $s$ has the same units as the original observations. For example, if the data set is weights of people in pounds, then $s$ also has units of pounds. This is one reason to prefer $s$ to $s^2$, which has units of pounds squared.
• $s$ is not resistant (to outliers). Strong outliers or skewness can greatly increase $s$.

Choosing a summary of data

• The five number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use $\bar{x}$ and $s$ only for reasonably symmetric distributions that are free of outliers.
• A graph gives the best overall picture of a distribution. There are certain features of a distribution, such as gaps, that are not revealed by numerical summaries. Always plot your data.
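The earlier claim that dividing by $n$ instead of $n - 1$ underestimates $\sigma^2$ can be checked with a small simulation. The sketch below is illustrative only: the normal population with $\sigma = 2$, the sample size of 5, the number of repetitions, and the function names are all assumptions chosen for this example, not part of the lecture.

```python
import random


def sum_of_squared_deviations(sample):
    """Sum of (x_i - xbar)^2 over the sample."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)


def average_variance_estimates(sigma, n, repetitions, seed=0):
    """Average both versions of the sample variance over many samples.

    Returns (average using n - 1, average using n). The population is
    assumed normal with mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_n_minus_1 = 0.0
    total_n = 0.0
    for _ in range(repetitions):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        ss = sum_of_squared_deviations(sample)
        total_n_minus_1 += ss / (n - 1)
        total_n += ss / n
    return total_n_minus_1 / repetitions, total_n / repetitions


avg_unbiased, avg_biased = average_variance_estimates(
    sigma=2.0, n=5, repetitions=20000
)
print("population variance:      ", 2.0 ** 2)
print("average s^2 (divide n-1): ", round(avg_unbiased, 3))
print("average s^2 (divide n):   ", round(avg_biased, 3))
```

With small samples the gap is easy to see, because the divide-by-$n$ average is smaller than the other by exactly the factor $(n - 1)/n$.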

Example 2: Roger Maris. New York Yankee Roger Maris held the single-season home-run record from 1961 until 1998. Here are Maris's home run counts for his 10 years in the American League (these are also in today's Minitab worksheet):

14, 28, 16, 39, 61, 33, 23, 26, 8, 13.

(a) Make a stem-and-leaf plot of the data. Which is the outlier?
(b) Use Minitab to find $\bar{x}$ and $s$.
(c) Now find $\bar{x}$ and $s$ for the 9 observations that remain when you leave out the outlier. How does the outlier affect the values of $\bar{x}$ and $s$?

Example 3: State SAT scores. Average SAT scores for the states and the District of Columbia are given in today's worksheet. Find the basic statistics for both the math and verbal scores separately. Then construct stem-and-leaf plots for the math and verbal scores separately. What important feature of the distributions do the numerical summaries fail to reveal?

The Empirical Rule

Suppose that a data set has a "mound" or "bell-shaped" histogram. This means that the histogram has a single peak, is symmetric, and tapers off gradually in the tails. Let $\bar{x}$ be the mean and $s$ be the standard deviation of the data. Then the Empirical Rule, or 68-95-99.7 rule, says that approximately

• 68% of the data lie between $\bar{x} - s$ and $\bar{x} + s$,
• 95% of the data lie between $\bar{x} - 2s$ and $\bar{x} + 2s$,
• 99.7% of the data lie between $\bar{x} - 3s$ and $\bar{x} + 3s$.
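For Example 2, the effect of a single extreme observation on $\bar{x}$ and $s$ can also be seen without Minitab. The sketch below applies the variance formula from Definition 3 directly. It assumes that the observation set aside in part (c) is the largest count, and the function names are chosen only for this illustration.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean xbar."""
    return sum(values) / len(values)


def sample_sd(values):
    """Standard deviation s, dividing by n - 1 as in Definition 3."""
    xbar = mean(values)
    squared_deviations = sum((x - xbar) ** 2 for x in values)
    return sqrt(squared_deviations / (len(values) - 1))


# Home run counts from Example 2.
maris = [14, 28, 16, 39, 61, 33, 23, 26, 8, 13]

# Assumption: the outlier is the largest count.
without_outlier = [x for x in maris if x != max(maris)]

for label, data in (("all 10 seasons", maris),
                    ("outlier removed", without_outlier)):
    print(f"{label}: mean = {mean(data):.2f}, s = {sample_sd(data):.2f}")
```

Comparing the two printed lines shows that neither $\bar{x}$ nor $s$ is resistant: both drop when the extreme season is left out.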

Example 4: A histogram of the heights of 1000 women aged 18 to 24 was found to have a bell shape. Also, the mean and standard deviation of the heights are 64.5 inches and 2.5 inches, respectively.

(a) About how many of the women are taller than 66 inches?
(b) About how many of the women are taller than 59.5 inches but shorter than 66 inches?

Summarizing Data from More Than One Variable

Contingency table

Also called an $r \times c$ contingency table, where $r$ = number of rows and $c$ = number of columns. Used to summarize data from two qualitative (i.e. categorical) variables.
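Returning to Example 4, a first step with the Empirical Rule is to write down the three intervals and locate each cutoff in standard-deviation units. The sketch below does only that arithmetic. The function names are assumptions for this illustration, and the expected counts are approximations in the same sense as the rule itself. Note that the rule gives direct answers only for cutoffs that sit a whole number of standard deviations from the mean, so a cutoff elsewhere can only be bracketed by it.

```python
# Approximate percentages from the 68-95-99.7 rule, keyed by the
# number of standard deviations on each side of the mean.
EMPIRICAL_PERCENT = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_intervals(mean, sd):
    """Return {k: (low, high, percent)} for k = 1, 2, 3."""
    return {
        k: (mean - k * sd, mean + k * sd, percent)
        for k, percent in EMPIRICAL_PERCENT.items()
    }


def standardized(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


mean_height, sd_height, n_women = 64.5, 2.5, 1000

for k, (low, high, percent) in empirical_intervals(mean_height, sd_height).items():
    expected = round(n_women * percent / 100)
    print(f"within {k} sd: {low} to {high} inches, about {expected} women")

for cutoff in (59.5, 66):
    z = standardized(cutoff, mean_height, sd_height)
    print(f"{cutoff} inches is {z:+.1f} standard deviations from the mean")
```

Because the histogram is symmetric, the percentage outside an interval splits evenly between the two tails, which is the step needed to turn these intervals into one-sided counts.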

Example 5: A company operates four machines on three shifts each day. From production records, the following data on the number of breakdowns are collected. This is a 3 × 4 contingency table.

Number of breakdowns

             Machines
Shift      A     B     C     D
1         41    20    12    16
2         31    11     9    14
3         15    17    16    10

Stacked bar graph

Example 6: Refer to the preceding table. For each machine separately, we want to display the percentages of breakdowns of that machine that occurred in shifts 1, 2, and 3. To do this we can use a stacked bar graph. First, the tables below are computed. In the second table, each column contains the percentages of breakdowns that occurred in shifts 1, 2, and 3 for that particular machine.

Number of breakdowns

             Machines
Shift      A     B     C     D
1         41    20    12    16
2         31    11     9    14
3         15    17    16    10
Total     87    48    37    40

Percentages of breakdowns

             Machines
Shift      A      B      C      D
1       47.1   41.7   32.4   40
2       35.6   22.9   24.3   35
3       17.2   35.4   43.2   25
Total   99.9  100     99.9  100

Now, to make the stacked bar graph, place A, B, C, D on the horizontal axis. For each of A, B, C, D, stack three blocks whose heights equal the percentages for shifts 1, 2, and 3.
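The column totals and percentages in Example 6 can be reproduced with a short calculation. This sketch takes the counts from the breakdown table. The dictionary layout and function name are assumptions made for the illustration, and percentages are rounded to one decimal place as in the table.

```python
# Breakdown counts from Example 5: one list per machine,
# ordered by shift 1, 2, 3.
breakdowns = {
    "A": [41, 31, 15],
    "B": [20, 11, 17],
    "C": [12, 9, 16],
    "D": [16, 14, 10],
}


def column_percentages(counts):
    """Percent of a machine's breakdowns in each shift, to 1 decimal."""
    total = sum(counts)
    return [round(100 * count / total, 1) for count in counts]


totals = {machine: sum(counts) for machine, counts in breakdowns.items()}
percent = {machine: column_percentages(counts)
           for machine, counts in breakdowns.items()}

for machine in breakdowns:
    rounded_sum = round(sum(percent[machine]), 1)
    print(machine, "total:", totals[machine],
          "percentages:", percent[machine],
          "sum of rounded percentages:", rounded_sum)
```

The rounded percentages for a machine need not add to exactly 100, which is why two of the column totals in the table read 99.9.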

Cluster bar graph

A cluster bar graph displays the relationship between a combination of qualitative (categorical) variables and a single quantitative variable. The qualitative variables go on the horizontal axis and the quantitative variable goes on the vertical axis.

Example 7: Majors for men and women. A study of the career plans of women and men was made. One question asked which major the student had chosen. Here are the data:

                  Female   Male
Accounting            68     56
Administration        91     40
Economics              5      6
Finance               61     59

Make a cluster bar graph of the data, where each cluster of bars corresponds to a major. What is another way to make a cluster bar graph of this data?
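One way to think about the two layouts in Example 7 is as two groupings of the same counts. The sketch below is a rough text-only stand-in for a real graph: it clusters the bars by major and then regroups them, under the assumption that the alternative layout meant in the example is one cluster per sex. The function names and the choice of one `#` per two students are hypothetical.

```python
# Counts from Example 7, clustered by major.
by_major = {
    "Accounting": {"Female": 68, "Male": 56},
    "Administration": {"Female": 91, "Male": 40},
    "Economics": {"Female": 5, "Male": 6},
    "Finance": {"Female": 61, "Male": 59},
}


def regroup(table):
    """Swap the clustering variable: inner labels become the clusters."""
    swapped = {}
    for outer_label, bars in table.items():
        for inner_label, count in bars.items():
            swapped.setdefault(inner_label, {})[outer_label] = count
    return swapped


def text_cluster_chart(table, scale=2):
    """Render each cluster as a group of bars, one '#' per `scale` units."""
    lines = []
    for cluster, bars in table.items():
        lines.append(cluster)
        for label, count in bars.items():
            lines.append(f"  {label:<16}{'#' * (count // scale)} {count}")
    return "\n".join(lines)


print(text_cluster_chart(by_major))
print()
print(text_cluster_chart(regroup(by_major)))
```

The first chart makes it easy to compare women and men within a major, while the second makes it easy to compare majors within each sex.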

Scatterplots

A scatterplot is used to display the relationship between two quantitative variables.

Definition 5: Explanatory and response variables. Given a pair of related variables, the variable that causes changes in the other variable is called the explanatory variable. The other variable is called the response variable.

Example 8: There is a relationship between the altitude of a city and the air pressure in that city. Which variable is the explanatory variable and which variable is the response variable?

In a scatterplot, we place the explanatory variable on the horizontal axis and the response variable on the vertical axis.

Example 9: Heating a home. For each of 16 months, a household records average natural gas consumption (in hundreds of cubic feet) and the number of degree-days for that month (one degree-day is accumulated for each degree a day's average temperature falls below 65°F; an average temperature of 20°F, for example, corresponds to 45 degree-days). The data are given in today's Minitab worksheet. Make a scatterplot of the data.

Examining a scatterplot

Look for the overall pattern and for striking deviations from the pattern. You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.

Positive association and negative association

Two variables are positively associated when above-average values of one tend to accompany above-average values of the other and below-average values also tend to occur together. Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.

Example 10: Thoroughly describe the scatterplot from Example 9.
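Two ideas from this section lend themselves to a short calculation: the degree-day count in Example 9 and the direction of an association. The sketch below is illustrative only. It assumes that days at or above 65°F contribute zero degree-days, and it judges direction by the sign of the summed products of deviations from the means, which is a numerical stand-in for the verbal definition above rather than a method given in the notes. The function names and all data values are hypothetical.

```python
def degree_days(daily_avg_temps_f, base=65):
    """Total degree-days: one per degree a day's average falls below base.

    Assumption: days at or above the base temperature contribute zero.
    """
    return sum(max(0, base - temp) for temp in daily_avg_temps_f)


def association_direction(xs, ys):
    """Classify direction by the sign of sum((x - xbar) * (y - ybar)).

    A positive sum means above-average x values tend to go with
    above-average y values; a negative sum means the opposite.
    """
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    total = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "no clear direction"


# The single-day case mentioned in Example 9.
print(degree_days([20]))

# Hypothetical monthly values, NOT the worksheet data.
monthly_degree_days = [10, 20, 30, 40]
monthly_gas_use = [2, 4, 5, 8]
print(association_direction(monthly_degree_days, monthly_gas_use))
```

The sign alone says nothing about form or strength, so it complements a scatterplot rather than replacing it.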