CHAPTER 1. Introduction

Similar documents
Describing distributions with numbers

Describing distributions with numbers

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

Chapter 2: Tools for Exploring Univariate Data

Lecture 2. Quantitative variables. There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data:

are the objects described by a set of data. They may be people, animals or things.

CHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things.

F78SC2 Notes 2 RJRC. If the interest rate is 5%, we substitute x = 0.05 in the formula. This gives

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

Chapter 1: Exploring Data

Chapter 5: Exploring Data: Distributions Lesson Plan

Exercises from Chapter 3, Section 1

3.1 Measure of Center

Math 140 Introductory Statistics

Math 140 Introductory Statistics

CHAPTER 2: Describing Distributions with Numbers

********************************************************************************************************

Lecture 1: Descriptive Statistics

Chapter 3. Data Description

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

MATH 1150 Chapter 2 Notation and Terminology

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

Section 3. Measures of Variation

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode.

Resistant Measure - A statistic that is not affected very much by extreme observations.

Histograms allow a visual interpretation

Chapter 4. Displaying and Summarizing. Quantitative Data

Further Mathematics 2018 CORE: Data analysis Chapter 2 Summarising numerical data

Chapter 3: Displaying and summarizing quantitative data p52 The pattern of variation of a variable is called its distribution.

1.3: Describing Quantitative Data with Numbers

Sampling, Frequency Distributions, and Graphs (12.1)

Chapter 3. Measuring data

Topic 3: Introduction to Statistics. Algebra 1. Collecting Data. Table of Contents. Categorical or Quantitative? What is the Study of Statistics?!

Introduction to Statistics

AP Final Review II Exploring Data (20% 30%)

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

2011 Pearson Education, Inc

Chapter 5: Exploring Data: Distributions Lesson Plan

Statistics for Managers using Microsoft Excel 6 th Edition

Chapter 1 - Lecture 3 Measures of Location

MATH 2560 C F03 Elementary Statistics I Lecture 1: Displaying Distributions with Graphs. Outline.

Perhaps the most important measure of location is the mean (average). Sample mean: where n = sample size. Arrange the values from smallest to largest:

Lecture 2 and Lecture 3

Elementary Statistics

TOPIC: Descriptive Statistics Single Variable

Chapter 3: Displaying and summarizing quantitative data p52 The pattern of variation of a variable is called its distribution.

Range The range is the simplest of the three measures and is defined now.

STAT 200 Chapter 1 Looking at Data - Distributions

Unit 2. Describing Data: Numerical

Describing Distributions with Numbers

1 Measures of the Center of a Distribution

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 3.1- #

MATH 117 Statistical Methods for Management I Chapter Three

1. Exploratory Data Analysis

Summarising numerical data

1.3.1 Measuring Center: The Mean

P8130: Biostatistical Methods I

Week 1: Intro to R and EDA

Chapter 6 Group Activity - SOLUTIONS

Lecture 1: Description of Data. Readings: Sections 1.2,

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

Topic 5: Statistics 5.3 Cumulative Frequency Paper 1

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Units. Exploratory Data Analysis. Variables. Student Data

STOR 155 Introductory Statistics. Lecture 4: Displaying Distributions with Numbers (II)

Math 223 Lecture Notes 3/15/04 From The Basic Practice of Statistics, bymoore

Measures of the Location of the Data

Performance of fourth-grade students on an agility test

Determining the Spread of a Distribution

Number of fillings Frequency q 4 1. (a) Find the value of q. (2)

Chapter 1. Looking at Data

3.1 Measures of Central Tendency: Mode, Median and Mean. Average a single number that is used to describe the entire sample or population

Section 3.2 Measures of Central Tendency

Determining the Spread of a Distribution

STRAND E: STATISTICS. UNIT E4 Measures of Variation: Text * * Contents. Section. E4.1 Cumulative Frequency. E4.2 Box and Whisker Plots

Example 2. Given the data below, complete the chart:

Descriptive Statistics

Lecture 6: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

Chapter 3 Data Description

Homework Example Chapter 1 Similar to Problem #14

UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 540W - Introduction to Biostatistics Fall 2004

ECLT 5810 Data Preprocessing. Prof. Wai Lam

IB Questionbank Mathematical Studies 3rd edition. Grouped discrete. 184 min 183 marks

Revision Topic 13: Statistics 1

Unit 2: Numerical Descriptive Measures

Lecture 11. Data Description Estimation

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

The empirical ( ) rule

Honors Algebra 1 - Fall Final Review

Review for Exam #1. Chapter 1. The Nature of Data. Definitions. Population. Sample. Quantitative data. Qualitative (attribute) data

CIVL 7012/8012. Collection and Analysis of Information

Chapter 1: Introduction. Material from Devore s book (Ed 8), and Cengagebrain.com

Chapter 2: Descriptive Analysis and Presentation of Single- Variable Data

GRACEY/STATISTICS CH. 3. CHAPTER PROBLEM Do women really talk more than men? Science, Vol. 317, No. 5834). The study

Instructor: Doug Ensley Course: MAT Applied Statistics - Ensley

MATH-A Day 8 - Stats Exam not valid for Paper Pencil Test Sessions

A is one of the categories into which qualitative data can be classified.

Descriptive Statistics-I. Dr Mahmoud Alhussami

ORGANIZATION AND DESCRIPTION OF DATA

Stat 101 Exam 1 Important Formulas and Concepts 1

Transcription:

CHAPTER 1 Introduction Engineers and scientists are constantly exposed to collections of facts, or data. The discipline of statistics provides methods for organizing and summarizing data, and for drawing conclusions based on information contained in the data. An investigator who has collected data may wish simply to summarize and describe important features of the data. This entails using methods from descriptive statistics. Descriptive Statistics consists of methods for organizing and summarizing information. Some of these methods involve calculation of numerical summary measures, such as means, standard deviations, etc. Other descriptive methods are graphical in nature. Notation: Given a data set consisting of n observations on some variable x, the individual observations will be denoted by x 1, x 2, x 3,..., x n (though any other letter could be used in place of x). Measures of Location (1.4) Descriptive measures that indicate where the center or most typical value of a data set lies called measures of center. measures of center sample mean sample median The sample mean x of observations is given by x 1, x 2,..., x n x = x 1 + x 2 +... + x n n n i=1 = x i n The sample median is the number that divides the bottom 50% of the data from the top 50%. To obtain the median of a data set, we arrange the data in increasing order and then determine the middle value in the ordered list. When the observations are denoted by x 1, x 2,..., x n, we will use the symbol x to represent the sample median. How to find the sample median 1

If the number of observations n is odd, then the sample median x is the observation exactly in the middle of the ordered list, i.e. the ( n+1 )th 2 ordered value. If the number of observations is even, then x is the mean of the two middle observations in the ordered list, i.e. the average of ( n 2 )th and ( n 2 + 1) th ordered values. Ex.1 on p. 20: A manufacturer of electronic components is interested in determining the lifetime of a certain type of battery. A sample, in hours of life, is as follows: 123, 116, 122, 110, 175, 126, 125, 111, 118, 117 a. Find the sample mean and median. b. What feature in this data set is responsible for the substantial difference between the two? Trimmed means: Motivation Note that x and x are at opposite extremes of the same family of measures. After the data set is ordered, x is computed by throwing away as many values on each end as one can without eliminating everything (leaving just one or two middle values) and averaging what is left, whereas to compute x one throws away nothing before averaging. To paraphrase, the mean involves trimming 0% from each end of the sample, whereas for the median the maximum possible amount is trimmed from each end. A trimmed mean is a compromise between x and x. A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what is left over. Example: The following is a set of algebra final exam scores: a. The sample mean = b. The sample median = c. The 10% trimmed mean of the data = 0 58 61 63 67 69 70 71 78 80 Measures of Variability (1.5) Different samples may have identical measures of center yet differ from one another in other important ways. The first sample has the largest amount of variability, the third has the smallest amount. The sample range is the difference between the largest and the smallest sample value. 2

30 40 50 60 70 1 2 3 Figure 1: Samples with identical measures of center but different amounts of variability 3

The sample variance: The deviation from the mean are x 1 x, x 2 x,..., x n x. A simple way to combine the deviations into a single quantity is to average them. Unfortunately, there is a major problem with this suggestion: so that the average deviation is always zero. sum of deviations = n (x i x) = 0 How can we change the deviations to nonnegative quantities so the positive and negative deviations do not counteract one another when they are combined? One possibility is to work with the absolute values of the deviations and calculate the average absolute deviation x i x /n. Because the absolute value operation leads to a number of theoretical difficulties, consider instead (x i x) 2 /n, but for several reasons we will divide the sum of squared deviations by n 1 rather than n. i=1 The sample variance, denoted by s 2, is given by s 2 (xi x) 2 = n 1 The sample standard deviation, denoted by s, is the (positive) square root of the variance: s = s 2 An alternative expression for the numerator of s 2 is (xi x) 2 = x 2 i ( x i ) 2 n The unit for s is the same as the unit for each of the x i s Example: The Bureau of Labor Statistics lists average hourly waged for 8 categories of law-related occupations. The unites are dollars per hour. 30 17 36 14 17 12 15 17 The sample mean is x = 30 + 17 + 36 +... 17 8 = 158 8 = 19.7517 4

15 20 25 30 35 19.75 Figure 2: Average hourly wages The deviations show how spread out the data are about their mean. Observations Deviations Squared deviations x i x i x (x i x) 2 30 30 19.75 = 10.25 10.25 2 = 105.0625 17 17 19.75 = 2.75 ( 2.75) 2 = 7.5625 36 36 19.75 = 16.25 16.25 2 = 264.0625 14 41 19.75 = 5.75 ( 5.75) 2 = 33.0625 17 17 19.75 = 2.75 ( 2.75) 2 = 7.5625 12 12 19.75 = 7.75 ( 7.75) 2 = 60.0625 15 15 19.75 = 4.75 ( 4.75) 2 = 22.5625 17 17 19.75 = 2.75 ( 2.75) 2 = 7.5625 sum = 0 sum = 507.5 The standard deviation is s 2 = 507.5 7 = 72.5 s = 72.5 = 8.5147 $ per hour Discrete and Continuous Data (1.6) A variable is discrete if its set of possible values either is finite or else can be listed in an infinite sequence. A variable is continuous if its possible values consist of an entire interval on the number line. 5

Graphical Methods and Data Description (1.7) Stem-and-Leaf Displays (stem-plot): Example: Consider the following humidity readings rounded to the nearest percent: 29 44 12 53 21 34 39 25 48 23 17 24 27 32 34 15 42 21 28 37 This can also be written as Stem Leaf 1 2 5 7 2 1 1 3 4 5 7 8 9 3 2 4 4 7 9 4 2 4 8 5 3 Steps for Constructing a Stem-and-Leaf Display Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column. Write each leaf in the row to the right of its stem, in increasing order out from the stem. Histograms A frequency distribution is a table that divides a set of data into a suitable number of classes (categories), showing also the number of items belonging to each class. For the previous example Stem Leaf 1 2 5 7 2 1 1 3 4 5 7 8 9 3 2 4 4 7 9 4 2 4 8 5 3 we might group these data into the following distribution: Class interval Class midpoint Frequency, f Relative frequency 10 19 14.5 3 0.15 20 29 24.5 8 0.40 30 39 34.5 5 0.25 40 49 44.5 3 0.15 50 59 54.5 1 0.05 6

Class interval Class midpoint Frequency, f Relative frequency 10 19 14.5 3 0.15 20 29 24.5 8 0.40 30 39 34.5 5 0.25 40 49 44.5 3 0.15 50 59 54.5 1 0.05 Relative frequency histogram Histogram Shapes A histogram is symmetric if the left half is a mirror image of the right half. A histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail A histogram is negatively skewed if the stretching is to the left. Using Minitab Example: Ex 7 on p. 20 The following data represent the length of life in years, measured to the nearest tenth, of 30 similar fuel pumps: 2.0 3.0 0.3 3.3 1.3 0.4 0.2 6.0 5.5 6.5 0.2 2.3 1.5 4.0 5.9 1.8 4.7 0.7 4.5 0.3 1.5 0.5 2.5 5.0 1.0 6.0 5.6 6.0 1.2 0.2 First we store the data in column C1. Then we proceed in the following manner. Mean Choose Calc > Column Statistics... Select the Mean option bottom from the Statistic field. Click in the Input variable text box and specify C1. Click OK Median 7

Using the same steps except selecting the Median option bottom instead of the Mean option bottom, we obtain the median Stem-and-Leaf Choose Graph > Stem-and-Leaf Specify C1 in the Variable text box. Click in the Increment text box and type 1 (This tells Minitab that the distance between the smallest possible number on one line and the smallest possible number on the next line should be 1 Click OK Histogram Choose Graph > Histogram Specify C1 in the X text box for Graph 1 Click OK Fundamental Sampling Distribution and Data Description: Ch 8 Data Display and Graphical Methods Quartiles: The quartiles of a data set divide it into quarters, or four equal parts. The first quartiles, is the number that divides the bottom 25% of the data from the top 75% ; the second quartile is the median; and the third quartile, is the number that divides the bottom 75% of the data from the top 25%. The first and third quartiles are the 25th and 75th percentiles, respectively The interquartile range, denoted IQR, is the difference between the first and third quartiles; that is, Roughly speaking, the IQR gives the range of the middle 50% of the observations. Outliers: are observations that fall well outside the overall pattern of the data. We can use quartiles and the IQR to identify potential outliers. Outliers are numbers that lie, respectively, 1.5 IQRs below the first quartile and 1.5 IQRs above the third quartile Box and Whisker Plot or Boxplot We will go over Example 8.3 on p. 202 using minitab. Comparative Boxplots 8

A comparative or side-by-side boxplot is a very effective way of revealing similarities and differences between two or more data sets. We will go over Ex 3. from ch 1 using minitab. Comparative Stem-Plots A study at 35 large city high schools given the following back-to-back stem-plot of the percentages of students who say they have tried alcohol. School Year School Year 1995 96 1998 99 0 2 1 9 2 6 6 3 3 1 8 4 3 1 1 4 0 2 9 9 9 8 6 5 3 2 2 0 5 3 3 4 6 9 8 7 6 6 5 1 1 6 1 2 2 2 7 5 4 3 2 2 7 0 1 3 3 5 5 6 7 8 9 5 4 0 8 2 3 4 5 8 8 9 0 9 0 1 1 2 Which of the following does not follow from the above data? (A) In general, the percentage of students trying alcohol seems to have increased from 1995 96 to 1998 99 (B)The median alcohol percentage among the 35 schools increased from 1995 96 to 1998 99. (C) The spread between the the lowest and highest alcohol percentages decreased from 1995 96 to 1998 99. (D) For both school years in most of the 35 schools, most of the students said they had tried alcohol. (E) The percentage of students trying alcohol increased in each of the schools between 1995 96 and 1998 99. 9

Review 1. A sample was taken of the salaries of four employees from a large company. The following are their salaries (in thousands of dollars) for this year. The variance of their salaries is a. 5.1 b. 26 c. 31 2. Consider the following Stem-and-Leaf plot. 1 6 2 2 4 8 9 3 0 1 1 2 3 6 7 8 4 0 5 8 5 0 1 8 6 1 The mean of the data represented in this stem-plot is a. 34.5 b. 37 c. cannot be computed from the information given. 33 31 24 36 3. In a class of 25 students, 22 students had grades between 71 and 80 (both endpoints included), and three students had grades between 91 and 100 (both endpoints included). For these data, a. the median must be between 71 and 80. b. the mean must be between 71 and 80. c. both of the above. 10

2 4 6 8 10 Figure 3: Physical Flexibility Rating 0 20 40 60 80 2 4 6 8 10 x Figure 4: Middle-Aged Men 4. Five hundred randomly selected middle-aged men and five hundred randomly selected young adult men were rated on a scale from 1 to 10 on their physical flexibility, with 10 being the most flexible. Their rating appear in the frequency table below. For example, 17 middle-aged men had a flexibility rating of 1. Frequency of Frequency of Physical Flexibility Rating Middle-aged Men Young Adult Men 1 17 4 2 31 17 3 49 29 4 71 39 5 70 54 6 87 69 7 78 83 8 54 93 9 34 73 10 9 39 (a) Display these data graphically so that the flexibility of middle-aged men and young adult men can be easily compared. (b) Write a few sentences comparing the flexibility of middle-aged men with the flexibility of young adult 11

2 4 6 8 10 0 20 40 60 80 y Figure 5: Young Adult Men 5. A consumer advocate conducted a test of two popular gasoline additives, A and B. There are claims that the use of either of these additives will increase gasoline mileage in cars. A random sample of 30 cars was selected. Each car was filled with gasoline and the cars were run under the same driving conditions until the gas tanks were empty. The distance traveled was recorded for each car. Additive A was randomly assigned to 15 of the cars and additive B was randomly assigned to the other 15 cars. The gas tank of each car was filled with gasoline and the assigned additive. The cars were again run under the same driving conditions until the tanks were empty. The distance traveled was recorded and the difference in the distance with the additive minus the distance without the additive for each car was calculated. The following table summarizes the calculated differences. Note that negative values indicate less distance was traveled with the additive than without the additive. Additive Values Below Q 1 Q 1 Median Q 3 Values above Q 3 A 10, 8, 2 1 3 4 5, 7, 9 B 5, 3, 3 2 1 25 35, 37, 40 (a) On the grid below, display parallel boxplots (showing outliers if any) of the differences of the two additives. (b) Which additive, A or B, would you recommend if the goal is to increase gas mileage in the highest proportion of cars? Explain your choice. (c) Which additive, A or B, would you recommend if the goal is to have the highest mean increase in gas mileage? Explain your choice. 12