OBJECTIVES INTRODUCTION


M7 Chapter 3 Section 1

OBJECTIVES

- Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
- Describe data using measures of variation, such as the range, variance, and standard deviation.
- Identify the position of a data value in a data set using various measures of position, such as percentiles, deciles, and quartiles.
- Use the techniques of exploratory data analysis, including boxplots and five-number summaries, to discover various aspects of data.

INTRODUCTION

Statistical methods can be used to summarize data.

Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange. They measure the center of the distribution or the most typical case.

Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.

Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values. The most common measures of position are percentiles, deciles, and quartiles.

The measures of central tendency, variation, and position are part of what is called traditional statistics. This type of analysis is typically used to confirm conjectures about the data.

Another type of statistics is called exploratory data analysis. Its techniques include the boxplot and the five-number summary. They can be used to explore data to see what they show.

BASIC VOCABULARY

A statistic is a characteristic or measure obtained by using the data values from a sample. Roman letters are used to represent statistics (example: the symbol $\bar{X}$ represents the sample mean).

A parameter is a characteristic or measure obtained by using all the data values for a specific population. Greek letters are used to represent parameters (example: the Greek letter µ (mu) represents the population mean).

When the data in a data set are ordered, the data set is called a data array.

Chapter 3 Section 2

MEASURES OF CENTRAL TENDENCY

- Mean
- Median
- Mode
- Midrange

Mean (also known as Arithmetic Average)

The mean is the sum of the values divided by the total number of values. The symbol $\bar{X}$ represents the sample mean:

$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n}$

For a population, we use the Greek letter µ to denote the mean.

Rounding Rule for the mean: round to one more decimal place than occurs in the raw data.

Example: Find the mean of the sample: 20, 26, 40, 36, 23, 42, 35, 24, 30.

$\bar{X} = \frac{\sum X}{n} = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} = 30.7$

Mean for Grouped Data:

$\bar{X} = \frac{\sum f \cdot X_m}{n}$

Example 3-3

A Class        B Frequency (f)    C Midpoint (X_m)    D f · X_m
5.5-10.5       1                  8                   8
10.5-15.5      2                  13                  26
15.5-20.5      3                  18                  54
20.5-25.5      5                  23                  115
25.5-30.5      4                  28                  112
30.5-35.5      3                  33                  99
35.5-40.5      2                  38                  76
               n = 20                                 Σ f · X_m = 490

$\bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{490}{20} = 24.5$

Procedure:
Step 1: Make a table with 4 columns, and populate the first two columns with the class boundaries and frequency values.
Step 2: Find the midpoint ($X_m$) of each class and enter it in Column C.
Step 3: For each class, multiply the frequency ($f$) by its midpoint and place the product in Column D.
Step 4: Find the sum of Column D.
Step 5: Divide this sum by the sum of all the frequencies.
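The two mean calculations above can be checked with a short Python sketch. It is only an illustration: the function name `grouped_mean` and the `(lower, upper, frequency)` row layout are assumptions made for this example, not part of the notes.

```python
# Rows of Example 3-3 as (lower boundary, upper boundary, frequency).
classes = [
    (5.5, 10.5, 1),
    (10.5, 15.5, 2),
    (15.5, 20.5, 3),
    (20.5, 25.5, 5),
    (25.5, 30.5, 4),
    (30.5, 35.5, 3),
    (35.5, 40.5, 2),
]


def grouped_mean(rows):
    """Return sum(f * midpoint) / sum(f) for (lower, upper, f) rows."""
    total_f = sum(f for _, _, f in rows)                      # n
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in rows)   # sum of column D
    return total_fx / total_f


# Ungrouped sample from the first example.
sample = [20, 26, 40, 36, 23, 42, 35, 24, 30]
sample_mean = sum(sample) / len(sample)

print(round(sample_mean, 1))   # 30.7
print(grouped_mean(classes))   # 24.5
```

The midpoint is computed as the average of the two class boundaries, which reproduces Column C of the table.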

Median

The median is the halfway point in a data array. Its symbol is MD. (Data array implies that the data must be arranged in order.)

Example: Find MD of the values: 713, 300, 618, 595, 311, 401, 292
Step 1: Arrange the data in order: 292, 300, 311, 401, 595, 618, 713
Step 2: Select the middle point: 401
Hence MD = 401.

Example: Find MD of the values: 656, 684, 702, 764, 856, 1132, 1133, 1303
Step 1: Arrange the data in order: 656, 684, 702, 764, 856, 1132, 1133, 1303
Step 2: Select the middle point. The middle point falls between 764 and 856, so we calculate the mean of these two values:

$MD = \frac{764 + 856}{2} = 810$

Mode

The mode is the value that occurs most often in a data set. A data set can have more than one mode, or no mode at all.

Example: 8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11
It is helpful to arrange the data in order: 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 11, 11, 14, 14, 14
Observe that 8 occurs 5 times, more often than any other number. Thus, mode = 8.

Example: 5, 8, 9, 10, 1, 45, 78
Since each value occurs only once, there is no mode.

Example: 15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26
18 and 24 both occur three times, more often than any other value. Thus the modes for this data set are 18 and 24.

Grouped Data: The mode for grouped data is the modal class, that is, the class with the largest frequency. In Example 3-3 above, the modal class is 20.5-25.5.

The mode is the only measure of central tendency that can be used with categorical distributions.

Midrange

The midrange is a rough estimate of the middle. It is found by adding the lowest and highest values in the data set and dividing by 2. Its symbol is MR.

Example: 2, 3, 6, 8, 4, 1

$MR = \frac{1 + 8}{2} = 4.5$
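As a hedged illustration, the median, mode, and midrange examples above can be reproduced with Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The variable names and the `midrange` helper are assumptions introduced for this sketch.

```python
from statistics import median, multimode

data_odd = [713, 300, 618, 595, 311, 401, 292]                # odd count
data_even = [656, 684, 702, 764, 856, 1132, 1133, 1303]       # even count
data_two_modes = [15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]
data_midrange = [2, 3, 6, 8, 4, 1]


def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2


print(median(data_odd))           # 401
print(median(data_even))          # 810.0 (mean of 764 and 856)
print(multimode(data_two_modes))  # [18, 24]
print(midrange(data_midrange))    # 4.5
```

Note that `multimode` returns every value when each occurs exactly once, so the "no mode" case in the notes corresponds to a result that lists all the data values.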

The Weighted Mean

To find the weighted mean, multiply each value by its corresponding weight and divide the sum of the products by the sum of the weights:

$\bar{X} = \frac{w_1 X_1 + w_2 X_2 + w_3 X_3 + \cdots + w_n X_n}{w_1 + w_2 + w_3 + \cdots + w_n} = \frac{\sum wX}{\sum w}$

where $w_1, w_2, \ldots, w_n$ are the weights and $X_1, X_2, \ldots, X_n$ are the values.

Example: Grade Point Average

Course              Credits (w)    Grade (X)
Engl Comp           3              A (4 points)
Intr Psychology     3              C (2 points)
Bio I               4              B (3 points)
Phys Ed             2              D (1 point)

$GPA = \bar{X} = \frac{\sum wX}{\sum w} = \frac{3 \cdot 4 + 3 \cdot 2 + 4 \cdot 3 + 2 \cdot 1}{3 + 3 + 4 + 2} = \frac{32}{12} = 2.7$

Properties and Uses of Central Tendency

The Mean
1. One computes the mean by using all the values of the data.
2. The mean, in most cases, varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
3. The mean is used in computing other statistics, such as the variance.
4. The mean for the data set is unique, and not necessarily one of the data values.
5. The mean cannot be computed for an open-ended frequency distribution.
6. The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.

The Median
1. The median is used when one must find the center or middle value of a data set.
2. The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.
3. The median is used to find the average of an open-ended distribution.
4. The median is affected less than the mean by extremely high or extremely low values.

The Mode
1. The mode is used when the most typical case is desired.
2. The mode is the easiest average to compute.
3. The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.
4. The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.

The Midrange
1. The midrange is easy to compute.
2. The midrange gives the midpoint.
3. The midrange is affected by extremely high or low values in a data set.
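The weighted-mean formula from the grade point average example in this section can be sketched in Python as follows. The function name `weighted_mean` and the `(weight, value)` pair layout are assumptions made for illustration.

```python
# (credits, grade points) for the four courses in the GPA example.
courses = [(3, 4), (3, 2), (4, 3), (2, 1)]


def weighted_mean(pairs):
    """Return sum(w * x) / sum(w) for an iterable of (w, x) pairs."""
    total_weight = sum(w for w, _ in pairs)
    total_product = sum(w * x for w, x in pairs)
    return total_product / total_weight


gpa = weighted_mean(courses)
print(round(gpa, 1))  # 2.7
```

Setting every weight to 1 reduces the function to the ordinary arithmetic mean.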

Distribution Shapes

In a symmetrical distribution, the data values are evenly distributed on both sides of the mean.

[Figure: Symmetric distribution, with the mean, median, and mode marked]

In a positively skewed or right-skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution.

[Figure: Right-skewed distribution, with the mean, median, and mode marked]

When the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left, the distribution is said to be negatively skewed or left skewed.

[Figure: Left-skewed distribution, with the mean, median, and mode marked]

Chapter 3 Section 3

MEASURES OF VARIATION

Example 3-18: Same mean, very different data spread.

In order to measure the spread or variability of a data set, we use three measures:
- Range
- Variance
- Standard Deviation

Range

The range of a data set is the difference between its highest and lowest values:

R = Highest Value - Lowest Value

Variance and Standard Deviation

The measures of variance and standard deviation are used to determine the consistency of a variable.

Rounding Rule: Same rule as for the mean. The final answer should be rounded to one more decimal place than that of the original data.

Variance is the average of the squares of the distance that each value is from the mean. The symbol for the population variance is $\sigma^2$; the formula is:

$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

where
X = individual value
µ = population mean
N = population size

The standard deviation is the square root of the variance. The symbol for the population standard deviation is σ. The corresponding formula is:

$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$

Example 3-21: Consider the data set whose values are in column A.

A Values (X)    B X - µ             C (X - µ)^2
10              10 - 35 = -25       625
60              60 - 35 = +25       625
50              50 - 35 = +15       225
30              30 - 35 = -5        25
40              40 - 35 = +5        25
20              20 - 35 = -15       225
                                    Σ (X - µ)^2 = 1750

Step 1: Compute the mean: $\mu = \frac{10 + 60 + 50 + 30 + 40 + 20}{6} = \frac{210}{6} = 35$
Step 2: Subtract the mean (35) from each value; the result is in column B.
Step 3: Square each difference; the result is in column C.
Step 4: Sum the squares in column C.
Step 5: Divide this sum by the number of values (6) to get the variance: $\sigma^2 = \frac{1750}{6} = 291.7$
Step 6: Take the square root of this variance to obtain the standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{291.7} = 17.1$

Formulas for the Sample Variance and Standard Deviation

Variance of a sample: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$

Standard deviation of a sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

where
X = individual value
$\bar{X}$ = sample mean
n = sample size
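The six steps of the worked example, and the difference between dividing by N and by n - 1, can be sketched in Python. The function names are assumptions made for this illustration; the sample figures are shown only to contrast the two divisors and do not appear in the notes.

```python
from math import sqrt

values = [10, 60, 50, 30, 40, 20]


def population_variance(data):
    """Average squared distance from the mean (divide by N)."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)


def sample_variance(data):
    """Same sum of squares, but divided by n - 1."""
    xbar = sum(data) / len(data)
    return sum((x - xbar) ** 2 for x in data) / (len(data) - 1)


pop_var = population_variance(values)
print(round(pop_var, 1), round(sqrt(pop_var), 1))    # 291.7 17.1

samp_var = sample_variance(values)
print(round(samp_var, 1), round(sqrt(samp_var), 1))  # 350.0 18.7
```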

Computational Formulas for the sample variance and sample standard deviation:

Sample variance: $s^2 = \frac{\sum X^2 - [(\sum X)^2 / n]}{n - 1}$

Sample standard deviation: $s = \sqrt{\frac{\sum X^2 - [(\sum X)^2 / n]}{n - 1}}$

Example: 11.2, 11.9, 12.0, 12.8, 13.4, 14.3

Step 1: Find the sum of the values: $\sum X = 11.2 + 11.9 + 12.0 + 12.8 + 13.4 + 14.3 = 75.6$
Step 2: Square each value and find the sum: $\sum X^2 = 11.2^2 + 11.9^2 + 12.0^2 + 12.8^2 + 13.4^2 + 14.3^2 = 958.94$
Step 3: Substitute in the formula and solve:

$s^2 = \frac{\sum X^2 - [(\sum X)^2 / n]}{n - 1} = \frac{958.94 - (75.6)^2 / 6}{5} = 1.28$

$s = \sqrt{1.28} = 1.13$

Grouped Data: Variance and Standard Deviation

A Class        B Frequency (f)    C Midpoint (X_m)    D f · X_m    E f · X_m^2
5.5-10.5       1                  8                   8            64
10.5-15.5      2                  13                  26           338
15.5-20.5      3                  18                  54           972
20.5-25.5      5                  23                  115          2645
25.5-30.5      4                  28                  112          3136
30.5-35.5      3                  33                  99           3267
35.5-40.5      2                  38                  76           2888
               n = 20                                 Σ = 490      Σ = 13,310

Step 1: Find the midpoint of each class and place it in column C.
Step 2: Multiply the frequency of each class by its midpoint and place the product in column D.
Step 3: Multiply the frequency of each class by the square of its midpoint and place the result in column E.
Step 4: Find the sums of columns B, D, and E.
Step 5: Substitute in the computational formula and solve for $s^2$ to get the variance.
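A minimal Python sketch of the computational formula follows, applied to both the ungrouped example and the grouped-data table. The helper name `variance_from_sums` is an assumption for illustration; for grouped data the sums of columns D and E stand in for ΣX and ΣX².

```python
from math import sqrt


def variance_from_sums(sum_x, sum_x2, n):
    """Computational formula: (sum of squares - (sum)^2 / n) / (n - 1)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


# Ungrouped example.
data = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
s2 = variance_from_sums(sum(data), sum(x * x for x in data), len(data))
print(round(s2, 2), round(sqrt(s2), 2))   # 1.28 1.13

# Grouped example: (frequency, midpoint) pairs from columns B and C.
grouped = [(1, 8), (2, 13), (3, 18), (5, 23), (4, 28), (3, 33), (2, 38)]
n = sum(f for f, _ in grouped)                 # column B total: 20
sum_fx = sum(f * x for f, x in grouped)        # column D total: 490
sum_fx2 = sum(f * x * x for f, x in grouped)   # column E total: 13310
s2_grouped = variance_from_sums(sum_fx, sum_fx2, n)
print(round(s2_grouped, 1), round(sqrt(s2_grouped), 1))   # 68.7 8.3
```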

Substituting the grouped-data totals into the computational formula:

$s^2 = \frac{\sum f \cdot X_m^2 - [(\sum f \cdot X_m)^2 / n]}{n - 1} = \frac{13{,}310 - (490)^2 / 20}{20 - 1} = 68.7$

Then, take the square root to obtain the standard deviation:

$s = \sqrt{68.7} = 8.3$

Coefficient of Variation

The coefficient of variation is a statistic that allows one to compare standard deviations when the units of the data sets are different. Its symbol is CVar. The coefficient of variation is the standard deviation divided by the mean, expressed as a percentage.

For samples: $CVar = \frac{s}{\bar{X}} \cdot 100\%$

For populations: $CVar = \frac{\sigma}{\mu} \cdot 100\%$

Example: In a randomly selected sample of people it is found that their height and weight have the following characteristics:
Height: mean = 68.34 in, s = 3.02 in
Weight: mean = 172.55 lb, s = 26.33 lb
How do these two variables compare?

Height: $CVar = \frac{3.02}{68.34} \cdot 100\% = 4.42\%$

Weight: $CVar = \frac{26.33}{172.55} \cdot 100\% = 15.26\%$

We see that the heights have considerably less variation than the weights. This makes intuitive sense, since we routinely see that weights vary much more between people than heights do.

Rough Estimate of Standard Deviation

Range rule of thumb: $s \approx \frac{\text{range}}{4}$

Chebyshev's Theorem and Empirical Rule

Chebyshev's Theorem specifies the proportions of the spread of data in terms of the standard deviation.

Theorem: The proportion of values from a data set that will fall within k standard deviations of the mean will be at least $1 - 1/k^2$, where k is a number greater than 1 (k is not necessarily an integer).
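The height and weight comparison from the coefficient-of-variation example above can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is an assumption introduced for this illustration.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100


cvar_height = coefficient_of_variation(3.02, 68.34)     # inches
cvar_weight = coefficient_of_variation(26.33, 172.55)   # pounds

print(round(cvar_height, 2))  # 4.42
print(round(cvar_weight, 2))  # 15.26
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one variable is measured in inches and the other in pounds.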

[Figure: Chebyshev's Theorem for k = 2 and k = 3. At least 75% of the values lie between $\bar{X} - 2s$ and $\bar{X} + 2s$; at least 88.9% lie between $\bar{X} - 3s$ and $\bar{X} + 3s$.]

Example 1: What percent of data values will fall within 2 standard deviations of the mean?

Calculate $1 - 1/k^2$ for k = 2:

$1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$

Thus, if we have a distribution that has a mean of 70 and a standard deviation of 2, at least 75% of the data values will fall between 66 and 74 (70 - 2(2) = 66 and 70 + 2(2) = 74).

Example 2: A distribution has a mean of 50 and a standard deviation of 5. At least what percentage of the values will fall between 40 and 60?

Step 1: Subtract the mean from the largest value: 60 - 50 = 10
Step 2: Divide the difference by the standard deviation to get k: k = 10 / 5 = 2
Step 3: Use Chebyshev's Theorem to find the percentage: $1 - 1/k^2 = 1 - 1/2^2 = 1 - 1/4 = 0.75$, or 75%

Example 3: A sample of the labor costs per hour to assemble a certain product has a mean of $2.60 and a standard deviation of $0.15. Using Chebyshev's theorem, find the range in which at least 88.89% of the data will lie.

Step 1: Calculate k:
$1 - 1/k^2 = 0.8889$
$k^2 - 1 = 0.8889 k^2$
$k^2 (1 - 0.8889) = 1$
$k^2 = 1 / 0.1111 = 9$
$k = 3$
Step 2: 3 standard deviations are: 3(0.15) = 0.45
Step 3: The range is: Low: 2.60 - 0.45 = 2.15, High: 2.60 + 0.45 = 3.05
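The three Chebyshev examples above can be checked with a small Python sketch. The function names are assumptions made for illustration; `chebyshev_k` simply solves $1 - 1/k^2 = p$ for k.

```python
from math import sqrt


def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


def chebyshev_k(proportion):
    """Solve 1 - 1/k^2 = proportion for k."""
    return sqrt(1 / (1 - proportion))


def interval(mean, sd, k):
    """Endpoints mean - k*sd and mean + k*sd."""
    return mean - k * sd, mean + k * sd


print(chebyshev_min_proportion(2))            # 0.75
print(chebyshev_min_proportion((60 - 50) / 5))  # 0.75

k = chebyshev_k(0.8889)
low, high = interval(2.60, 0.15, round(k))
print(round(k, 2), round(low, 2), round(high, 2))  # 3.0 2.15 3.05
```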

Empirical Rule for Normal Distributions

The following apply to a bell-shaped (normal) distribution:
- Approximately 68% of the data values fall within one standard deviation of the mean.
- Approximately 95% of the data values fall within two standard deviations of the mean.
- Approximately 99.7% of the data values fall within three standard deviations of the mean.

[Figure: Empirical Rule for Normal Distributions, showing the regions $\bar{X} \pm 1s$ (68%), $\bar{X} \pm 2s$ (95%), and $\bar{X} \pm 3s$ (99.7%)]

Summary of Measures of Variation

- Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.
- The measures of variance and standard deviation are used to determine the consistency of a variable. For example, in the manufacture of fittings, such as nuts and bolts, the variation in the diameters must be small, or the parts will not fit together.
- The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution. For example, Chebyshev's theorem (explained above) shows that, for any distribution, at least 75% of the data values will fall within 2 standard deviations of the mean.
- The variance and standard deviation are used quite often in inferential statistics.
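As a hedged check of the empirical rule percentages, the proportion of a normal distribution within k standard deviations of the mean can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `proportion_within` is an assumption for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

The rounded textbook figures of 68%, 95%, and 99.7% agree with these values to the nearest percent or tenth of a percent.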

Chapter 3 Section 4

MEASURES OF POSITION

Measures of position are used to locate the relative position of a data value in the data set:
- z score
- Percentiles
- Quartiles
- Deciles

A z-score or standard score represents the number of standard deviations that a data value lies above or below the mean:

$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$

$z = \frac{X - \bar{X}}{s}$ for samples and $z = \frac{X - \mu}{\sigma}$ for populations

Example 1: Which of these exam grades has a better relative position?
(a) A grade of 56 on a test with a mean of 48 and a standard deviation of 5.
(b) A grade of 45 on a test with a mean of 35 and a standard deviation of 10.

a) z = (56 - 48)/5 = 8/5 = 1.6
b) z = (45 - 35)/10 = 10/10 = 1

The grade in (a) has the higher z-score, so it has the better relative position.

Example 2: Human body temperatures have a mean of 98.20 and a standard deviation of 0.62. An emergency room patient is found to have a temperature of 101. Convert 101 to a z-score.

z = (101 - 98.2)/0.62 = 2.8/0.62 = 4.52

When all data for a variable are transformed into z scores, the resulting distribution will have a mean of 0 and a standard deviation of 1. A z score, then, is actually the number of standard deviations each value is from the mean for a specific distribution. See the example below:

X       z        z·s      Mean + z·s
4       -0.74    -3.7     4
7       -0.14    -0.7     7
1       -1.34    -6.7     1
7       -0.14    -0.7     7
15      1.46     7.3      15
3       -0.94    -4.7     3
11      0.66     3.3      11
17      1.86     9.3      17
7       -0.14    -0.7     7
9       0.26     1.3      9
4       -0.74    -3.7     4

Mean    7.7      0
s       5        1
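The z-score examples and the claim that standardized data have mean 0 and standard deviation 1 can be sketched in Python as follows. The function name `z_score` is an assumption for illustration. The table above uses the rounded values 7.7 and 5, so its z column differs slightly from z-scores computed with the unrounded mean and standard deviation used here.

```python
from statistics import mean, stdev


def z_score(x, center, spread):
    """Number of standard deviations that x lies above or below the center."""
    return (x - center) / spread


print(round(z_score(56, 48, 5), 2))        # 1.6
print(round(z_score(45, 35, 10), 2))       # 1.0
print(round(z_score(101, 98.2, 0.62), 2))  # 4.52

# Standardize the whole data set from the table.
data = [4, 7, 1, 7, 15, 3, 11, 17, 7, 9, 4]
xbar, s = mean(data), stdev(data)          # sample mean and sample sd
z_values = [z_score(x, xbar, s) for x in data]

print(round(xbar, 1), round(s, 1))         # 7.7 5.0
print(round(abs(mean(z_values)), 6), round(stdev(z_values), 6))  # 0.0 1.0
```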

Percentiles

Percentiles divide the data set into 100 equal groups. They are symbolized by $P_1, P_2, \ldots, P_{99}$.

Percentile Formula (Percentile Rank)

The percentile corresponding to a given value x is computed by using the following formula:

$\text{Percentile} = \frac{(\text{number of values below } x) + 0.5}{\text{total number of values}} \cdot 100\%$

Example 1: Find the percentile rank for each test score in the data set: 5, 15, 12, 16, 20, 21

Step 1: Arrange the data in ascending order: 5, 12, 15, 16, 20, 21
Step 2: Use the formula (say for 15, which has 2 values below it):

$P = \frac{2 + 0.5}{6} \cdot 100\% = \frac{2.5}{6} \cdot 100\% = \frac{250}{6}\% = 41.7$ percentile

Formula for finding a value corresponding to a given percentile $P_m$

$P_m$ is the number that separates the bottom m% of the data from the top (100 - m)% of the data. This means that if your test score is at the 90th percentile, 90% of the people who took the test scored lower than you and only 10% scored higher than you.

Finding the location of $P_m$:

$c = \frac{m}{100} \cdot n$

1. If c is not a whole number, round up to the next whole number. Starting at the lowest value, count over to the number that corresponds to the rounded-up value.
2. If c is a whole number, use the value halfway between the c-th and (c+1)-th values when counting up from the lowest value.
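Both percentile procedures above can be sketched in Python. The function names `percentile_rank` and `value_at_percentile` are assumptions made for this illustration, and the second function is restricted to 0 < m < 100 so that the location c always points inside the data.

```python
import math


def percentile_rank(data, x):
    """(number of values below x + 0.5) / total number of values * 100."""
    below = sum(1 for v in data if v < x)
    return (below + 0.5) / len(data) * 100


def value_at_percentile(data, m):
    """Value at the m-th percentile using the location rule c = m * n / 100."""
    if not 0 < m < 100:
        raise ValueError("m must be strictly between 0 and 100")
    ordered = sorted(data)
    c = m * len(ordered) / 100
    if c != int(c):
        # Not a whole number: round up and take that position (1-based).
        return ordered[math.ceil(c) - 1]
    c = int(c)
    # Whole number: halfway between the c-th and (c+1)-th values.
    return (ordered[c - 1] + ordered[c]) / 2


scores = [5, 15, 12, 16, 20, 21]
print(round(percentile_rank(scores, 15), 1))  # 41.7
print(value_at_percentile(scores, 50))        # 15.5 (c = 3, halfway between 15 and 16)
print(value_at_percentile(scores, 25))        # 12   (c = 1.5, rounded up to position 2)
```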

Quartiles

Quartiles divide the distribution into four groups, separated by $Q_1$, $Q_2$, $Q_3$. $Q_1$ is the same as the 25th percentile; $Q_2$ is the same as the 50th percentile; $Q_3$ is the same as the 75th percentile.

Quartiles can also be found as follows:
1. Arrange the data in ascending order.
2. Find the median of the data values. This is the value for $Q_2$.
3. Find the median of the data values that fall below $Q_2$. This is the value for $Q_1$.
4. Find the median of the data values that fall above $Q_2$. This is the value for $Q_3$.

Example: Find $Q_1$, $Q_2$, and $Q_3$ for the data set: 15, 13, 6, 5, 12, 50, 22, 18

1. Arrange in ascending order: 5, 6, 12, 13, 15, 18, 22, 50
2. Find the median ($Q_2$): $Q_2 = MD = \frac{13 + 15}{2} = 14$
3. Find the median of the data values less than 14 (5, 6, 12, 13): $Q_1 = \frac{6 + 12}{2} = 9$
4. Find the median of the data values greater than 14 (15, 18, 22, 50): $Q_3 = \frac{18 + 22}{2} = 20$

Interquartile Range (IQR)

The interquartile range is used as a rough estimate of variability in exploratory data analysis, and it is also used to identify outliers.

$IQR = Q_3 - Q_1$

An outlier is an extremely high or low data value when compared with the rest of the data values. An outlier can strongly affect the mean and standard deviation of a variable.

One method to check for outliers:
Step 1: Arrange the data in order and find $Q_1$ and $Q_3$.
Step 2: Find $IQR = Q_3 - Q_1$.
Step 3: Multiply the IQR by 1.5.
Step 4: Subtract the value obtained in Step 3 from $Q_1$ and add the value to $Q_3$.
Step 5: Check the data set for any data value that is smaller than $Q_1 - 1.5 \cdot IQR$ or larger than $Q_3 + 1.5 \cdot IQR$.

Another method, applicable to bell-shaped distributions, is to compare the data value in question with $\bar{X} \pm 3s$ (3 standard deviations above or below the mean). If the value is larger than $\bar{X} + 3s$ or smaller than $\bar{X} - 3s$, it is considered an outlier.
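The quartile procedure and the 1.5 · IQR outlier check above can be sketched in Python. The function names are assumptions made for illustration, and the split into lower and upper halves assumes no data value is tied with the median.

```python
from statistics import median


def quartiles(data):
    """Q1, Q2, Q3 by the median-of-halves method described in the notes."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median
    upper = ordered[half + n % 2:]     # values above it (skips the middle value if n is odd)
    return median(lower), median(ordered), median(upper)


def iqr_outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(data)
    step = 1.5 * (q3 - q1)
    return [x for x in data if x < q1 - step or x > q3 + step]


data = [15, 13, 6, 5, 12, 50, 22, 18]
print(quartiles(data))     # (9.0, 14.0, 20.0)
print(iqr_outliers(data))  # [50]
```

Here IQR = 11, so the fences are 9 - 16.5 = -7.5 and 20 + 16.5 = 36.5, and only 50 falls outside them.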

Chapter 3 Section 5

EXPLORATORY DATA ANALYSIS (EDA)

Traditional vs Exploratory

The purpose of traditional analysis is to confirm various conjectures about the data.

The purpose of exploratory data analysis is to examine data in order to find out what information can be discovered. For example:
o Are there any gaps in the data?
o Can any patterns be discerned?

In EDA we use the following measures:
- Organize data: Stem and Leaf plot
- Measure of Central Tendency: Median
- Measure of Variation: Interquartile range ($Q_3 - Q_1$)
- Graphing: Boxplot

Boxplots are graphical representations of a five-number summary of a data set. The five specific values that make up a five-number summary are:
- The lowest value of the data set (minimum)
- $Q_1$ (or 25th percentile)
- The median (or 50th percentile)
- $Q_3$ (or 75th percentile)
- The highest value of the data set (maximum)

Example: Construct a boxplot for the data: 33, 38, 43, 30, 29, 40, 51, 27, 42, 23, 31

1. Arrange the data in order: 23, 27, 29, 30, 31, 33, 38, 40, 42, 43, 51
2. Find the median: MD = 33
3. Find $Q_1$ = 29
4. Find $Q_3$ = 42
5. Draw a scale for the data on the x-axis.
6. Locate the five-number summary on the scale.
7. Draw a box from $Q_1$ to $Q_3$, draw a vertical line through the median, and connect the box to the lowest and highest values.

[Figure: Boxplot on a scale from 20 to 55, with the lowest value at 23, the box from 29 to 42, the median line at 33, and the highest value at 51]
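Steps 1 to 4 of the boxplot example amount to computing the five-number summary, which can be sketched in Python as follows. The function name `five_number_summary` is an assumption for illustration, and the quartiles use the same median-of-halves method as the notes (assuming no values tied with the median).

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median
    upper = ordered[half + n % 2:]   # values above the median
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


data = [33, 38, 43, 30, 29, 40, 51, 27, 42, 23, 31]
print(five_number_summary(data))  # (23, 29, 33, 42, 51)
```

These five numbers are exactly the positions marked on the boxplot: the whisker ends, the two edges of the box, and the line inside it.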

If the boxplots of two or more data sets are graphed on the same axis, the distributions can be compared.

Example: Consider the following two sets of data:

A       B
40      130
45      180
90      250
180     260
220     270
240     290
310     310
420     340

[Figure: Boxplots of A and B on a common scale from 0 to 500. A: lowest 40, Q1 67.5, median 200, Q3 275, highest 420. B: lowest 130, Q1 215, median 265, Q3 300, highest 340.]

        A        B
SDEV    133.2    68.6
Q1      67.5     215
MD      200      265
Q3      275      300

The variation or spread of the distribution of data A is larger than that of data B (observe also their standard deviations).

SUMMARY

Traditional               Exploratory
Frequency Distribution    Stem and Leaf plot
Histogram                 Boxplot
Mean                      Median
Standard Deviation        Interquartile Range

Some basic ways to summarize data include measures of central tendency, measures of variation or dispersion, and measures of position.

The three most commonly used measures of central tendency are the mean, median, and mode. The midrange is also used to represent an average.

The three most commonly used measures of variation are the range, variance, and standard deviation.

The most common measures of position are percentiles, quartiles, and deciles.

Data values are distributed according to Chebyshev's theorem and, in special cases, the empirical rule.

The coefficient of variation is used to describe the standard deviation in relationship to the mean.

These methods are commonly called traditional statistics. Other methods, such as the stem and leaf plot, boxplot, and five-number summary, are part of exploratory data analysis; they are used to examine data to see what they reveal.