Chapter 6. The Standard Deviation as a Ruler and the Normal Model 1 /67

Homework
Read Chapter 6.
Complete Reading Notes.
Do p. 129: 1, 3, 5, 7, 15, 17, 23, 27, 29, 31, 37, 39, 43

Objective
Students calculate a z-score and demonstrate the effect of adding to and scaling a data distribution.

The Standard Deviation as a Unit of Measure
Remember when we compared the shoe sizes for adult men and women? We know that women tend to have smaller feet, and male and female shoe sizes do not represent the same size foot. A man's size 9 is not the same as a woman's size 9. But is it possible to compare a woman's size relative to other women with a similar measure for a man, one comparing his size with the sizes of other men? Of course the answer is yes. I would not have asked otherwise.

The Standard Deviation as a Unit of Measure
The unit of measure we can use for comparing differing data values is the standard deviation. The standard deviation tells us how the data values vary in relation to the mean of the distribution, so it is ideal for comparing one individual to a group and for comparing groups with differing means. I have repeatedly said that statistics is the study of variation. The most commonly used measure of variation is the standard deviation, and it forms the basis for nearly all the statistics we will do this year.

The Standard Score
A data value is converted to a standard score to provide a means for comparison. We convert individual data values, relative to their standard deviation, using the following formula:
$$z = \frac{\text{observed} - \text{expected}}{\text{standard deviation}} = \frac{\text{score} - \text{mean}}{\text{standard deviation}} = \frac{y - \bar{y}}{s} \quad \text{or} \quad \frac{y - \mu}{\sigma}$$
The resulting values (standardized values), denoted as z, are called z-scores.

Standard Score
Remember!
$$z = \frac{\text{observed} - \text{expected}}{\text{standard deviation}}$$

The Standard Score
$$z = \frac{y - \bar{y}}{s} \quad \text{or} \quad \frac{y - \mu}{\sigma}$$
Remember, this converts a raw score into a standardized score. Raw scores above the mean will have a z-score > 0, and raw scores below the mean will have a z-score < 0.

The Standard Score
Standardized values (z-scores) have no units. A z-score is the distance of an individual data value from the mean, measured in standard deviations. In other words, a z-score is the number of standard deviations away from the mean for a data value. A negative z-score tells us that the data value is below the mean. A positive z-score tells us that the data value is above the mean.

Standard Score
A z-score indicates the number of standard deviations an observation falls from the mean.

Why Do We Bother?
Standardized values have been converted from their original units to the standard statistical unit of standard deviations from the mean. Thus, we can compare values that are measured on different scales, with different units, or from different populations. Now we can compare the shoe sizes for men and women. The z-scores give us a measure of how a data value compares to other data values of like kind: a measure of how different a given data value is from an expected value.

Benefits of Standardizing
The 2015 SAT test had a mean score of 511 for math. The standard deviation for the math scores was 120. An SAT score of 550 would have a z-score of:
$$z = \frac{550 - 511}{120} = 0.325$$
That tells us the score of 550 was about 1/3 of a standard deviation above the mean. How would that score compare to an ACT score of 23 on the math portion?

Benefits of Standardizing
The ACT results for that same year had a mean of 20.8 and a standard deviation of 5.2. The z-score for a 23 on the ACT would be
$$z = \frac{23 - 20.8}{5.2} = 0.4231$$
A z-score of 0.4231 tells us that a 23 is about 0.4 standard deviations above the mean ACT score. Thus a score of 23 on the ACT is better than a 550 on the SAT, relative to the mean for each test.
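The SAT versus ACT comparison can be checked in a few lines of Python. This is only a sketch: Python is not part of these slides (we use the TI-84), and the function name `z_score` is made up for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / sd

# Means and standard deviations as quoted on the slides
z_sat = z_score(550, 511, 120)   # SAT math score of 550
z_act = z_score(23, 20.8, 5.2)   # ACT math score of 23

print(f"SAT z = {z_sat:.4f}")
print(f"ACT z = {z_act:.4f}")
print("ACT is relatively better" if z_act > z_sat else "SAT is relatively better")
```

Because both scores are now in the same unit (standard deviations from their own mean), the larger z-score is the relatively better result.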

Shifting Data
Shifting data (horizontal translation): Adding (or subtracting) a constant to every value in a distribution adds (or subtracts) the same constant to (from) the mean, median, and other measures of position. This shifts the distribution horizontally. Measures of spread (variance, IQR, range), however, are not affected by adding a constant to every value in the distribution.

Shifting Data
The United States of Amurica has been half-heartedly trying to implement the metric system of measurements. Students were asked to estimate the length of their classroom in meters. The following data show the results.
8 9 10 10 10 10 10 10 11 11 11 11 12 12 13 13 13 14 14 14 15 15 15 15 15 15 15 15 16 16 16 17 17 17 17 18 18 20 22 25 27 35 38 40
[Figure: dot plot of Guess, axis from 0 to 40]
Another plot shows the amount of error in the guesses. It was found by subtracting the correct value (13) from each estimate.
[Figure: dot plot of Error, axis from -5 to 30]
Compare the resulting dot plots.

Rescaling Data
Rescaling data (stretch or compress): When we divide or multiply all the data values in a distribution by any constant value, all measures of position (such as the mean, median, and percentiles) and measures of spread (such as the range, IQR, and standard deviation) are divided or multiplied by that same constant value.
[Figure: dot plot of Error, axis from -5 to 30]

Rescaling Data (cont.)
The errors in meters can also be measured in feet. 1 meter = 3.28084 ft, so we multiply each value by 3.28084. Now compare the plots.
[Figure: dot plot of Error, axis from -5 to 30]
[Figure: dot plot of error_ft, axis from -20 to 100]

TI-84
Enter the data into L1. Find the mean, standard deviation, and 5-number summary.
8 9 10 10 10 10 10 10 11 11 11 11 12 12 13 13 13 14 14 14 15 15 15 15 15 15 15 15 16 16 16 17 17 17 17 18 18 20 22 25 27 35 38 40
Create L2: subtract 13 from every entry in L1. This gives the errors. Find the mean, standard deviation, and 5-number summary.
Create L3: multiply every entry in L1 by 3.28084 (about 3.28). Now we have feet estimates. Find the mean, standard deviation, and 5-number summary.
Create L4: multiply every entry in L2 by 3.28084. Errors in feet. Find the mean, standard deviation, and 5-number summary.

TI-84
These are the results.
            Guess      Error      Feet        error_ft
            (x)        (x - 13)   (x * 3.28)  ((x - 13) * 3.28)
mean        16.0227    3.02273    52.568      9.91708
count       44         44         44          44
std dev     7.14465    7.14465    23.4404     23.4404
std err     1.0771     1.0771     3.53378     3.53378
missing     0          0          0           0
min         8          -5         26.2467     -16.4042
Q1          11         -2         36.0892     -6.56168
Med         15         2          49.2126     6.56168
Q3          17         4          55.7743     13.1234
max         40         27         131.234     88.5827
Note the relationships between the values. Adding a constant only affects measures of position like mean and median. Multiplying (scaling) affects all measures.
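If you would like to see the same relationships away from the calculator, here is a rough Python sketch using the classroom guesses. The variable names are my own, and `stdev` is the sample standard deviation, which is what the table above reports.

```python
from statistics import mean, median, stdev

# Classroom length guesses in meters, copied from the slide
guess = [8, 9, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 13, 13, 13,
         14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 17, 17, 17,
         17, 18, 18, 20, 22, 25, 27, 35, 38, 40]

M_TO_FT = 3.28084                            # feet per meter

error = [g - 13 for g in guess]              # shift: subtract the true length
feet = [g * M_TO_FT for g in guess]          # rescale: meters to feet
error_ft = [e * M_TO_FT for e in error]      # shift, then rescale

for name, data in [("Guess", guess), ("Error", error),
                   ("Feet", feet), ("error_ft", error_ft)]:
    print(f"{name:9s} mean={mean(data):9.4f} "
          f"median={median(data):9.4f} sd={stdev(data):8.4f}")
```

Shifting moves the mean and median by 13 but leaves the standard deviation alone; rescaling multiplies all three by 3.28084.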

Back to z-scores
Standardizing data into z-scores shifts the data by subtracting the mean and rescales the values by dividing by their standard deviation.
Standardizing into z-scores does not change the shape of the distribution.
What is the z-score for the mean of a distribution? Standardizing into z-scores changes the center by making the mean = 0.
What is the z-score for the score one standard deviation above the mean? Standardizing into z-scores changes the spread by making the standard deviation = 1.

Mean and Standard Deviation
To find the mean of the z-score distribution, recall that we have subtracted the mean (a constant) from every score and divided by the standard deviation (another constant). Both transformations affect the mean. The result is a mean of the z distribution of 0:
$$\bar{z} = \frac{\bar{x} - \bar{x}}{s} = 0$$
The standard deviation is affected only by the division (that is, multiplication by 1/s):
$$s_z = \frac{s}{s} = 1$$
So the standard deviation of the z distribution is 1.
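A quick numerical check of this result, sketched in Python with the classroom guess data from earlier in the chapter (the variable names are assumptions of mine, not anything from the text):

```python
from statistics import mean, stdev

# Classroom length guesses in meters, from the shifting/rescaling slides
guess = [8, 9, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 13, 13, 13,
         14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 17, 17, 17,
         17, 18, 18, 20, 22, 25, 27, 35, 38, 40]

m, s = mean(guess), stdev(guess)

# Standardize: shift by the mean, then rescale by the standard deviation
z = [(g - m) / s for g in guess]

z_mean = mean(z)     # should be 0, up to floating-point rounding
z_sd = stdev(z)      # should be 1, up to floating-point rounding
print(f"mean of z = {z_mean:.6f}, sd of z = {z_sd:.6f}")
```

The guesses are clearly skewed to the right, and so are their z-scores: standardizing changes center and spread, not shape.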

When is Unusual Really Unusual?
A z-score gives us an indication of how unusual a value is because it tells us how far the raw score is from the mean. A data value that sits right at the mean has a z-score equal to 0. A z-score of 1 means the original data value is 1 standard deviation above the mean. A z-score of -1 means the original data value is 1 standard deviation below the mean.

When is Unusual Really Unusual?
How far from 0 does a z-score have to be to be interesting or unusual? There is no universal standard, but the more extreme a z-score (negative or positive), the more unusual it is. We will have a more definitive answer soon. Remember that a negative z-score tells us that the data value is below the mean, while a positive z-score tells us that the data value is above the mean.

Are We Normal?
There are infinitely many data distributions. Change the mean or standard deviation of the data and you have a new distribution. For every distribution there is a corresponding z distribution. Remember, the shape does not change. There is a family of distributions (models) of data values that are extremely useful. The ideal of these models is called the Normal model (commonly called bell-shaped curves). Normal models are appropriate for distributions whose shapes are unimodal and sufficiently symmetric. These distributions allow us to determine how extreme a z-score (and thus, the raw score) is.

Are We Normal?
There is a Normal model (of raw data values) for every possible combination of mean and standard deviation. We write N(µ,σ) to represent a Normal model with a mean of µ and a standard deviation of σ. We use Greek letters for population parameters. The mean, µ, and standard deviation, σ, are not numerical summaries of the data. They are part of the model. They do not come from sample data. They are numbers that we choose (or know) to specify the model. Our statistics, $\bar{x}$ and $s$, are our best estimates of those values.

Are We Normal?
Summaries of data taken from samples, like the sample mean and sample standard deviation, are written with Latin letters. These sample summaries of data are called sample statistics. When we standardize Normal data, the standardized value is a z-score, and:
$$z = \frac{y - \mu}{\sigma}$$
You should note that only some data distributions are sufficiently unimodal and symmetric to be modeled by the Normal distribution. Normal models approximate the shape of those distributions and occur often when there are many factors influencing the values of the data.

Are We Normal?
Once we have standardized, we need only one model: the N(0,1) model, called the Standard Normal model (or the Standard Normal distribution). We cannot use a Normal model for just any data set. The data must be unimodal and sufficiently symmetric for us to use a Normal model. Standardizing does not change the shape of the distribution.

Are We Normal?
When we use the Normal model, we are assuming the population distribution is Normal (unimodal and sufficiently symmetric). We cannot truly know that the assumption of normality is valid, so we check the following condition:
Nearly Normal Condition: The shape of the data's distribution is sufficiently unimodal and symmetric.
This condition can be checked with a histogram or a Normal Probability Plot (NPP, to be explained later).

The 68-95-99.7 Rule
When our data are unimodal and sufficiently symmetric, a Normal model gives us an idea of how extreme a given data value is by telling us how likely we are to find a data value that far from the mean. To do this we use a Normal curve, under which the total area is set to 1. The 68-95-99.7 rule (also known as the Empirical Rule) provides an estimate of these areas at 1, 2, and 3 standard deviations from the mean. These are approximations of values that come from evaluations of the standardized Normal curve.

The Standardized Normal Curve
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
[Figure: Normal curve, horizontal axis in z-scores from -3 to 3]
We can, and will (repeatedly), find these numbers accurately, but we can use this simplified rule about the Normal model to give us the ability to quickly estimate.

The 68-95-99.7 Rule
It turns out that in a Normal model:
about 68% of the values fall within one standard deviation of the mean;
about 95% of the values fall within two standard deviations of the mean; and
about 99.7% of the values fall within three standard deviations of the mean.
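To see where the three percentages come from, we can ask Python for the exact areas. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the name `within` is just for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # the Standard Normal model N(0,1)

# Area within k standard deviations of the mean, for k = 1, 2, 3
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD of the mean: {area:.4f}")
```

The rule rounds these areas to 68%, 95%, and 99.7%, which is why it is only an estimate.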

The 68-95-99.7 Rule
The following shows what the 68-95-99.7 Rule tells us:
[Figure: Normal curve with the central 68%, 95%, and 99.7% regions marked; horizontal axis in z-scores from -3 to 3]

The First Three Rules for Working with Normal Models
Draw a pichur.
Draw a pichur.
Draw a pichur.
And, when we have data, one necessary picture is a histogram to check that the distribution is sufficiently unimodal and symmetric to ensure we can use the Normal model to model the distribution. Additionally, I (and AP) will expect to see a Normal model, appropriately shaded, on nearly everything you do from this point on.

Describing the Appropriate Model
When asked to describe the model used in analyzing data, provide the properly labeled graphic with the 68-95-99.7 regions shown.
[Figure: Normal curve for IQ Scores with the 68%, 95%, and 99.7% regions marked; axis labeled 55, 70, 85, 100, 115, 130, 145]

Finding Normal Percentiles by Table
When a data value doesn't fall exactly 1, 2, or 3 standard deviations from the mean, we can look it up in a table of Normal percentiles. (Actually, we will use the calculator.) Table Z in Appendix D provides us with Normal percentiles. I can provide you with a z-table if you prefer that to the calculator.

Finding Normal Percentiles by Table
There are two forms of the z-table. On the tables I pass out you will find two sides. These are the two forms of the z-table.
One table gives the area to the left of the given z value.
One table gives the area between the mean and the given z value.

Finding Normal Percentiles by Table
This table shows the area under the curve between the z-score and the mean.
[Figure: Normal curve shaded between z = 0 and z = 1.35; area = .4115 (41.15%)]

Finding Normal Percentiles by Table
This table shows the area under the curve to the left of the z-score.
[Figure: Normal curve shaded to the left of z = 1.35; area = .9115 (91.15%)]

Finding Normal Percentiles by Table
To find the area between the mean and 1.27 standard deviations above the mean, we locate z = 1.2 in the first column, then scroll across to the column under .07 (1.2 + .07 = 1.27).
[Figure: Normal curve shaded between z = 0 and z = 1.27; area = 39.80%]

Finding Normal Percentiles by Table
Thus the area under the curve between the mean and 1.27 standard deviations above the mean is .3980. In other words, 39.80% of the total area falls in the region between z = 0 and z = 1.27.
[Figure: Normal curve shaded between z = 0 and z = 1.27; area = 39.80%]
Due to the symmetry of the Normal model, 39.80% of the total area also falls in the region between z = -1.27 and z = 0.
[Figure: Normal curve shaded between z = -1.27 and z = 0; area = 39.80%]

Finding Normal Percentiles by Table
To find the percentage of scores between z = .44 and z = 1.50, find the area to the left of each z-score, p(z < .44) = .6700 and p(z < 1.50) = .9332, and subtract:
$$p(.44 < z < 1.50) = .9332 - .6700 = .2632 \text{ or } 26.32\%$$
[Figure: Normal curve shaded between z = .44 and z = 1.50; area = 26.32%; axis in z-scores from -3 to 3]

Finding Normal Percentiles by Table
To find the percentage of scores between z = -.43 and z = .78, find the area to the left of z = -.43, p(z < -.43) = .3336, and the area to the left of z = .78, p(z < .78) = .7823, then subtract the two values:
$$p(-.43 < z < .78) = .7823 - .3336 = .4487 \text{ or } 44.87\%$$
[Figure: Normal curve shaded between z = -.43 and z = .78; area = 44.87%; axis in z-scores from -3 to 3]

Tails
To find the area in the tails above a positive z-score or below a negative z-score, simply find the corresponding area for the z-score and subtract from 1.0.
Percent of scores below z = 1.50: p(z < 1.50) = .9332
Percent of scores above z = 1.50: p(z > 1.50) = 1.0 - .9332 = .0668 = 6.68%
[Figure: Normal curve with the tail above z = 1.50 shaded; area = 6.68%; axis in z-scores from -3 to 3]
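The table lookups on the last several slides can be reproduced with a short Python sketch. The two helper names, `phi` and `mean_to_z`, are my own labels for the two table forms (area to the left, and area between the mean and z); they are not from the text.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard Normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def mean_to_z(z):
    """Area between the mean (z = 0) and z, as in the second table form."""
    return abs(phi(z) - 0.5)

area_0_to_127 = mean_to_z(1.27)          # between the mean and z = 1.27
area_044_150 = phi(1.50) - phi(0.44)     # between z = .44 and z = 1.50
area_m043_078 = phi(0.78) - phi(-0.43)   # between z = -.43 and z = .78
upper_tail = 1 - phi(1.50)               # above z = 1.50

for label, area in [("0 to 1.27", area_0_to_127),
                    (".44 to 1.50", area_044_150),
                    ("-.43 to .78", area_m043_078),
                    ("above 1.50", upper_tail)]:
    print(f"{label:12s} {area:.4f}")
```

Every "between" area is one left-hand area minus another, and an upper tail is 1 minus the left-hand area.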

Finding Normal Percentiles
Now that we have spent all that time, let us get real. There is really no reason to use a table because we have a calculator that will do the job much more accurately and much more simply.

TI-84
To find the probability (area) between two x values:
2nd - Distr - 2:normalcdf(x min, x max, µ, σ) (normal cumulative distribution function)
(Keys: 2nd, VARS [distr], 2:normalcdf. The entry screen asks for lower:, upper:, μ:, and σ:, then Paste.)
If you wish to find the probability between two x values and see the graph:
2nd - Distr - Draw - 1:ShadeNorm(x min, x max, µ, σ)
Be sure to set your window: xmin and xmax (depending on data values), ymin = 0, ymax = .4

TI-84
Finding probabilities from a standardized Normal distribution on the calculator is exactly the same as with a non-standardized Normal distribution. Simply use µ = 0 and σ = 1.
2nd - Distr - 2:normalcdf(z min, z max, 0, 1)
(Entry screen: lower:, upper:, μ: 0, σ: 1, Paste.)

TI-84
To find the area under the curve below a z-score of 1.45:
2nd - Distr - 2:normalcdf(-10^99, 1.45, 0, 1) = .9264706996
(Entry screen: lower: -10^99, upper: 1.45, μ: 0, σ: 1, Paste.)
To find the area under the curve above a z-score of 1.45:
2nd - Distr - 2:normalcdf(1.45, 10^99, 0, 1) = .0735293004
(Entry screen: lower: 1.45, upper: 10^99, μ: 0, σ: 1, Paste.)

TI-84
To find the area under the curve between a z-score of -1.45 and +1.45:
2nd - Distr - 2:normalcdf(-1.45, 1.45, 0, 1) = .8529413993
(Entry screen: lower: -1.45, upper: 1.45, μ: 0, σ: 1, Paste.)
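For anyone working without the calculator, here is a rough Python stand-in for normalcdf. It is only a sketch: the function below is something I wrote to mimic the calculator's argument order, built on `statistics.NormalDist` (Python 3.8 or later), and it is not a TI or textbook function.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) curve between lower and upper."""
    model = NormalDist(mu, sigma)
    return model.cdf(upper) - model.cdf(lower)

# Same three areas as the TI-84 slides; +/-1e99 plays the role of +/-10^99
below = normalcdf(-1e99, 1.45)      # area below z = 1.45
above = normalcdf(1.45, 1e99)       # area above z = 1.45
between = normalcdf(-1.45, 1.45)    # area between z = -1.45 and z = 1.45

print(f"below   = {below:.6f}")
print(f"above   = {above:.6f}")
print(f"between = {between:.6f}")
```

As on the calculator, leaving µ and σ at 0 and 1 gives Standard Normal areas, and any other pair gives areas for raw scores.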

From Percentiles to Scores: z in Reverse
Sometimes we start with area (probability) and need to find the corresponding z-score or the original observation value.
Example: What z-score represents the first quartile in a Normal model?
[Figure: Normal curve with the lowest 25% shaded and the cutoff z-score marked "?"]

From Percentiles to Scores: z in Reverse via Table
Look in Table Z for an area of 0.2500. The exact area is not there, but 0.25 is between 0.2486 (z = 0.67) and 0.2517 (z = 0.68) in the mean-to-z form of the table. That puts z at about 0.675, so the first quartile is about 0.675 standard deviations below the mean (z ≈ -0.675).

Probability to z-score with a table
To find a z-score corresponding to an area, or to know what z-score lies above or below a given percentage, simply work backwards. In the table, first find the percentage (probability) in the body of the table. Then determine the z-score by combining the row and column headings.

Probability to z-score
To find the z-score below which falls 72% of the scores (72nd percentile), locate .72 in the table. The closest entry is .7190, in the row for 0.5 and the column for .08, so z = 0.5 + 0.08 = .58.
72% of scores fall below a z-score of .58.

Probability to z-score on TI-84
Back to reality. To find the z-score when you know the percentile:
2nd - Distr - 3:invNorm(Area, µ, σ)
(Keys: 2nd, VARS [distr], 3:invNorm. The entry screen asks for area:, μ:, and σ:, then Paste.)
To find the z-score below which falls 72% of the scores (72nd percentile):
2nd - Distr - 3:invNorm(.72, 0, 1) = .582841502
(Entry screen: area: .72, μ: 0, σ: 1, Paste.)
72% of the distribution falls below a z-score of about .58.

Probability to z-score on TI-84
Remember, the TI-84 requires the area (probability) BELOW the desired score. If you are given the area ABOVE, you must subtract from 1 to find the area below.
To find the z-score above which falls 25% of the scores (3rd quartile):
2nd - Distr - 3:invNorm(.75, 0, 1) = .6744897495
(Entry screen: area: .75, μ: 0, σ: 1, Paste.)
25% of the Normal distribution falls above a z-score of about .67.
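The same reverse lookup can be sketched in Python. The helper `inv_norm` is a name I made up to mirror the calculator's invNorm; underneath it relies on `NormalDist.inv_cdf` from the standard library (Python 3.8 or later), which, like the calculator, takes the area BELOW the score.

```python
from statistics import NormalDist

def inv_norm(area_below, mu=0.0, sigma=1.0):
    """Score in N(mu, sigma) that has `area_below` of the distribution below it."""
    return NormalDist(mu, sigma).inv_cdf(area_below)

z_72 = inv_norm(0.72)        # 72nd percentile
z_q3 = inv_norm(1 - 0.25)    # 25% ABOVE means 75% below (3rd quartile)
z_q1 = inv_norm(0.25)        # 25% below (1st quartile)

print(f"72nd percentile: z = {z_72:.4f}")
print(f"3rd quartile:    z = {z_q3:.4f}")
print(f"1st quartile:    z = {z_q1:.4f}")
```

By symmetry the first and third quartiles are the same distance from the mean, on opposite sides.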

When Are You Normal Enough?
When you have your own data, you must verify whether a Normal model is reasonable and appropriate. AP and I will require you to verify the appropriateness of the methodology by ensuring the data distribution is sufficiently unimodal and symmetric. Looking at a histogram of the data (sample set) is probably the best way to check that the underlying distribution is sufficiently unimodal and symmetric.

What is Normal Enough?
Another graphical display that can help you determine whether a Normal model is appropriate is the Normal Probability Plot (NPP). If the distribution of the data is roughly Normal, the Normal probability plot approximates a diagonal straight line. Significant deviations from a straight line indicate that the data are not Normally distributed.

Normal Probability Plot
To create a Normal Probability Plot (NPP) (which you will not):
1. Order the data in ascending order.
2. Calculate the percentile (cumulative percentage) for each data value.
3. Find the z-score from a Normal distribution matching the calculated percentile. (Note: this is NOT the z-score calculated from the observation, but the z-score determined by the position of the observation in the distribution.)
4. Plot the points (z_x, z_p), where z_x is the z-score calculated from the observation value and z_p is the z-score for the corresponding percentile.
5. Check for linearity. A good fit to a straight line suggests a Normal distribution.
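The steps above can be sketched in Python. Treat this as one possible version, not the calculator's exact method: the percentile formula (i - 0.5)/n is an assumption on my part (software packages differ on this detail), and `npp_points` is a made-up name. The data are the classroom length guesses from earlier in the chapter.

```python
from statistics import NormalDist, mean, stdev

def npp_points(data):
    """Return (z_x, z_p) pairs for a Normal probability plot."""
    xs = sorted(data)                        # step 1: ascending order
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    std_normal = NormalDist()                # N(0,1)
    points = []
    for i, x in enumerate(xs, start=1):
        percentile = (i - 0.5) / n           # step 2 (assumed plotting position)
        z_p = std_normal.inv_cdf(percentile) # step 3: z-score from position
        z_x = (x - m) / s                    # z-score from the observation
        points.append((z_x, z_p))            # step 4: point to plot
    return points

guess = [8, 9, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 13, 13, 13,
         14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 17, 17, 17,
         17, 18, 18, 20, 22, 25, 27, 35, 38, 40]

points = npp_points(guess)
for z_x, z_p in points[:3] + points[-3:]:    # a few points from each end
    print(f"z_x = {z_x:6.2f}   z_p = {z_p:6.2f}")
```

For step 5, compare the two columns: if the data were nearly Normal, z_x and z_p would be close to each other all the way along, and the plotted points would hug a straight line.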

TI-84
A better way to create an NPP is to let the calculator do it.
1. Statplot
2. Choose the last icon
3. Enter the list for your data
4. Select x or y for your data axis (it really does not matter)
5. Choose your marks (bigger is better for old eyes)
6. Zoom 9: ZoomStat to see your NPP
If the points appear to be linear, you got normal.

Testing for Normalcy
Nearly Normal data have a histogram and a Normal probability plot that look somewhat like this example of 100 values randomly sampled from a Normal distribution.
[Figure: histogram and Normal probability plot]

Testing for Normalcy
Data Sampled From a Left Skewed Distribution
[Figure: Normal probability plot]

Testing for Normalcy
Data Sampled From a Right Skewed Distribution
[Figure: Normal probability plot]

Hot Beverage Dispenser
Suppose a vending machine dispenses hot beverages into 8-ounce cups. The machine is set to dispense with a standard deviation of 0.2 ounces ($\sigma_X = 0.2$). Providing too much beverage risks burning the customer. Providing too little beverage risks frustrating the customer.
1. What mean setting do you recommend for the machine?
If we want to keep the vast majority of customers happy, we should ensure that 95% of customers get at least 8 ounces.

Hot Beverage Dispenser
If we want to keep the vast majority of customers happy, we should ensure that 95% of customers get at least 8 ounces ($\sigma_X = 0.2$).
We need to find the mean such that 95% of our data falls above 8 oz. To find the mean, we must first find the z-score above which 95% of the values fall.
invNorm(.05, 0, 1) = -1.6449
Why .05? (Remember, the calculator needs the area BELOW the score, and 5% falls below.)
[Figure: Normal curve with 5% below 8 oz and 95% above; the mean is marked "?"]
To find the mean, we must convert the z-score:
$$z = \frac{x - \mu_X}{\sigma_X} \qquad -1.6449 = \frac{8 - \mu_X}{0.2} \qquad \mu_X = 8.3290$$
To ensure 95% of customers get at least 8 oz, the machine should be set with a mean of 8.3290 oz.
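The mean setting can be double-checked with a short Python sketch. The variable names are mine, and `statistics.NormalDist` (Python 3.8 or later) stands in for the calculator's invNorm and normalcdf.

```python
from statistics import NormalDist

sigma = 0.2      # machine's standard deviation, in ounces
target = 8.0     # minimum fill we want 95% of customers to get

# z-score with 5% of the distribution below it (so 95% lies above)
z = NormalDist().inv_cdf(0.05)

# Solve z = (target - mu) / sigma for mu
mu = target - z * sigma

# Check: fraction of cups at or above 8 oz with this mean setting
share_at_least_8 = 1 - NormalDist(mu, sigma).cdf(target)

print(f"z = {z:.4f}, mean setting = {mu:.4f} oz")
print(f"share of cups with at least 8 oz = {share_at_least_8:.4f}")
```

The check at the end is the "draw a pichur" step in numbers: with the new mean, about 95% of the area sits above 8 oz.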

Hot Beverage Dispenser
The company is not happy with our recommendation of 8.329 ounces. That would cost too much. We must come up with another recommendation to ensure 95% of customers get at least the full 8 oz. What do you suggest?
Of course we could lower the mean volume to 8.1 oz, but that would leave a lot of customers unhappy. What percentage of customers would get cheated?
[Figure: Normal curve centered at 8.1 oz (z = 0) with the area below 8 oz shaded and marked "?%"]
normalcdf(-10^99, 8, 8.1, .2) = .3085
OR
$$z = \frac{8 - 8.1}{0.2} = -.5000$$
normalcdf(-9, -.5, 0, 1) = .3085
30.85% of the vendor's customers would be shorted.

Hot Beverage Dispenser
Naturally the company is not pleased. The 8.1 oz is acceptable to the company, but the percent below 8 is much too high. You just cannot seem to make some people happy. So what else might we do to make the company and customers happy?
We could recommend the company produce vending machines that are more consistent in distribution. That means reducing the variability in the data.
[Figure: Normal curve centered at 8.1 oz (z = 0) with 5% below 8 oz (z = -1.6449) and 95% above]
What would be an appropriate standard deviation to ensure 95% of customers get at least 8.0 ounces with a mean of 8.1 oz?
invNorm(.05, 0, 1) = -1.6449
$$-1.6449 = \frac{8 - 8.1}{\sigma} \qquad \sigma = 0.0608$$
The standard deviation would need to be .0608 oz.
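Both follow-up questions (how many customers get shorted at a mean of 8.1 oz, and what standard deviation would fix it) can be checked the same way. Again this is only a sketch with variable names of my own choosing, using Python's `statistics.NormalDist` in place of the calculator.

```python
from statistics import NormalDist

target = 8.0     # ounces every customer expects
mu = 8.1         # mean setting the company will accept

# Fraction of cups under 8 oz if the standard deviation stays at 0.2 oz
share_short = NormalDist(mu, 0.2).cdf(target)

# Standard deviation needed so that only 5% of cups fall under 8 oz:
# solve z = (target - mu) / sigma for sigma, with z cutting off the lowest 5%
z = NormalDist().inv_cdf(0.05)
sigma_needed = (target - mu) / z

# Check: with that sigma, the share under 8 oz should be 5%
share_short_new = NormalDist(mu, sigma_needed).cdf(target)

print(f"shorted at sigma = 0.2: {share_short:.4f}")
print(f"sigma needed: {sigma_needed:.4f} oz")
print(f"shorted at new sigma: {share_short_new:.4f}")
```

Notice the trade: the company keeps its lower mean only by paying for a much more consistent machine.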

Requirement
From this point forward it will no longer be sufficient simply to find the requested values. You must conclude every problem with a concluding sentence. We found the standard deviation of .0608, so our concluding sentence would be something on the order of:
To ensure 95% of customers receive at least 8 oz with a mean of 8.1 oz, the standard deviation would need to be .0608 oz.