Chapter 5. Understanding and Comparing Distributions

STAT 141 Introduction to Statistics Chapter 5 Understanding and Comparing Distributions Bin Zou (bzou@ualberta.ca) STAT 141 University of Alberta Winter 2015 1 / 27

Boxplots
How to create a boxplot? Assume we are given the histogram and the 5-number summary.

Step 1: Draw a box with its bottom at Q1 and its top at Q3, then insert a line at Q2 (the median). Note: the red lines and the labels Q1, Q2, Q3 are NOT necessary; they are for illustration only.
Step 2: Draw two fences: upper fence = Q3 + 1.5 × IQR, lower fence = Q1 − 1.5 × IQR.
Step 3: Draw the whiskers: lines from the ends of the box to the largest and smallest values within the fences.
Step 4: Add the outliers (observations outside the fences) with special symbols.
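Steps 2 to 4 can be sketched in a few lines of Python. This is only an illustration: the function name `boxplot_parts` and the sample values are made up, and Q1 and Q3 are assumed to be given (as on the slide) rather than computed.

```python
def boxplot_parts(data, q1, q3):
    """Return the fences, whisker ends and outliers of a boxplot.

    q1 and q3 are assumed to come from a given 5-number summary.
    """
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [y for y in data if lower_fence <= y <= upper_fence]
    # Observations outside the fences are plotted individually as outliers.
    outliers = [y for y in data if y < lower_fence or y > upper_fence]
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


# Hypothetical sample; Q1 = 14 and Q3 = 20 are assumed to be given.
sample = [1, 12, 14, 15, 16, 18, 19, 20, 22, 40]
parts = boxplot_parts(sample, q1=14, q3=20)
print(parts)
```

Here IQR = 6, so the fences sit at 5 and 29, the whiskers stop at 12 and 22, and the values 1 and 40 are flagged as outliers.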

Summary of Boxplots
The bottom line and the top line of the box are Q1 and Q3. The height of the box is the IQR.
The line inside the box is the median (Q2).
If the median line is centred in the box, this suggests that the distribution is symmetric.
If the median line is closer to the bottom (Q1), equivalently Q2 − Q1 < Q3 − Q2, the distribution is right skewed.
If the median line is closer to the top (Q3), equivalently Q2 − Q1 > Q3 − Q2, the distribution is left skewed.
Boxplots can also be drawn horizontally.
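The comparison of Q2 − Q1 with Q3 − Q2 can be written as a small check. This is a minimal sketch; the function name `shape_from_quartiles` and the quartile values are hypothetical, and the rule only looks at the middle half of the data.

```python
def shape_from_quartiles(q1, q2, q3):
    """Apply the boxplot rule of thumb to the three quartiles."""
    lower_half = q2 - q1  # distance from the median line to the bottom of the box
    upper_half = q3 - q2  # distance from the median line to the top of the box
    if lower_half < upper_half:
        return "right skewed"
    if lower_half > upper_half:
        return "left skewed"
    return "symmetric"


print(shape_from_quartiles(10, 12, 20))  # median closer to Q1
print(shape_from_quartiles(10, 18, 20))  # median closer to Q3
print(shape_from_quartiles(10, 15, 20))  # median centred
```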

Comparing Groups with Boxplots
Conclusions: wind speeds are low in the summer. They tend to decrease from January to August and then increase again. January has the strongest winds and the largest spread.

Chapter 6. The Standard Deviation and the Normal Model

z-score
The z-score, also called the standardized value, is a measure of relative standing. Assume y is an observation from a sample with mean ȳ and standard deviation s. Then the z-score of y is defined as
z = (y − ȳ) / s.
This is the most important formula for the midterm.


The z-score tells us how many standard deviations the measurement lies from the mean, and in which direction.
Positive z-score: the observation is greater than the mean.
Negative z-score: the observation is smaller than the mean.
Zero z-score: the observation is equal to the mean.
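A minimal sketch of the z-score formula, using Python's `statistics` module. The sample values and the helper name `z_score` are made up for illustration; `stdev` is the sample standard deviation s.

```python
from statistics import mean, stdev


def z_score(y, sample):
    """z = (y - sample mean) / sample standard deviation."""
    return (y - mean(sample)) / stdev(sample)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5

for y in (9, 2, 5):
    # The sign shows the direction from the mean, the size shows how many s away.
    print(y, round(z_score(y, scores), 2))
```

The observation 9 lies above the mean and gets a positive z-score, 2 lies below and gets a negative one, and 5 equals the mean and gets zero.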

Shifting Data
Add a constant c to (or subtract it from) each value of the data.
Results: all measures of position (centre, percentiles, minimum, maximum) increase (or decrease) by the same constant c. However, the measures of spread (range, IQR, standard deviation) do NOT change.
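The effect of shifting can be checked numerically. The sketch below uses a made-up data set and shift c = 10; the helper `summarize` is hypothetical and reports only a few of the measures named on the slide.

```python
from statistics import mean, median, stdev


def summarize(values):
    """A few measures of position and spread."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sd": stdev(values),
    }


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
c = 10                            # hypothetical shift
shifted = [y + c for y in data]

before = summarize(data)
after = summarize(shifted)
for name in before:
    print(name, before[name], after[name])
```

The mean, median, minimum and maximum all move up by 10, while the range and the standard deviation stay the same.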

Rescaling Data
Multiply (or divide) all the data values by a positive constant d. In formula, y_new = d × y_original.
Result: position_new = d × position_original, and spread_new = d × spread_original.
Standardizing into z-scores involves shifting down by the value of the mean and rescaling (dividing) by the value of the standard deviation.
Standardizing into z-scores changes the centre by making the mean 0.
Standardizing into z-scores changes the spread by making the standard deviation 1.
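Both claims, that rescaling multiplies position and spread by d and that standardizing gives mean 0 and standard deviation 1, can be verified on a small example. The data and the factor d = 3 below are made up for illustration.

```python
from statistics import mean, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
d = 3                             # hypothetical positive scaling factor

# Rescaling: every value is multiplied by d.
rescaled = [d * y for y in data]
print(mean(data), mean(rescaled))    # the mean is multiplied by d
print(stdev(data), stdev(rescaled))  # the standard deviation is multiplied by d

# Standardizing: shift by the mean, then divide by the standard deviation.
y_bar, s = mean(data), stdev(data)
z_scores = [(y - y_bar) / s for y in data]
print(round(mean(z_scores), 10), round(stdev(z_scores), 10))
```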

Density Curve
Note: the area enclosed by the density curve and the x-axis is always 1. Why? Because the relative frequencies add up to 1.

Histogram vs. Density Curve
Both describe the overall shape of the data, but a density curve is smooth (without sharp corners). You can think of a density curve as the limiting case of a histogram as the class width approaches 0 (the rectangles get narrower).

As shown by the graph, the area under the density curve between a and b is the proportion (percentage) of observations that fall in [a, b]. What if we want to know the proportion of observations that lie below a or above b? Note: with a density curve, we do NOT discuss the proportion of observations that are exactly equal to a or b (the area above a single point is 0).

Normal Model
The density curve of a normal distribution (normal model) is bell-shaped, symmetric and unimodal. Its shape is determined by two parameters: the mean µ (which is also the median and the mode) and the standard deviation σ. The graph on this slide is the density curve of the standard normal distribution, with µ = 0 and σ = 1.
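The area properties of density curves can be checked numerically for the standard normal model. The sketch below assumes Python 3.8+ (for `statistics.NormalDist`). The helper `area_under` is a hypothetical trapezoid-rule approximation, and the interval [−0.5, 1.5] is an arbitrary choice.

```python
from statistics import NormalDist


def area_under(pdf, a, b, steps=10_000):
    """Trapezoid-rule approximation of the area under pdf between a and b."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width


std_normal = NormalDist(mu=0, sigma=1)

# Practically all of the area lies between -8 and 8, so this is close to 1.
total_area = area_under(std_normal.pdf, -8, 8)

# The area between a and b is the proportion of values falling in [a, b].
a, b = -0.5, 1.5
area_ab = area_under(std_normal.pdf, a, b)
proportion_ab = std_normal.cdf(b) - std_normal.cdf(a)

print(round(total_area, 6), round(area_ab, 6), round(proportion_ab, 6))
```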

Standard Normal Model
Recall what we have learnt about shifting and rescaling data. Assume we are given a normal model with mean µ and standard deviation σ (in short, N(µ, σ), where N stands for normal distribution). By subtracting µ from every value and then dividing by σ (exactly as for the z-score), we obtain the standard normal model:
z = (y − µ) / σ.
Thus, only the distribution of the standard normal model is provided in the table.

The 68-95-99.7 Rule
Does this graph look familiar to you? If NO, go back to the slide on the Empirical Rule in Chapter 4. In a normal model, approximately 68% of the values fall within one standard deviation of the mean, approximately 95% within two standard deviations, and approximately 99.7% within three.
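The three percentages can be reproduced from the normal model itself. This is a minimal sketch assuming Python 3.8+ (`statistics.NormalDist`); the helper name `within_k_sd` is hypothetical.

```python
from statistics import NormalDist


def within_k_sd(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) model within k standard deviations of the mean."""
    model = NormalDist(mu, sigma)
    return model.cdf(mu + k * sigma) - model.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```

The result does not depend on µ and σ; for example, `within_k_sd(1, mu=266, sigma=16)` gives the same proportion as `within_k_sd(1)`.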

Normal Table
Important! You must know how to use the normal table!
The normal table gives the proportion in the left tail (the shaded area) of the standard normal model below a given value z. The value of z is read from the two side bars: the integer part and the first decimal from the vertical bar, and the second decimal from the horizontal bar.
Example: to find the proportion of values below −3.65, we first locate the row for −3.6 in the z column, then locate the second decimal 0.05 in the top row. The unique intersection, 0.0001 (0.01% as a percentage), gives the answer.
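The row-and-column lookup can be imitated in code. This is a sketch rather than the table itself: it assumes Python 3.8+ (`statistics.NormalDist`), the function name `table_lookup` is hypothetical, and entries are rounded to four decimals as in a printed table.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal model: mean 0, standard deviation 1


def table_lookup(row, column):
    """Left-tail proportion for the z built from a row and a column label.

    row    : integer part and first decimal of z, e.g. -3.6 or 0.1
    column : second decimal of z, e.g. 0.05
    On the negative page the second decimal moves z further below zero.
    (The ambiguous row label -0.0 is not handled in this sketch.)
    """
    z = row - column if row < 0 else row + column
    return round(std_normal.cdf(z), 4)


print(table_lookup(-3.6, 0.05))  # left tail below z = -3.65
print(table_lookup(0.1, 0.09))   # left tail below z = 0.19
```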

The first page of the z-table covers negative z values from −3.99 to 0, while the second page covers the positive side. Both pages still give the left tail.
What if a question asks for the proportion of observations above a number (the right tail), say greater than 0.19? From the table, the proportion of observations that fall below 0.19 is 0.5753. Since the total area is 1, the area of the right tail is 1 − 0.5753 = 0.4247 = 42.47%.

Some people have no interest in either tail; they care instead about the middle portion of the standard normal model.
Example: what is the proportion of values between −0.52 and 1.19 in the standard normal model? From the table, we find two numbers: 0.3015 (for −0.52) and 0.8830 (for 1.19). (Please check!) These two numbers are the left-tail areas below −0.52 and below 1.19. To get the area between the two values, we only need to subtract: 0.8830 − 0.3015 = 0.5815 = 58.15%.

Quick Summary
Find the proportion of values in an interval. The interval can take only three forms.
z < a, the values below a (left tail): directly report the number found in the table.
a < z < b, the values between a and b (middle interval): bigger number (found using b) − smaller number (found using a).
z > b, the values above b (right tail): 1 − the number found in the table.
Not standard normal? Convert to the standard normal by z = (y − µ) / σ.
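The three cases can be collected in one helper. This is a minimal sketch assuming Python 3.8+ (`statistics.NormalDist`); the function name `proportion` and its parameters are hypothetical. It works with unrounded left-tail areas, so results may differ from table-based answers in the fourth decimal.

```python
from statistics import NormalDist


def proportion(lower=None, upper=None, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) model between lower and upper.

    Leave lower as None for a left tail and upper as None for a right tail.
    """
    model = NormalDist(mu, sigma)
    below_upper = 1.0 if upper is None else model.cdf(upper)
    below_lower = 0.0 if lower is None else model.cdf(lower)
    return below_upper - below_lower


print(round(proportion(upper=0.19), 4))              # left tail below 0.19
print(round(proportion(lower=0.19), 4))              # right tail above 0.19
print(round(proportion(lower=-0.52, upper=1.19), 4)) # middle interval
```

For a non-standard model, passing `mu` and `sigma` has the same effect as converting to z first.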

From Percentiles to Scores
We have just learnt how to find proportions from z-scores. Now we study how to go backwards, finding z-scores for given percentiles.
1. Obtain the proportion below z (the left tail). Think of the three cases discussed on the previous slide.
2. In the normal table, find the number (with four decimals) that is closest to this proportion.
3. From the position of that number, identify the value of the z-score.

Examples
Example 6.1 Suppose we want to find the z-score, z*, that cuts off the smallest 2% in the standard normal model.
"Smallest" indicates the left tail. So in this question the proportion of the left tail is given directly: 2%, or 0.0200. The closest number to 0.0200 in the normal table is 0.0202. Do NOT look for 0.2 in the z column: the proportion is known, but the z-score is unknown. From the position of 0.0202, reading across to the z column gives −2.0, and reading up to the top row gives 0.05. Hence, the z-score is z* = −2.05.
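The reverse lookup (find the table entry closest to a given proportion, then read off z) can be imitated by scanning all z values that appear in a table. This is a sketch assuming Python 3.8+ (`statistics.NormalDist`); the function name `closest_z` and the grid from −3.99 to 3.99 are assumptions modelled on a typical printed table.

```python
from statistics import NormalDist


def closest_z(p):
    """z (two decimals) whose four-decimal table entry is closest to p."""
    std_normal = NormalDist()
    z_grid = [i / 100 for i in range(-399, 400)]  # -3.99, -3.98, ..., 3.99
    # If two entries are equally close, min() keeps the first one it meets.
    return min(z_grid, key=lambda z: abs(round(std_normal.cdf(z), 4) - p))


print(closest_z(0.02))  # smallest 2%
```

For 0.0200 the closest entry is 0.0202, which belongs to z = −2.05.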

Example 6.2 Suppose now we are interested in the largest 5%.
"Largest" means the right tail. So we are looking for z* such that the area of (z > z*) is 5%, or 0.05. The corresponding area of the left tail is then 0.95. From the table, we find 0.9495 and 0.9505, which are equally close to 0.95. Notice that 0.9495 gives z = 1.64 and 0.9505 gives z = 1.65. In this special case, since 0.95 is exactly halfway between 0.9495 and 0.9505, we take the z-score to be halfway between 1.64 and 1.65 as well. The solution is then z* = 1.645.
Note: remember this example! You will need this result many times throughout the course.
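The halfway value 1.645 can be compared with a direct computation. This is a minimal sketch assuming Python 3.8+, where `statistics.NormalDist.inv_cdf` returns the z-score for a given left-tail proportion; the variable names are made up.

```python
from statistics import NormalDist

std_normal = NormalDist()

# The two neighbouring table entries around 0.95.
entry_low = round(std_normal.cdf(1.64), 4)
entry_high = round(std_normal.cdf(1.65), 4)

# Halfway rule from the example versus the directly computed z-score.
halfway = (1.64 + 1.65) / 2
computed = std_normal.inv_cdf(0.95)

print(entry_low, entry_high)
print(halfway, round(computed, 4))
```

The computed value agrees with 1.645 to three decimals, which supports the halfway rule used here.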

Example 6.3 Now we want to find the z-scores that enclose the middle 95%. Equivalently, we are looking for z* such that the area between −z* and z* is 0.95.
Can you tell why these two statements are equivalent? Remember that all normal distributions are symmetric, including the standard normal model. After setting aside the middle 95%, we are left with 5% for the two tails, which have equal area. Hence, each tail accounts for 2.5%. Using the left-tail proportion 0.0250, we find −z* = −1.96, so z* = 1.96.
Another way: the area below z* is 0.025 + 0.95 = 0.975, which yields the same z-score z* = 1.96.
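The symmetry argument translates directly into code. This is a minimal sketch assuming Python 3.8+ (`statistics.NormalDist`); the function name `middle_cutoff` is hypothetical.

```python
from statistics import NormalDist


def middle_cutoff(middle):
    """z* such that the area between -z* and z* equals `middle`."""
    tail = (1 - middle) / 2  # the two tails share the rest equally
    return NormalDist().inv_cdf(1 - tail)  # the area below z* is middle + tail


z_star = middle_cutoff(0.95)
area = NormalDist().cdf(z_star) - NormalDist().cdf(-z_star)
print(round(z_star, 2), round(area, 4))
```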

If the normal model in the question is not standard, use the standardization z = (y − µ) / σ to convert the non-standard model into the standard one.
Example 6.4 Assume that the length of a human pregnancy follows a normal distribution with mean 266 days and standard deviation 16 days. What proportion of human pregnancies last longer than 280 days?
Let y denote the length of a human pregnancy; then y ~ N(266, 16). What is the area (y > 280)? Using the standardization z = (y − µ) / σ, we convert y into z (the standard normal):
Area (y > 280) = area (z > (280 − 266) / 16 = 0.875 ≈ 0.88) = 1 − 0.8106 = 0.1894.
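The pregnancy example can be reproduced in two ways: following the table (z rounded to 0.88), and directly from the N(266, 16) model without rounding. This is a sketch assuming Python 3.8+ (`statistics.NormalDist`); the variable names are made up.

```python
from statistics import NormalDist

mu, sigma = 266, 16  # mean and standard deviation in days, from the example

# Standardize the cut-off of 280 days.
z = (280 - mu) / sigma
print(z)

# Table route: round z to 0.88 and use the four-decimal left-tail entry.
table_answer = 1 - round(NormalDist().cdf(0.88), 4)

# Direct route: right tail of the N(266, 16) model, no rounding of z.
direct_answer = 1 - NormalDist(mu, sigma).cdf(280)

print(round(table_answer, 4), round(direct_answer, 4))
```

The two answers differ slightly in the third decimal, only because the table route rounds z from 0.875 to 0.88.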

Example 6.5 Assume a variable y is normally distributed with µ = 10 and σ = 2. Find the value that cuts off the smallest 10% of this distribution.
Find y* such that area (y < y*) = 0.1. Equivalently, find z* such that area (z < z*) = 0.1, where z* = (y* − µ) / σ.
Note: after standardization, y becomes z and y* becomes z*, but the direction of the inequality stays the same.
From the proportion of 10%, we obtain z* = −1.28. Rewriting z* = (y* − µ) / σ gives y* = µ + σ z*. Hence, y* = 10 + 2 × (−1.28) = 7.44.
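The back-conversion y* = µ + σ z* can be sketched as follows, assuming Python 3.8+ (`statistics.NormalDist`). The function name `value_at_left_tail` is hypothetical, and the code uses the unrounded z* rather than the table value −1.28.

```python
from statistics import NormalDist


def value_at_left_tail(p, mu, sigma):
    """y* such that a proportion p of the N(mu, sigma) model lies below y*."""
    z_star = NormalDist().inv_cdf(p)  # z-score with left-tail area p
    return mu + sigma * z_star        # undo the standardization


y_star = value_at_left_tail(0.10, mu=10, sigma=2)
print(round(y_star, 2))
```

Rounded to two decimals this matches the table-based answer 7.44. Calling `NormalDist(10, 2).inv_cdf(0.10)` gives the same value in one step.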