Finding Quartiles. Use the median to divide the ordered data set into two halves. If n is odd, do not include the median in either half. If n is even, split the data set exactly in half. Q1 is the median of the lower half of the data. Q3 is the median of the upper half of the data.

Example with n odd. Data: -13, -10, -3, 6, 12, 18, 45, 56, 71. n = 9 is odd, so the median is 12. Excluding the median, the halves are {-13, -10, -3, 6} and {18, 45, 56, 71}.
Q1 is the median of the first half {-13, -10, -3, 6}, so Q1 = (-10 + (-3))/2 = -6.5.
Q3 is the median of the second half {18, 45, 56, 71}, so Q3 = (45 + 56)/2 = 50.5.

Example with n even. Data: -13, -10, -3, 6, 12, 18, 45, 56, 71, 96. n = 10 is even, so the median is (12 + 18)/2 = 15. Splitting the data exactly in half gives {-13, -10, -3, 6, 12} and {18, 45, 56, 71, 96}.
Q1 is the median of the first half {-13, -10, -3, 6, 12}, so Q1 = -3.
Q3 is the median of the second half {18, 45, 56, 71, 96}, so Q3 = 56.
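The median-of-halves rule above can be sketched in a few lines of Python. The function name `quartiles` is a hypothetical helper, not something from the slides, and the sketch assumes at least two observations. Other software may use a different quartile convention and give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Assumes len(data) >= 2, so that both halves are non-empty.
    """
    xs = sorted(data)          # the rule requires ordered data
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Odd n: skip the middle value. Even n: the halves meet exactly.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)

odd_data = [-13, -10, -3, 6, 12, 18, 45, 56, 71]
even_data = odd_data + [96]

print(quartiles(odd_data))   # (-6.5, 12, 50.5)
print(quartiles(even_data))  # (-3, 15.0, 56)
```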

Chebyshev's rule. For any distribution, at least 1 - 1/k^2 of the observations fall within k standard deviations of the mean, i.e. in [µ - k*σ, µ + k*σ], where k ≥ 1. Chebyshev's rule holds for any distribution, whereas the empirical rule is valid only for approximately symmetric, unimodal (mound-shaped) distributions. If k = 1, the bound is 1 - 1/1 = 0, so Chebyshev's rule gives no useful information. For k = 2, at least 1 - 1/4 = 75% of observations fall within 2 standard deviations of the mean. For k = 3, at least 1 - 1/9 = 88.9% of observations fall within 3 standard deviations of the mean.

Examples. Suppose the mean height of Japanese people is µ = 5.5 feet and the standard deviation is σ = 1 foot. What share of Japanese people are between 3.5 feet and 7.5 feet tall? 3.5 = 5.5 - k*1 and 7.5 = 5.5 + k*1, hence k = 2. By Chebyshev's rule, at least 1 - 1/2^2 = 75% lie between 3.5 and 7.5 feet. Between what heights can you find at least 93.75% of Japanese people? 93.75% = 15/16 = 1 - 1/4^2, hence k = 4. By Chebyshev's rule, at least 93.75% of Japanese people lie between 5.5 - 4*1 = 1.5 feet and 5.5 + 4*1 = 9.5 feet.
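Both directions of the height example (from k to a percentage, and from a percentage back to k) can be checked with a short sketch. The helper names `chebyshev_bound` and `chebyshev_k` are hypothetical and chosen only for illustration.

```python
import math

def chebyshev_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - 1 / k**2

def chebyshev_k(fraction):
    """Smallest k whose Chebyshev bound reaches `fraction` (0 <= fraction < 1)."""
    return 1 / math.sqrt(1 - fraction)

mu, sigma = 5.5, 1.0           # mean and standard deviation from the example

k = (7.5 - mu) / sigma         # 7.5 = mu + k * sigma, so k = 2
print(chebyshev_bound(k))      # 0.75

k4 = chebyshev_k(0.9375)       # 93.75% = 1 - 1/k^2, so k = 4
print(k4, mu - k4 * sigma, mu + k4 * sigma)  # 4.0 1.5 9.5
```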

Empirical rule. For approximately symmetric, unimodal (bell-shaped or mound-shaped) distributions: Approximately 68% of observations fall within 1 standard deviation of the mean. Approximately 95% of observations fall within 2 standard deviations of the mean. Approximately 99.7% of observations fall within 3 standard deviations of the mean.
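As a rough check of where the three percentages come from, the sketch below assumes the mound-shaped distribution is exactly normal (an assumption made here for illustration; the rule itself only claims approximate values). It uses `NormalDist` from Python's standard library, and `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, f"{100 * share_within(k):.1f}%")
```

For comparison, Chebyshev's rule guarantees only 0%, 75% and 88.9% for the same values of k, because it makes no assumption about shape.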

Box Plot. A box plot is another graphical representation of quantitative data, built from the following 5-number summary: 1. Minimum value, 2. Lower quartile, 3. Median (the middle value), 4. Upper quartile, 5. Maximum value. NOTE: The data must be ordered from lowest to highest value before finding the 5-number summary.

Box Plots. A box plot represents the five-number summary (minimum, lower quartile, median, upper quartile, maximum). About half the data are in the box, and about one-quarter of the data are in each whisker. If one whisker or one side of the box is noticeably longer than the other, the data are skewed in that direction. Box plots are very useful for comparing distributions. (Figure: a box plot of data skewed to the left.)

Box Plot. A box plot is a pictorial representation of the 5-number summary.

Outliers. Any observation farther than 1.5 times the IQR from the closest boundary of the box is an outlier. If it is farther than 3 times the IQR, it is an extreme outlier; otherwise it is a mild outlier. Outliers can also be shown in a box plot by drawing the whiskers only up to 1.5 times the IQR on each side and marking the outliers with stars, crosses or other symbols.

An example. Suppose min = 2, Q1 = 18, median = 20, Q3 = 22, max = 35. Which of the following observations are outliers? A. 10 B. 15 C. 25 D. 30
Lower fence = Q1 - 1.5*IQR = 18 - 1.5(22 - 18) = 12.
Upper fence = Q3 + 1.5*IQR = 22 + 1.5(22 - 18) = 28.
Note: All observations below the lower fence or above the upper fence are considered outliers. Here 10 lies below 12 and 30 lies above 28, so A and D are outliers, while 15 and 25 are not.
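The fence calculation, together with the mild/extreme distinction from the outlier definition, can be sketched as follows. The function names `fences` and `classify` are hypothetical, and the quartiles are the ones given in the example.

```python
def fences(q1, q3, factor=1.5):
    """Return (lower, upper) fences at `factor` times the IQR beyond the box."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def classify(x, q1, q3):
    """Label x as 'extreme outlier', 'mild outlier' or 'not an outlier'."""
    inner_lo, inner_hi = fences(q1, q3, 1.5)
    outer_lo, outer_hi = fences(q1, q3, 3.0)
    if x < outer_lo or x > outer_hi:
        return "extreme outlier"
    if x < inner_lo or x > inner_hi:
        return "mild outlier"
    return "not an outlier"

q1, q3 = 18, 22
print(fences(q1, q3))  # (12.0, 28.0)
for x in (10, 15, 25, 30):
    print(x, classify(x, q1, q3))
```

With these quartiles the 3*IQR fences are 6 and 34, so 10 and 30 are mild outliers, and only a value such as the minimum, 2, would count as extreme.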

Histogram vs. Box plot. Both histograms and box plots capture the symmetry or skewness of a distribution. A box plot cannot show the modality of the data. A box plot is much better for finding outliers. The shape of a histogram depends to some extent on the choice of bins.

Comparing Distributions. We can compare the distributions of several data sets using box plots (or the 5-number summary) or histograms. We shall first compare distributions using box plots.

Which type of car has the largest median time to accelerate? A. Upscale B. Sports C. Small D. Large E. Family

Which type of car has the smallest median time to accelerate? A. Upscale B. Sports C. Small D. Large E. Luxury

Which type of car always takes less than 3.6 seconds to accelerate? A. Upscale B. Sports C. Small D. Large E. Luxury

Which type of car has the smallest IQR for time to accelerate? A. Upscale B. Sports C. Small D. Large E. Luxury

What is the shape of the distribution of acceleration times for luxury cars? A. Left skewed B. Right skewed C. Roughly symmetric D. Cannot be determined from the information given.

What percent of luxury cars accelerate to 30 mph in less than 3.5 seconds? A. Roughly 25% B. Exactly 37.5% C. Roughly 50% D. Roughly 75% E. Cannot be determined from the information given

What percent of family cars accelerate to 30 mph in less than 3.5 seconds? A. Less than 25% B. More than 50% C. Less than 50% D. Exactly 75% E. None of the above

Z-Scores. How to compare apples with oranges? A college admissions committee is looking at the files of two candidates, one with a total SAT score of 1500 and another with an ACT score of 22. Which candidate scored better? How do we compare things when they are measured on different scales? We need to standardize the values.

How to standardize? Subtract the mean from the value, then divide the difference by the standard deviation. The standardized value is the z-score: z-score = (value - mean) / (std. dev.). z-scores are free of units.

z-scores: An Example. Data: 4, 3, 10, 12, 8, 9, 3 (n = 7 in this case). Mean = (4+3+10+12+8+9+3)/7 = 49/7 = 7. Standard deviation = 3.65.

Original Value    z-score
--------------------------------------
 4                (4 - 7)/3.65  = -0.82
 3                (3 - 7)/3.65  = -1.10
10                (10 - 7)/3.65 =  0.82
12                (12 - 7)/3.65 =  1.37
 8                (8 - 7)/3.65  =  0.27
 9                (9 - 7)/3.65  =  0.55
 3                (3 - 7)/3.65  = -1.10
--------------------------------------
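The table can be reproduced with Python's standard library. The value 3.65 matches the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes; treating that as the intended definition is an inference from the numbers, not something the example states.

```python
from statistics import mean, stdev

data = [4, 3, 10, 12, 8, 9, 3]

m = mean(data)
s = stdev(data)  # sample standard deviation (divides by n - 1)

# z-score = (value - mean) / standard deviation
z_scores = [(x - m) / s for x in data]

print(m, round(s, 2))
for x, z in zip(data, z_scores):
    print(f"{x:>3}  {z:6.2f}")
```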

Interpretation of z-scores. A z-score measures the distance of a data value from the mean in units of the standard deviation. A z-score of 1 means the data value is 1 standard deviation above the mean. A z-score of -1.2 means the data value is 1.2 standard deviations below the mean. Regardless of direction, the further a data value is from the mean, the more unusual it is. A z-score of -1.3 is more unusual than a z-score of 1.2.

How to use z-scores? A college admissions committee is looking at the files of two candidates, one with a total SAT score of 1500 and another with an ACT score of 22. Which candidate scored better? SAT scores: mean = 1600, std. dev. = 500. ACT scores: mean = 23, std. dev. = 6. The SAT score of 1500 has z-score = (1500 - 1600)/500 = -0.2. The ACT score of 22 has z-score = (22 - 23)/6 = -0.17. Since -0.17 > -0.2, the ACT score of 22 is better than the SAT score of 1500.

Which is more unusual? Heights of adult women have a mean of 63.6 in and a std. dev. of 2.5 in. Heights of adult men have a mean of 69.0 in and a std. dev. of 2.8 in. A. A 58-in-tall woman: z-score = (58 - 63.6)/2.5 = -2.24. B. A 64-in-tall man: z-score = (64 - 69)/2.8 = -1.79. C. They are the same.

Using z-scores to solve problems. An example using height data and the height requirements of the U.S. Army and U.S. Marine Corps. Question: Are the height restrictions set by the U.S. Army and the U.S. Marine Corps more restrictive for men or for women, or are they roughly the same?

Data from a National Health Survey. Heights of adult women have a mean of 63.6 in and a standard deviation of 2.5 in. Heights of adult men have a mean of 69.0 in and a standard deviation of 2.8 in.

Height Restrictions    Men Minimum    Women Minimum
U.S. Army              60 in          58 in
U.S. Marine Corps      64 in          58 in

Heights of adult men have a mean of 69.0 in and a standard deviation of 2.8 in. Heights of adult women have a mean of 63.6 in and a standard deviation of 2.5 in.

                     Men Minimum                                  Women Minimum
U.S. Army            60 in, z-score = -3.21 (less restrictive)    58 in, z-score = -2.24 (more restrictive)
U.S. Marine Corps    64 in, z-score = -1.79 (more restrictive)    58 in, z-score = -2.24 (less restrictive)

The Army minimum is more restrictive for women; the Marine Corps minimum is more restrictive for men.
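The comparison can be sketched in Python as follows. The sketch reads "more restrictive" as "the minimum with the higher (less negative) z-score", since that cutoff sits closer to the group's mean. The dictionary layout and the name `stricter_for` are illustrative choices.

```python
# (mean, standard deviation) of adult heights in inches, from the survey data.
heights = {"men": (69.0, 2.8), "women": (63.6, 2.5)}

# Minimum height requirements in inches.
minimums = {
    "U.S. Army": {"men": 60, "women": 58},
    "U.S. Marine Corps": {"men": 64, "women": 58},
}

def z_score(value, mean, sd):
    return (value - mean) / sd

stricter_for = {}
for branch, limits in minimums.items():
    z = {sex: z_score(limit, *heights[sex]) for sex, limit in limits.items()}
    # The higher z-score marks the cutoff closer to that group's mean.
    stricter_for[branch] = max(z, key=z.get)
    rounded = {sex: round(value, 2) for sex, value in z.items()}
    print(branch, rounded, "-> stricter for", stricter_for[branch])
```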

Effect of Standardization. Standardization into z-scores does not change the shape of the histogram. Standardization into z-scores changes the center of the distribution by making the mean 0. Standardization into z-scores changes the spread of the distribution by making the standard deviation 1.
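A quick numerical check of the center and spread claims, using the seven values from the z-score example. `standardize` is a hypothetical helper, and the sample standard deviation (divisor n - 1) is assumed. Floating-point arithmetic gives a mean that is zero only up to rounding error.

```python
from statistics import mean, stdev

def standardize(data):
    """Convert each value to its z-score using the sample standard deviation."""
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]

data = [4, 3, 10, 12, 8, 9, 3]
z = standardize(data)

print(abs(mean(z)) < 1e-12)       # center moved to 0 (up to rounding)
print(abs(stdev(z) - 1) < 1e-12)  # spread rescaled to 1 (up to rounding)

# The ordering of the values is unchanged, so the histogram keeps its shape.
rank = lambda values: sorted(range(len(values)), key=values.__getitem__)
print(rank(data) == rank(z))
```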

Z-score and Empirical Rule. When data are bell-shaped, the z-scores of the data values follow the empirical rule.

Outlier detection with z-scores. The empirical rule tells us that if the data are mound-shaped, almost all data points lie within plus or minus 3 standard deviations of the mean. So an observation whose z-score has absolute value larger than 3 can be considered an outlier.
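A minimal sketch of this rule follows. The sample values are made up purely for illustration, `z_outliers` is a hypothetical helper, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def z_outliers(data, threshold=3.0):
    """Return the values whose z-score exceeds `threshold` in absolute value."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs((x - m) / s) > threshold]

# Hypothetical sample: fourteen values near 10 and one far away.
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 11, 10, 40]
print(z_outliers(sample))  # [40]
```

Note that the suspect value itself inflates the mean and standard deviation used to judge it, which is one reason the IQR-based fences are often preferred for skewed data or small samples.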

2004 Olympics Women's Heptathlon. Austra Skujyte (Lithuania): Shot Put = 16.40 m, Long Jump = 6.30 m. Carolina Kluft (Sweden): Shot Put = 14.77 m, Long Jump = 6.78 m.

                          Shot Put    Long Jump
Mean (all contestants)    13.29 m     6.16 m
Std. Dev.                 1.24 m      0.23 m
n                         28          26

Which performance was better? A. Skujyte's shot put: z-score = (16.40 - 13.29)/1.24 = 2.51. B. Kluft's long jump: z-score = (6.78 - 6.16)/0.23 = 2.70. C. Both were the same.

Based on shot put and long jump, whose performance was better? A. Skujyte's. z-scores: shot put = 2.51, long jump = 0.61. Total z-score = 2.51 + 0.61 = 3.12. B. Kluft's. z-scores: shot put = 1.19, long jump = 2.70. Total z-score = 1.19 + 2.70 = 3.89. C. Both were the same.
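The per-event and total z-scores can be recomputed from the event means and standard deviations given for all contestants. The dictionary layout and variable names below are illustrative, not part of the original example.

```python
# (mean, standard deviation) in metres over all contestants.
event_stats = {"shot_put": (13.29, 1.24), "long_jump": (6.16, 0.23)}

# Individual marks in metres.
marks = {
    "Skujyte": {"shot_put": 16.40, "long_jump": 6.30},
    "Kluft": {"shot_put": 14.77, "long_jump": 6.78},
}

totals = {}
for athlete, results in marks.items():
    z = {
        event: (mark - event_stats[event][0]) / event_stats[event][1]
        for event, mark in results.items()
    }
    totals[athlete] = sum(z.values())
    rounded = {event: round(value, 2) for event, value in z.items()}
    print(athlete, rounded, "total =", round(totals[athlete], 2))

print("Higher combined z-score:", max(totals, key=totals.get))
```

Adding z-scores works here because standardization puts both events on the same unit-free scale, so a metre of shot put and a metre of long jump are no longer being compared directly.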