Lecture 1: Descriptive Statistics

Lecture 1: Descriptive Statistics MSU-STT-351-Sum 15 (P. Vellaisamy: MSU-STT-351-Sum 15) Probability & Statistics for Engineers 1 / 56

Contents
1 Introduction
2 Branches of Statistics
   Descriptive Statistics
3 Histogram
4 Numerical Summary of Measures
5 Measure of Variability
6 Homework

Introduction

Why Statistics?
(i) It is the science that helps us understand many phenomena that occur in engineering, science, economics, finance, biology, etc.
(ii) It is the scientific way to make intelligent judgments and decisions from observed data that contain uncertainty and variation.

We start with two examples.
Example 1. The emission levels of HC (hydrocarbon) and CO (carbon monoxide) of a vehicle:

HC (gm/mile):  12.8  18.3  32.2  32.5
CO (gm/mile):  118   149   232   236

Question: What is the emission level of HC/CO? It is difficult to make a precise statement, as there is high variation in the observed levels.

Example 2. Marks of two students in 4 tests:

S1:  25  38  42  39
S2:  85  62  78  59

Question: Who is doing better? Any difficulty in answering? No statistical analysis is needed here: S2 scores higher than S1 on every test.

What is statistics?
(i) One-word definition:
    (a) Economics: Money
    (b) Philosophy: Why
    (c) Statistics: Variation
(ii) Layman's definition: Information/summary of data.
(iii) Formal definition: Statistics deals with techniques for how to
    (a) obtain information/data (sample)
    (b) analyze the data scientifically
    (c) draw valid conclusions/inferences
(iv) As a branch of mathematics, it deals with analytical techniques for analyzing the data in order to infer the characteristics of the population.

Population and Samples
Population: The set of all well-defined objects/elements (of interest) that are under investigation.
Example 1. The students studying engineering at MSU.
Example 2. The population of East Lansing.
If we collect information on all the elements in the population, we call it a census. Most often a census is impossible, as it involves a lot of time, effort and money.
Sample: A subset of the population that is selected for obtaining information.

Example 3. We may select 10 students from each discipline at MSU.
Often, we are interested in certain characteristics of the population (number of flaws in a piece of cloth, thickness of a capsule wall, monthly income of an individual, etc.). A characteristic may be
(i) Categorical (belongs to one of the categories)
    (a) Gender of a student (male/female)
    (b) Quality of a product (excellent/good/bad)
(ii) Numerical (measured as a real value)
    (a) Heights of students
    (b) Values of a stock

Types of Variables
A variable is any characteristic that changes over the objects in the population. It is denoted by x, y, z (or by X, Y, Z).
A variable X may be categorical (called a categorical variable) or numerical/quantitative (called a numerical variable).

Types of Data
(i) The data $X_1, X_2, \ldots, X_n$ (or $x_1, x_2, \ldots, x_n$) on a categorical variable X are called categorical data.
(ii) The data $X_1, X_2, \ldots, X_n$ (or $x_1, x_2, \ldots, x_n$) on a numerical variable X are called quantitative data.
Suppose we measure height = x and weight = y on n individuals, giving $(x_1, y_1), \ldots, (x_n, y_n)$. Then we have bivariate data. Multivariate data are defined similarly.

Branches of Statistics
(i) Descriptive Statistics: Deals with summarizing and describing important features of data (such as the mean, median, and standard deviation) using tabular or graphical methods.
(ii) Inferential Statistics: Deals with techniques for drawing inferences (generalizing to the population) and making predictions about the population, based on the information obtained from the sample.

Descriptive Statistics

1.2 Graphical (visual) Display of Univariate Data
Pictures often reveal useful information about data.

1.2.1 Graphs for Quantitative Data
(i) Stem-and-Leaf Display (Stem Plot)
This is a useful plot for displaying quantitative data.
Example 4. Consider the data on the pulse rates (per minute) of 10 patients: 45, 61, 60, 62, 65, 73, 75, 75, 78, 82

(i) Stem-and-Leaf Display (Stem Plot)
A stem plot gives the actual values, the extent of spread, the number and location of peaks, and the presence of any outlier.

8 | 2
7 | 3 5 5 8
6 | 0 1 2 5
5 |            (gap)
4 | 5          (outlier)

Stem: tens digit. Leaf: ones digit.
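The construction of the stem plot can be sketched in a few lines of Python. This is only an illustrative sketch, assuming non-negative integer data with the tens digit as stem and the ones digit as leaf; the function name `stem_and_leaf` is a hypothetical helper, not something from the course.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-plot rows (largest stem first) for non-negative integers.

    Assumption: stem = tens part, leaf = ones digit, as in Example 4.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Walk through every stem in the range so that empty stems (gaps) stay visible.
    for stem in range(max(leaves), min(leaves) - 1, -1):
        digits = " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}")
    return rows

pulse = [45, 61, 60, 62, 65, 73, 75, 75, 78, 82]  # Example 4 data
rows = stem_and_leaf(pulse)
for row in rows:
    print(row)
```

Keeping the empty stem 5 in the output is a deliberate choice: the gap is what makes the value 45 stand out as a possible outlier.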

(ii) The Dot Plot
Used when the data set is small or has few distinct values. Each observation is represented by a dot on a horizontal scale.

[Figure: dot plot of the pulse-rate data on a horizontal scale from 40 to 80]

This is similar to a stem plot, except that dots are used instead of digits.

Definition 1
A (quantitative) variable X is discrete if it takes finitely many or countably many values. It is continuous if it takes all values in an interval or on the whole real line.
Example 5. Let X = number of trials to get the first success. Then $X \in \{1, 2, \ldots\}$ and hence X is discrete. Suppose X = height of a student (in cm). Then $X \in [150, 190]$ and X is a continuous variable.
Let X be a discrete variable taking values in $S = \{1, 2, \ldots, I\}$, and let $X_1, \ldots, X_n$ be n data values on X. Then
frequency of $i \in S$ = number of values in the data $\{X_1, X_2, \ldots, X_n\}$ equal to i.
For $1 \le i \le I$, the relative frequency of i = (frequency of i)/n.

Example 6. Let X = number of children in a family. Then $X \in \{0, 1, 2, 3\}$. Suppose the data on 20 families in East Lansing are:
2, 0, 1, 2, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2.
Then the frequency table is

X      Frequency   Relative Frequency
0      1           1/20 = 0.05
1      6           6/20 = 0.30
2      9           9/20 = 0.45
3      4           4/20 = 0.20
Total  20          1.00
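As a quick check of the frequency table, the counts and relative frequencies can be tallied with Python's standard library. This is a minimal sketch; the variable names are illustrative only.

```python
from collections import Counter

# Number of children in each of the 20 families (Example 6)
children = [2, 0, 1, 2, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2]

n = len(children)
freq = Counter(children)                                # value -> frequency
rel_freq = {x: freq[x] / n for x in sorted(freq)}       # value -> relative frequency

for x in sorted(freq):
    print(x, freq[x], rel_freq[x])
print("Total", sum(freq.values()), sum(rel_freq.values()))
```

The relative frequencies always sum to 1, which is a useful sanity check when a table is built by hand.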

(iii) Histogram for Discrete Data
Take the x-values on the horizontal scale and the frequency (or relative frequency) along the vertical scale. Above each value, draw a rectangle whose height is equal to the frequency (or relative frequency). The histogram for Example 6 is:

[Figure: "Histogram of C1"; horizontal axis: C1 (values 0 to 3); vertical axis: frequency (0 to 9)]

A relative frequency histogram may be drawn similarly.

Histogram for Continuous Data (measurements)
Case 1. (Equal Width Case)
(i) The data assume real values, not necessarily integers.
(ii) Subdivide the range of the data into k subintervals or classes of equal length such that each observation lies in exactly one class.
(iii) Construct rectangles whose height is equal to the frequency (for a frequency histogram) or the relative frequency (for a relative frequency histogram).
Note:
(i) There are no hard-and-fast rules concerning k; usually, an integer between 5 and 20 will do.
(ii) For large data sets of size n, more classes should be used. A rule of thumb is $k \approx \sqrt{n}$.

Note: If all the data fall in one or two classes, or if most subintervals (of equal length) have low frequencies, it is better to use fewer classes of different lengths.

Histogram: For classes of different lengths:
(i) Decide the class intervals.
(ii) Construct the rectangles using the formula
     rectangle height = relative frequency / class width
     (so that area of rectangle = relative frequency).
(iii) The resulting rectangle heights are called densities.
(iv) The formula works for equal widths also.
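The density formula above can be sketched as follows. The class boundaries and counts in this example are hypothetical numbers chosen only to illustrate unequal class widths; they do not come from the lecture.

```python
def density_heights(edges, counts):
    """Rectangle heights for a density histogram.

    height = relative frequency / class width, so each rectangle's
    area equals the relative frequency of its class.
    """
    n = sum(counts)
    heights = []
    for left, right, count in zip(edges[:-1], edges[1:], counts):
        heights.append((count / n) / (right - left))
    return heights

# Hypothetical classes of unequal width: [0,2), [2,4), [4,8), [8,16)
edges = [0, 2, 4, 8, 16]
counts = [10, 20, 15, 5]

heights = density_heights(edges, counts)
areas = [h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:])]
print(heights)
print(sum(areas))
```

Because each area is a relative frequency, the total area under a density histogram is 1 regardless of how the class widths are chosen.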

Example 7. The following data represent the frequency distribution of the fracture strength (MPa) observations for ceramic bars fired in a particular kiln (read "81-<83" as 81 up to but not including 83):

Class     Freq
81-<83    6
83-<85    7
85-<87    17
87-<89    30
89-<91    43
91-<93    28
93-<95    22
95-<97    13
97-<99    3

(a) Construct a histogram based on relative frequencies, and comment on any interesting features.
(b) What proportion of strength observations are at least 85? Less than 95?
(c) What proportion of the observations are less than 90?

Solution:
(a) The histogram appears below. A representative value for this data would be x = 90. The histogram is reasonably symmetric, unimodal, and somewhat bell-shaped. The variation in the data is not small, since the spread of the data (99 - 81) = 18 constitutes about 20% of the typical value of 90.

[Figure: relative frequency histogram; horizontal axis: fracture strength (MPa), 81 to 99; vertical axis: relative frequency, 0 to 0.20]

(b) The proportion of the observations that are at least 85 is 1 - (6 + 7)/169 = 0.9231. The proportion less than 95 is 1 - (13 + 3)/169 = 0.9053, since only the last two classes (95-<97 and 97-<99) lie at or above 95.
(c) Note that x = 90 is the midpoint of the class 89-<91, which contains 43 observations (a relative frequency of 43/169 = 0.2544). Therefore, about half of this frequency, 0.1272, should be added to the relative frequencies for the classes to the left of x = 90. That is, the approximate proportion of the observations that are less than 90 is 0.0355 + 0.0414 + 0.1006 + 0.1775 + 0.1272 = 0.4822.
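The calculations for Example 7 can be reproduced from the class frequencies alone. The sketch below assumes, as in part (c), that observations are spread evenly within each class; the helper name `proportion_below` is hypothetical.

```python
# Class boundaries and frequencies from Example 7
edges = [81, 83, 85, 87, 89, 91, 93, 95, 97, 99]
freq = [6, 7, 17, 30, 43, 28, 22, 13, 3]

n = sum(freq)
rel = [f / n for f in freq]

def proportion_below(x):
    """Approximate proportion of observations below x.

    Assumption: values are spread evenly inside each class, so a class
    that straddles x contributes a proportional share of its frequency.
    """
    total = 0.0
    for left, right, r in zip(edges[:-1], edges[1:], rel):
        if x >= right:
            total += r
        elif x > left:
            total += r * (x - left) / (right - left)
    return total

at_least_85 = 1 - proportion_below(85)
below_90 = proportion_below(90)
print(n, round(at_least_85, 4), round(below_90, 4))
```

When x falls on a class boundary the answer is exact; when it falls inside a class, as with x = 90, the result is only an approximation based on the even-spread assumption.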

Histogram Shapes
The histogram shape is called
(a) Unimodal if it has a single peak.
Note: The histogram seen earlier is unimodal.

[Figure: unimodal histogram; horizontal axis: flow rate (0 to 20); vertical axis: frequency (0 to 25)]

(b) Bimodal if it has 2 different peaks.

(c) Multimodal if it has more than 2 peaks.
Symmetric if it is unimodal and the right half is the mirror image of the left half.

[Figure: histogram; horizontal axis: IDT value (10 to 80); vertical axis: frequency (0 to 15)]

(d) Positively skewed if the right tail is stretched out compared with the left tail.
(e) Negatively skewed if the left tail is stretched out compared with the right tail.

Histogram for Qualitative/Categorical Data
(i) A histogram for categorical data is called a bar chart. There may be a natural ordering of the classes (Titanic data).
(ii) A Pareto diagram is a bar chart resulting from a quality control study, where the different categories correspond to different defects (non-conformities).
Example 8. Histogram for Titanic Data: The following table classifies 2201 people by the class in which they traveled:

Class:  First (F)  Second (S)  Third (T)  Crew (C)
Count:  325        285         706        885

[Figure: "Histogram for Titanic Data"; bar chart with categories F, S, T, C on the horizontal axis and counts from 0 to 1000 on the vertical axis]

Some Additional Examples

Example 1. Construct the stem-and-leaf display for the data on flexural strength of a certain concrete (in MPa units):
5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0, 8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7
(a) Is it spread about a representative value?
(b) Is it symmetric?
(c) Any outliers?
(d) What proportion of observations exceed 10 MPa?

Solution:
(a) Minitab generated the following stem-and-leaf display of this data:

Stem-and-leaf of C1   N = 27
Leaf Unit = 0.10

   1    5  9
   6    6  33588
 (11)   7  00234677889
  10    8  127
   7    9  077
   4   10  7
   3   11  368

The leftmost column shows the cumulative number of observations from each stem to the nearest tail of the data. For example, the 6 in the second row indicates that there are a total of 6 data points contained in stems 6 and 5. Minitab uses parentheses around 11 in row three to indicate that the median of the data is contained in this stem. A value close to 8 is representative of this data.

(b) The data display is not perfectly symmetric around some middle/representative value. There tends to be some positive skewness in this data.
(c) Outliers are data points that appear to be very different from the pack. Looking at the stem-and-leaf display in Part (a), there appear to be no outliers in this data. (A more precise definition of an outlier will be given later.)
(d) From the stem-and-leaf display in Part (a), there are 3 leaves associated with the stem of 11, which represent the 3 data values that are greater than or equal to 11. The value 10.7, which is represented by the stem of 10 and the leaf of 7, also exceeds 10. Therefore, the proportion of data values that exceed 10 is 4/27 = 0.148, or about 15%.

Example 2. The following data represent the IDTs (inter-division times) of a number of cells in both exposed (treatment) and unexposed (control) conditions:
28.1, 31.2, 13.7, 46.0, 25.8, 16.8, 34.8, 62.3, 28.0, 17.9, 19.5, 21.1, 31.9, 28.9, 60.1, 23.7, 18.6, 21.4, 26.6, 26.2, 32.0, 43.5, 17.4, 38.8, 30.6, 55.6, 25.5, 52.1, 21.0, 22.3, 15.5, 36.3, 19.1, 38.4, 72.8, 48.9, 21.4, 20.7, 57.3, 40.9
Construct a histogram of this data based on classes with boundaries 10, 20, 30, ... Then calculate $\log_{10}(x)$ for each x and construct the histogram of the transformed data using the class boundaries 1.1, 1.2, 1.3, etc. What is the effect of the transformation?

Solution. A histogram of the raw data appears below:

After transforming the data by taking logarithms (base 10), a histogram of the $\log_{10}$ data is shown above. The shape of this histogram is much less skewed than the histogram of the original data.
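Since the figures are not reproduced here, the class counts behind both histograms can be tabulated directly. This is a sketch under two assumptions: the fifth and sixth data values are taken as 25.8 and 16.8, and each class includes its left boundary but not its right one. The helper `class_counts` is hypothetical.

```python
import math

# IDT data from Example 2 (40 observations)
idt = [28.1, 31.2, 13.7, 46.0, 25.8, 16.8, 34.8, 62.3, 28.0, 17.9,
       19.5, 21.1, 31.9, 28.9, 60.1, 23.7, 18.6, 21.4, 26.6, 26.2,
       32.0, 43.5, 17.4, 38.8, 30.6, 55.6, 25.5, 52.1, 21.0, 22.3,
       15.5, 36.3, 19.1, 38.4, 72.8, 48.9, 21.4, 20.7, 57.3, 40.9]

def class_counts(values, start, width, k):
    """Count values in k equal-width classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for v in values:
        index = int((v - start) // width)
        counts[index] += 1
    return counts

raw_counts = class_counts(idt, 10, 10, 7)                              # boundaries 10, 20, ..., 80
log_counts = class_counts([math.log10(v) for v in idt], 1.1, 0.1, 8)   # boundaries 1.1, ..., 1.9

print("raw:", raw_counts)
print("log10:", log_counts)
```

The raw counts fall away slowly to the right of the peak (a long right tail), whereas the counts for the log-transformed data are more evenly balanced around the middle classes, which is the reduction in skewness described above.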

Numerical Summary of Measures

We now discuss some of the important characteristics of the data and of the population.

Measures of Location
First we discuss these for data and then for the population distribution.

The Mean
1. The Sample Mean: $\bar{x}$
The sample mean of n observations $x_1, \ldots, x_n$ is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n},$$
where n denotes the number of observations.
Example 1a. Suppose the scores of 8 students in a test are: 35, 20, 45, 50, 42, 38, 39, 11. Then the sample mean is $\bar{x}$ = 280/8 = 35.

Example 1b. Suppose the last score is recorded, by mistake, as 71. Then $\bar{x}$ = (269 + 71)/8 = 340/8 = 42.5. This is an increase of about 21% in the sample mean, which is significant.
Rule: Report the mean with one more decimal place than is present in the data. In the above example, the data are integers (no decimal places), and so we write $\bar{x}$ = 42.5 (one decimal place).
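The effect of the single mis-recorded score on the sample mean can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean

scores = [35, 20, 45, 50, 42, 38, 39, 11]   # Example 1a
scores_with_error = scores[:-1] + [71]      # Example 1b: last score recorded as 71

mean_ok = mean(scores)
mean_err = mean(scores_with_error)
relative_increase = (mean_err - mean_ok) / mean_ok

print(mean_ok, mean_err, round(100 * relative_increase, 1))
```

One changed value out of eight moves the mean by several points, which is why the mean is described as sensitive to outliers.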

2. The Median: $\tilde{x}$
This measure is less affected by outliers or extreme values. It divides the sample distribution into two equal parts.
Definition 2 (Sample median)
First order the observations as $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, from the smallest to the largest. Then the median is defined as
$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd}, \\ \left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right)/2, & \text{if } n \text{ is even}, \end{cases}$$
that is, the middle value if n is odd, and the average of the middle two values if n is even.

Example 2: The ordered values in Example 1a are:
11, 20, 35, 38, 39, 42, 45, 50
Here, n = 8 is even and n/2 = 4. Take the middle values: the 4th and 5th values. Hence, the median is $\tilde{x}$ = average of the middle two values = (38 + 39)/2 = 38.5.
Example 3: Find the median of Example 1b (one outlier case). Here, the ordered values are
20, 35, 38, 39, 42, 45, 50, 71
Again, $\tilde{x}$ = (39 + 42)/2 = 81/2 = 40.5.
Remark 1.
(i) The median is less affected by the outlier than the mean.
(ii) Also, this is an extreme case, as we replaced the smallest observation by one which is greater than the largest.
(iii) Decreasing the three smallest values or increasing the three largest values in Example 3 does not affect the median.
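The definition of the sample median translates directly into code. The following is a minimal sketch; `sample_median` is a hypothetical helper written to mirror Definition 2.

```python
def sample_median(values):
    """Sample median: middle value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)-th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the (n/2)-th and (n/2+1)-th

scores = [35, 20, 45, 50, 42, 38, 39, 11]         # Example 1a
scores_with_error = scores[:-1] + [71]            # Example 1b

median_ok = sample_median(scores)
median_err = sample_median(scores_with_error)
print(median_ok, median_err)
```

The standard library function `statistics.median` follows the same definition and could be used instead of the hand-written helper.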

3. The Trimmed Mean
(i) First order the observations from the smallest to the largest (ordered data).
(ii) Let $r \in (0, 0.5)$. Then the 100r% trimmed data are obtained by discarding the largest 100r% and the smallest 100r% of the data.
Definition: The 100r% trimmed sample mean is the sample mean of the 100r% trimmed data.

Example 4. Obtain the 12% trimmed mean of the data in Example 1a. The ordered data are
11, 20, 35, 38, 39, 42, 45, 50.
Here 100r = 12, so r = 12/100 = 0.12. Also n = 8, and 12% of 8 = (12/100) × 8 = 24/25 ≈ 1. Discarding the smallest and the largest observation, we get the 12.5% trimmed mean (since 1/8 = 12.5%):
(20 + 35 + 38 + 39 + 42 + 45)/6 = 219/6 = 36.5.
Remark 2. The trimmed mean is less sensitive to outliers than the mean, but more sensitive than the median.
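A trimmed mean can be sketched as below. One point the definition leaves open is what to do when 100r% of n is not a whole number; this sketch rounds to the nearest integer, as Example 4 does, which is an assumption and not the only possible convention. The function name `trimmed_mean` is hypothetical.

```python
def trimmed_mean(values, r):
    """100r% trimmed mean: drop about r*n values from each end, then average.

    Assumption: the number dropped from each end is r*n rounded to the
    nearest integer (Example 4 rounds 0.96 to 1).
    """
    if not 0 < r < 0.5:
        raise ValueError("r must lie strictly between 0 and 0.5")
    ordered = sorted(values)
    n = len(ordered)
    k = round(r * n)
    trimmed = ordered[k:n - k]
    if not trimmed:
        raise ValueError("trimming removed every observation")
    return sum(trimmed) / len(trimmed)

scores = [35, 20, 45, 50, 42, 38, 39, 11]   # Example 1a
result = trimmed_mean(scores, 0.12)
print(result)
```

With r = 0.12 and n = 8, one observation is dropped from each end, so the result is the same as the 12.5% trimmed mean.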

Measure of Variability

Let $x_1, \ldots, x_n$ be a sample of size n on a variable x.
Definition 3
(i) The Range: Arrange $x_1, \ldots, x_n$ as $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, where $x_{(1)}$ is the smallest value and $x_{(n)}$ is the largest value. Then the range is $R = x_{(n)} - x_{(1)}$. This is the simplest measure of variability.
Drawback: It depends only on $x_{(1)}$ and $x_{(n)}$.
(ii) The Sample Variance: The sample variance of $x_1, \ldots, x_n$ is defined by
$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{S_{xx}}{n-1},$$
and the sample standard deviation is $s = +\sqrt{s^2}$, the positive square root.

Facts:
(i) The unit of s is the same as that of the $x_i$'s.
(ii) $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ for any $x_1, \ldots, x_n$.
That is, if the deviations $(x_1 - \bar{x}), \ldots, (x_{n-1} - \bar{x})$ are known, then $(x_n - \bar{x})$ can be found. Thus, the n deviations actually contain only (n - 1) independent pieces of information (called degrees of freedom), and this suffices to find $s^2$ or s. Thus, $s^2$ and s are based on (n - 1) degrees of freedom.

A Useful Formula:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_i x_i^2 - \Big(\sum_i x_i\Big)^2 / n = \sum_i x_i^2 - n\bar{x}^2.$$
Hence,
$$s_x^2 = \frac{1}{n-1}\left[ \sum_i x_i^2 - \frac{1}{n}\Big(\sum_i x_i\Big)^2 \right].$$
A Proposition: Let $s_x^2$ be the variance of the data $x_1, \ldots, x_n$ and let $c \neq 0$.
(i) If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
(ii) If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c|\, s_x$.
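Both the shortcut formula and the proposition can be checked numerically. The sketch below reuses the eight test scores from Example 1a and an arbitrarily chosen constant c = -3 (an illustrative value, not from the lecture); a negative c is used so that the absolute value in part (ii) matters.

```python
import math
from statistics import variance, stdev

x = [35, 20, 45, 50, 42, 38, 39, 11]   # scores from Example 1a
n = len(x)
xbar = sum(x) / n

# S_xx by the definition and by the shortcut formula
s_xx_definition = sum((xi - xbar) ** 2 for xi in x)
s_xx_shortcut = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

c = -3                                  # arbitrary nonzero constant (assumption)
shifted = [xi + c for xi in x]
scaled = [c * xi for xi in x]

print(s_xx_definition, s_xx_shortcut)
print(variance(x), variance(shifted), variance(scaled))
print(stdev(x), stdev(scaled))
```

Adding a constant leaves the variance unchanged, while multiplying by c multiplies the variance by c squared and the standard deviation by the absolute value of c.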

Example 5. The following data represent the value of Young's modulus for certain cast plates: 116.4, 115.9, 114.6, 115.2, 115.8.
(a) Find $\bar{x}$ and the deviations $(x_i - \bar{x})$.
(b) Using the deviations $(x_i - \bar{x})$, compute $s^2$.
(c) Calculate $s^2$ using the computational formula for $S_{xx}$.

Solution:
(a) $\bar{x} = \frac{1}{n}\sum_i x_i$ = 577.9/5 = 115.58. Deviations from the mean: 116.4 - 115.58 = 0.82, 115.9 - 115.58 = 0.32, 114.6 - 115.58 = -0.98, 115.2 - 115.58 = -0.38, and 115.8 - 115.58 = 0.22.
(b) $s^2$ = [(0.82)^2 + (0.32)^2 + (-0.98)^2 + (-0.38)^2 + (0.22)^2]/(5 - 1) = 1.928/4 = 0.482. Hence, $s = \sqrt{0.482} \approx 0.694$.
(c) $\sum_i x_i^2$ = 66,795.61, so
$$s^2 = \frac{1}{n-1}\left[ \sum_i x_i^2 - \frac{1}{n}\Big(\sum_i x_i\Big)^2 \right] = [66795.61 - (577.9)^2/5]/4 = 1.928/4 = 0.482.$$
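The two routes to the sample variance in Example 5 can be compared with a short sketch. Only the standard library is used, and the variable names are illustrative.

```python
import math

x = [116.4, 115.9, 114.6, 115.2, 115.8]   # Young's modulus data, Example 5
n = len(x)
xbar = sum(x) / n

deviations = [xi - xbar for xi in x]

# Route 1: sum of squared deviations
s_xx_dev = sum(d ** 2 for d in deviations)

# Route 2: computational (shortcut) formula
s_xx_short = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

s2 = s_xx_dev / (n - 1)
s = math.sqrt(s2)

print(round(xbar, 2), [round(d, 2) for d in deviations])
print(round(s_xx_dev, 3), round(s_xx_short, 3), round(s2, 3), round(s, 3))
```

The shortcut formula subtracts two large, nearly equal numbers, so in floating-point arithmetic it can lose accuracy on data with a large mean and small spread; the deviation-based route is numerically safer.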

Box Plot
The quartiles and percentiles yield more information about the location of a data set. Similarly, the median and the IQR (interquartile range) are used to construct the box plot, a visual summary of the data.
Quartiles and IQR
Let $x_1, \ldots, x_n$ denote the data set of size n. First order the observations.
(i) Compute the median $\tilde{x}$.
(ii) If n is even, the first n/2 observations form the lower half and the remaining n/2 observations form the upper half (the median separates the data into two parts).
(iii) If n is odd, the median $\tilde{x}$ is the ((n+1)/2)th value of the ordered data and is included in both halves.

The Quartiles:
(i) The lower quartile $Q_1$ = median of the lower half of the data.
(ii) The upper quartile $Q_3$ = median of the upper half of the data.
(iii) The interquartile range IQR = $Q_3 - Q_1$.
Note: The IQR is also called the fourth spread, $f_s = Q_3 - Q_1$ = upper fourth - lower fourth, and is resistant to outliers.

Example 1. Consider the following data: 5.2, 3.9, 4.8, 5.1, 3.7, 4.5, 4.2
Here, n = 7. Ordered data: 3.7, 3.9, 4.2, 4.5, 4.8, 5.1, 5.2. The median = 4.5.
Since n is odd, include the median in both the lower half and the upper half of the data.
Lower half: 3.7, 3.9, 4.2, 4.5
Upper half: 4.5, 4.8, 5.1, 5.2
$Q_1$ = (3.9 + 4.2)/2 = 8.1/2 = 4.05
$Q_3$ = (4.8 + 5.1)/2 = 9.9/2 = 4.95
Hence, IQR = 4.95 - 4.05 = 0.9.
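The quartile rule used in this lecture (medians of the lower and upper halves, with the median included in both halves when n is odd) can be sketched as follows. The function names are hypothetical helpers.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the lower-half / upper-half rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2            # for odd n the median belongs to both halves
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)

data = [5.2, 3.9, 4.8, 5.1, 3.7, 4.5, 4.2]
q1, med, q3 = quartiles(data)
iqr = q3 - q1
print(round(q1, 2), med, round(q3, 2), round(iqr, 2))
```

Library routines such as `statistics.quantiles` use other quartile conventions, so their results may differ slightly from the values obtained with this rule.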

IQR Criteria for an Outlier:
An observation that lies above $Q_3 + 1.5\,\mathrm{IQR}$ or below $Q_1 - 1.5\,\mathrm{IQR}$ may be suspected to be an outlier. An outlier is called extreme if it lies outside $(Q_1 - 3\,\mathrm{IQR},\ Q_3 + 3\,\mathrm{IQR})$; otherwise, it is called a mild outlier.
Boxplot: A box plot is a visual display of the 5-number summary $(x_{(1)}, Q_1, \tilde{x}, Q_3, x_{(n)})$.

Procedure:
(i) The middle box shows $Q_1$, the median and $Q_3$.
(ii) The whiskers extend above $Q_3$ or below $Q_1$ up to $Q_3 + 3\,\mathrm{IQR}$ or $Q_1 - 3\,\mathrm{IQR}$, respectively.
(iii) The outliers are denoted by special symbols.

Remark 3. The box plot has the following properties:
(i) More compact than a stem plot or histogram.
(ii) The central box contains roughly 50% of the data.
(iii) Does not reveal the presence of clusters.
(iv) Very useful for comparing (similarities and differences) data sets on the same scale.
(v) Height of the box = IQR.
(vi) If the median is roughly in the middle of the box, then the distribution is symmetric; otherwise it is skewed.
(vii) Whiskers show skewness if they are not of the same length.
(viii) Useful for detecting outliers.
The main use of box plots is to compare groups.

Example 3. The following data denote the shear strength (MPa) of a joint bonded in a particular manner.
22.2, 40.4, 16.4, 73.7, 36.6, 109.9, 30.0, 4.4, 33.1, 66.7, 81.5
(a) What are the values of the quartiles, and the value of the IQR?
(b) Construct a box plot based on the five-number summary, and comment on its features.
(c) How large or small does an observation have to be to qualify as an outlier? As an extreme outlier?
(d) By how much could the largest observation be decreased without affecting the IQR?

Solution:
(a) The lower half of the data set is 4.4, 16.4, 22.2, 30.0, 33.1, 36.6, and therefore the lower quartile is (22.2 + 30.0)/2 = 26.1. The upper half of the data set is 36.6, 40.4, 66.7, 73.7, 81.5, 109.9, and therefore the upper quartile is (66.7 + 73.7)/2 = 70.2. So, the IQR = 70.2 - 26.1 = 44.1.
(b) A boxplot (created in Minitab) of this data appears below. There is a slight positive skew to the data. The variation seems quite large. There are no outliers.

[Figure: boxplot; axis: shear strength, 0 to 100]

(c) An observation would need to be more than 1.5(44.1) = 66.15 units below the lower quartile or above the upper quartile to be classified as a mild outlier. Notice that, in this case, an outlier on the lower side would not be possible, since the shear strength variable cannot have a negative value. An extreme outlier would fall 3(44.1) = 132.3 or more units below the lower quartile or above the upper quartile. Since the minimum and maximum observations in the data are 4.4 and 109.9 respectively, there are no outliers, of either type, in this data set.
(d) Not until the value x = 109.9 is lowered below 73.7 would there be any change in the value of the upper quartile. That is, the value x = 109.9 could not be decreased by more than (109.9 - 73.7) = 36.2 units.
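The quartiles and outlier limits for the shear strength data can be verified with a short sketch. It assumes the same lower-half / upper-half quartile rule used in the solution; the helper names and the dictionary keys "mild" and "extreme" are illustrative choices.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); the median is in both halves when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    q1 = median_of_sorted(ordered[:half])
    q3 = median_of_sorted(ordered[n - half:])
    return ordered[0], q1, median_of_sorted(ordered), q3, ordered[-1]

def outlier_fences(q1, q3):
    """Limits beyond which an observation counts as a mild or an extreme outlier."""
    iqr = q3 - q1
    return {"mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
            "extreme": (q1 - 3 * iqr, q3 + 3 * iqr)}

shear = [22.2, 40.4, 16.4, 73.7, 36.6, 109.9, 30.0, 4.4, 33.1, 66.7, 81.5]
low, q1, med, q3, high = five_number_summary(shear)
fences = outlier_fences(q1, q3)
mild_lo, mild_hi = fences["mild"]
outliers = [v for v in shear if v < mild_lo or v > mild_hi]

print(low, round(q1, 2), med, round(q3, 2), high)
print(fences, outliers)
```

The lower limits come out negative, which matches the remark that no low outlier is possible for a strength measurement, and the empty outlier list agrees with the conclusion in part (c).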

Homework

Sect 1.2: 11, 16, 19, 26, 27, 29
Sect 1.3: 35, 36, 41, 43
Sect 1.4: 45, 51, 54, 57, 79

END OF LECTURE 1