Descriptive Statistics: Summarizing a Single Variable
P. Hammett
Reference Material:
- Prob-stats-review.doc (see Sections 1 & 2)
- Lecture Exercise: desc-stats.xls

Topics
I. Discrete and Continuous Measurements
II. Samples Versus Population
III. Types of Descriptive Statistics
   A. Location: Mean, Median, Mean Bias
   B. Dispersion: Range, Standard Deviation, Variance
IV. Computing statistics using software

I. Discrete Vs. Continuous Variables
- Discrete variables vary by whole units. Examples: # of students in a class, # of errors in a report, sum of rolling two dice.
- Continuous variables vary to any degree, limited only by the precision of the measurement system. Examples: height of students in a class, length of an object, time to complete a task.
Precision of Measurement System Concept: continuous variables may always be broken down further with greater measurement precision. For example, a time could be recorded as 10 sec, 10.0 sec, 10.01 sec, or 10.008 sec.
Note: all variable measurements have units!

Attributes (Categorical Data) Vs. Variables
- For attributes (e.g., defective / not defective), we typically use counts and percentages to communicate (e.g., 30% defective).
- For discrete or continuous variables, we typically use descriptive statistics to communicate or summarize (e.g., the average time to process a loan is 41 days). Note: the average is a descriptive statistic.
In using descriptive statistics, we must recognize whether we are summarizing a population or a sample (of some sample size).

II. Samples Vs. Population
When describing a variable, we often collect a sample of data from a population.
- Population: all items in a set (obtained via a census). Describe populations using parameters such as the population mean (µ) or standard deviation (σ).
- Sample: a subset of the population. Estimate parameters using statistics such as the sample mean (X-bar).
Example: suppose you close 8000 loans. You might measure a sample of 100 from among these 8000 (the population) to assess whether you are meeting your requirements for time to process a loan.

Population Example (all possible outputs are known)
What is the population for all possible combinations of the sum of rolling two dice?

Sum    Frequency   Combinations
2      1           (1,1)
3      2           (1,2) (2,1)
4      3           (1,3) (3,1) (2,2)
5      4           (1,4) (4,1) (2,3) (3,2)
6      5           (1,5) (5,1) (2,4) (4,2) (3,3)
7      6           (1,6) (6,1) (2,5) (5,2) (3,4) (4,3)
8      5           (2,6) (6,2) (3,5) (5,3) (4,4)
9      4           (3,6) (6,3) (4,5) (5,4)
10     3           (4,6) (6,4) (5,5)
11     2           (5,6) (6,5)
12     1           (6,6)
Total  36

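The frequency table for the two-dice population can be reproduced by enumerating all 36 ordered outcomes. The following is a minimal Python sketch; the variable names are illustrative and not part of the lecture material.

```python
from collections import Counter
from itertools import product

# Enumerate every ordered outcome of two six-sided dice (6 x 6 = 36).
faces = range(1, 7)
population = Counter(a + b for a, b in product(faces, repeat=2))

# Print each possible sum with its frequency in the population.
for total in sorted(population):
    print(total, population[total])

# The frequencies add up to the 36 possible combinations.
print("Total", sum(population.values()))
```

Because every outcome is known in advance, this is a true population rather than a sample.
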
Understanding Samples and Populations
Suppose you roll two dice 10 times (10 samples) and observe the following sum combinations. Is this sample representative of the population?
[Figure: histogram of the 10 observed sums. x-axis: Sum of Rolling Pair of Dice (2 to 12); y-axis: Frequency]

Understanding Samples and Populations
Suppose you roll 50 samples. Have you observed every possible value? Now, is this sample representative?
[Figure: Sum of Rolling a Pair of Dice, histogram of 50 rolls. x-axis: Roll (sums 2 to 12); y-axis: Frequency]

Understanding Samples and Populations
As you increase the sample size, you will eventually obtain a representative sample of a population. The challenge is determining how many samples are needed to be representative!
[Figure: histogram for n = 3000 samples. x-axis: Sum of Rolling Pair of Dice (2 to 12); y-axis: Frequency]

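The effect of sample size on representativeness can be explored with a small simulation. This is only a sketch: the helper name `roll_sums` and the seed value are assumptions made for illustration, and the results will differ from the charts in the slides because the rolls are random.

```python
import random
from collections import Counter

def roll_sums(n, rng):
    """Return a Counter of the sums observed in n rolls of two dice."""
    return Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n))

rng = random.Random(1)  # fixed seed (arbitrary choice) so runs repeat
for n in (10, 50, 3000):
    counts = roll_sums(n, rng)
    # Relative frequency of each sum; in the population, 7 occurs 6/36 of the time.
    rel = {s: counts[s] / n for s in range(2, 13)}
    print(n, "distinct sums seen:", len(counts), "share of 7s:", round(rel[7], 3))
```

With small samples, some sums may never appear; with a large sample, the relative frequencies tend toward the population values.
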
Population Example: Continuous Data (all possible combinations are unknown)
Usually, all possible combinations are not known. Suppose you monitor the time to complete orders (min), but you do not keep track of every order. Instead, you take samples.

Samples from Continuous Populations
Suppose you take a set of 3 samples from the population with the following order times (min): 1219.1, 1220.1, 1220.5.
[Figure: the three sample values 1219.1, 1220.1, and 1220.5]

Samples from Continuous Populations
If you take another set of 3 samples from this population, you will likely get a different set of values.
[Figure: Sample Set 1 and Sample Set 2; labeled values 1219.5, 1220.25, 1218.5]

Samples from Continuous Populations
As the total number of samples becomes large, the values will likely converge or form a pattern (if the population does NOT change). This pattern is known as the Underlying Distribution.
[Figure: the underlying distribution shown is a Normal Distribution]

Sample Size and Population Representation (Variable Data)
Determining the number of samples needed to identify the underlying distribution is an advanced skill and requires several assumptions.
For a normal distribution, confidence in estimating the distribution variance jumps significantly from ~10 to ~30 samples, begins to level off around ~100, and usually more than 300 is unnecessary.
[Figure: Variance Lower 95% Confidence Interval (true variance = 1), with N = 10, 30, 100, and 300 marked. x-axis: Sample Size (0 to 450); y-axis: 0.00 to 1.00]

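The diminishing return from larger samples can be illustrated by simulation. Note that this sketch is not the calculation behind the chart in the slide: instead of a confidence bound, it looks at how widely the sample variance itself scatters when samples of size n are drawn from a normal population with true variance 1. The function name, seed, and number of repetitions are assumptions chosen for illustration.

```python
import random
import statistics

def variance_spread(n, reps, rng):
    """Return the 2.5th and 97.5th percentiles of the sample variance
    over `reps` samples of size n drawn from a standard normal
    distribution (true variance = 1)."""
    estimates = sorted(
        statistics.variance([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    )
    low = estimates[int(0.025 * reps)]
    high = estimates[int(0.975 * reps) - 1]
    return low, high

rng = random.Random(0)  # fixed seed (arbitrary choice) so runs repeat
for n in (10, 30, 100, 300):
    low, high = variance_spread(n, 1000, rng)
    print(f"n={n:3d}: middle 95% of variance estimates from {low:.2f} to {high:.2f}")
```

The interval should narrow quickly between n = 10 and n = 30 and much more slowly after n = 100, which matches the qualitative message of the slide.
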
Key Sampling Concepts
- You don't need to measure every observation to understand a population.
- Knowledge of a population increases with the number of samples, BUT eventually the value of additional information diminishes.
- The notion that we may understand populations by only measuring samples drives the field of statistics.

Fields of Statistics
- Descriptive Statistics: summarize or describe important features in a data set without attempting to infer conclusions. Describe data samples using items such as X-bar (sample mean) and S (sample standard deviation). These statistics are used to estimate the population mean (µ) and population standard deviation (σ).
- Inferential Statistics: use a sample of data to draw conclusions (make inferences). Example: suppose you compare order times from 2 processes. Process A averages 12.10 min and B averages 12.22 min. We may use inferential statistics to assess whether the two processes have significantly different averages.

III. Descriptive Statistics
The most commonly used descriptive statistics measure either location or dispersion.
- Location Statistics. Examples: Mean, Median, Mean Bias
- Dispersion Statistics. Examples: Range, Standard Deviation, Variance

Location and Dispersion
- Location ~ central tendency
- Dispersion ~ spread of distribution
Classic example to demonstrate these concepts: playing darts.
- On or Off Location
- Low or High Dispersion

Lecture Exercise: Identify On/Off Target & High/Low Dispersion for each
[Figure: four panels labeled A, B, C, and D]

Location and Dispersion
The four combinations illustrated:
- High Dispersion, Off Location
- High Dispersion, On Location
- Low Dispersion, Off Location
- Low Dispersion, On Location

Quality Problem Solving: A General Approach
- Address problems in order of importance. Prioritize features that have a strong cause-effect relationship with customer satisfaction.
- In addressing problems, typically first try to reduce variation, then shift the mean as necessary to meet end-customer needs.
  1. Stabilize process
  2. Center process as necessary

Executing Quality Problem Solving Approach
In solving quality problems, we need useful estimates of location (e.g., mean) and dispersion (e.g., variation).

A. Measures of Location
- Mean
- Median
- Mean Bias

Mean
The mean (also known as the average) is a measure of the center of a distribution. The typical notation for the mean of a sample of data is X-bar; the Greek letter µ represents the mean of a population.
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
Example: suppose five students take a test and their scores are 70, 68, 71, 69, and 98.
Mean = (70 + 68 + 71 + 69 + 98) / 5 = 75.2
Excel: =average(array)

Median
The median (also known as the 50th percentile) is the middle observation in a data set. Rank the data set and select the middle value.
- If there is an odd number of observations, the middle value is observation (N + 1) / 2.
- If there is an even number of observations, the middle value is interpolated as midway between observation numbers N / 2 and (N / 2) + 1.
Prior data values, ranked: 68, 69, 70, 71, and 98. The median is 70.
If another student with a score of 60 were included, the new median would be 69.5 = (69 + 70) / 2.
Excel: =median(array)

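The rank rule for the median translates directly into code. A minimal Python sketch follows; the function name `median_by_rank` is illustrative, and the result is cross-checked against the standard library's `statistics.median`.

```python
import statistics

def median_by_rank(values):
    """Median using the rank rule (1-based observation numbers)."""
    ranked = sorted(values)
    n = len(ranked)
    if n % 2 == 1:
        # Odd count: observation number (N + 1) / 2.
        return ranked[(n + 1) // 2 - 1]
    # Even count: midway between observations N / 2 and (N / 2) + 1.
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2

scores = [70, 68, 71, 69, 98]
print(median_by_rank(scores))          # 70
print(median_by_rank(scores + [60]))   # 69.5

# Cross-check against the standard library.
assert median_by_rank(scores) == statistics.median(scores)
```
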
Mean Vs. Median
Which is a better measure of location for the following set of test scores? 68, 70, 69, 71, and 98
- Mean = 75.2
- Median = 70.0

Be careful with the mean if extreme values are present (e.g., the high score of 98): a single extreme value pulls the mean (75.2) away from the typical score, while the median (70.0) is barely affected.

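The sensitivity of the mean to one extreme score can be checked by computing both statistics with and without the score of 98. This is a small sketch using Python's standard `statistics` module; the list name `without_high` is illustrative.

```python
from statistics import mean, median

scores = [68, 70, 69, 71, 98]
without_high = [s for s in scores if s != 98]

# With the extreme score, the mean and median disagree noticeably.
print(mean(scores), median(scores))               # 75.2 70
# Without it, the two measures of location agree.
print(mean(without_high), median(without_high))   # 69.5 69.5
```
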
Mean Bias
Mean bias is the absolute deviation of the mean from a target or nominal value.
$$\text{Mean Bias} = |\text{Mean} - \text{Target}|$$
Example: if the average order time = 1219.7 min and the target = 1220 min, then
Mean Bias = |1219.7 - 1220| = 0.3 min
Note: Mean Bias is non-directional. For instance, in the above example, if the mean = 1220.3, the bias would also be 0.3 min.

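The mean bias calculation is a one-line function. A minimal sketch, with an illustrative function name, showing that the result is the same whether the mean falls below or above the target:

```python
def mean_bias(mean_value, target):
    """Absolute deviation of the mean from the target (non-directional)."""
    return abs(mean_value - target)

# Rounded to one decimal to hide floating-point noise.
print(round(mean_bias(1219.7, 1220.0), 1))  # 0.3
print(round(mean_bias(1220.3, 1220.0), 1))  # 0.3
```
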
B. Measures of Dispersion
- Range
- Standard Deviation
- Variance

Range
The range is the maximum value in a data set minus the minimum value.
Example: test scores 70, 68, 71, 69, and 98. Range = 98 - 68 = 30.
Note: the range is often preferred over the standard deviation for small data sets (e.g., fewer than 10 observations in a sample data set).

Standard Deviation
The standard deviation (StDev, sigma, S) measures the dispersion of the individual observations from the mean. For a sample data set, the standard deviation is also referred to as the sample standard deviation or the root-mean-square, $S_{rms}$.
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \quad \text{or} \quad S = \sqrt{\frac{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n - 1)}}$$

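The definitional form of the sample standard deviation and the computational (sum-of-squares) form give the same result. The sketch below checks this on the test scores used earlier in the lecture; the function names are illustrative, and `statistics.stdev` from the Python standard library is included as a reference.

```python
import math
import statistics

def stdev_definition(x):
    """S = sqrt( sum((x_i - mean)^2) / (n - 1) )."""
    n = len(x)
    m = sum(x) / n
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))

def stdev_computational(x):
    """S = sqrt( (n * sum(x_i^2) - (sum x_i)^2) / (n * (n - 1)) )."""
    n = len(x)
    return math.sqrt((n * sum(xi * xi for xi in x) - sum(x) ** 2) / (n * (n - 1)))

scores = [70, 68, 71, 69, 98]
print(round(stdev_definition(scores), 2))      # 12.79
print(round(stdev_computational(scores), 2))   # 12.79
print(round(statistics.stdev(scores), 2))      # 12.79
```
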
Standard Deviation (Sigma)
The standard deviation is very useful in describing the variation about the mean if the data are normally distributed.
[Figure: normal curve with the horizontal axis marked at -3σ, -2σ, -1σ, +1σ, +2σ, +3σ]
- +/- 1σ = 68.27%
- +/- 2σ = 95.45%
- +/- 3σ = 99.73%
For a normally distributed variable, we expect 99.73% of all values to fall within +/- 3 standard deviations of the mean.

Order Time Example (1000 Measurements)
If the mean = 1220 min and the standard deviation = 0.5 min, then 99.73% of all values are expected to fall between 1218.5 and 1221.5 min (+/- 3σ).

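The coverage percentages for a normal distribution, and the +/- 3σ limits for the order-time example, can be verified with `NormalDist` from Python's standard `statistics` module. This is a sketch under the assumption that order times are exactly normally distributed; the helper name `coverage` is illustrative.

```python
from statistics import NormalDist

# Mean and standard deviation taken from the order-time example (min).
order_time = NormalDist(mu=1220.0, sigma=0.5)

def coverage(k):
    """Share of a normal population within +/- k standard deviations."""
    low = order_time.mean - k * order_time.stdev
    high = order_time.mean + k * order_time.stdev
    return order_time.cdf(high) - order_time.cdf(low)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))   # 0.6827, 0.9545, 0.9973 in turn

low_3s = order_time.mean - 3 * order_time.stdev
high_3s = order_time.mean + 3 * order_time.stdev
print(low_3s, high_3s)                # 1218.5 1221.5
```
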
Effects of Extreme Values
- For test scores 70, 68, 71, 69, and 98, the sample standard deviation is 12.79.
- If you exclude the score of 98, the sample standard deviation is reduced to 1.3!
The standard deviation may be severely influenced by extreme values in a sample data set (note: they may not necessarily be outliers). We may reduce the effects of individual observations by increasing the sample size.

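The drop in standard deviation when the extreme score is excluded can be confirmed in a few lines. A minimal sketch using `statistics.stdev` (the sample formula with n - 1); the list name `trimmed` is illustrative.

```python
from statistics import stdev

scores = [70, 68, 71, 69, 98]
trimmed = [s for s in scores if s != 98]

print(round(stdev(scores), 2))    # 12.79
print(round(stdev(trimmed), 2))   # 1.29
```
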
Variance
The variance is the square of the standard deviation. It represents the average squared deviation of each observation from the sample mean (using n - 1 in the denominator).
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Prior example, where the standard deviation = 12.79 (12.7945 before rounding): Variance = (12.7945)^2 = 163.70

Variance: Additive Property
Variance is often used instead of standard deviation because of its additive property when combining multiple sources of variation.
Suppose you have process times from two independent processes, A and B, that combine into an overall process AB:
$$\sigma_{AB}^2 = \sigma_A^2 + \sigma_B^2$$
Not true: $\sigma_{AB} = \sigma_A + \sigma_B$

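A quick numeric check makes the additive property concrete. The standard deviations below (3 and 4 min) are hypothetical values chosen for illustration, not data from the lecture, and the calculation assumes the two processes are independent.

```python
import math

sigma_a = 3.0  # hypothetical standard deviation of process A (min)
sigma_b = 4.0  # hypothetical standard deviation of process B (min)

# Variances add for independent processes; standard deviations do not.
var_ab = sigma_a ** 2 + sigma_b ** 2
sigma_ab = math.sqrt(var_ab)

print(sigma_ab)            # 5.0
print(sigma_a + sigma_b)   # 7.0, which overstates the combined spread
```
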
IV. Using Software to Calculate Descriptive Statistics
In practice, we rarely calculate statistics by hand. So, let us explore some useful Excel functions.

Statistic    Excel function
Count (N)    =count(array)
Mean         =average(array)
Median       =median(array)
Std Dev      =stdev(array)
Variance     =var(array)
Range        =max(array)-min(array)

Or, we may use QETools for calculations.

Example Using Excel
Given our prior test scores in cells B2:B6, we can compute the mean (average) by using the formula =average(B2:B6).

     A            B
1    Observation  Score
2    1            68
3    2            70
4    3            69
5    4            71
6    5            98
7    Average      75.2    (formula: =Average(B2:B6))

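For readers working outside a spreadsheet, the same statistics can be computed with Python's standard `statistics` module. This is offered as an equivalent sketch, not as part of the lecture's Excel workflow; like Excel's stdev and var functions, `statistics.stdev` and `statistics.variance` use the sample formulas with n - 1 in the denominator.

```python
import statistics

scores = [68, 70, 69, 71, 98]  # the values in cells B2:B6 of the example

summary = {
    "Count (N)": len(scores),
    "Mean": statistics.mean(scores),
    "Median": statistics.median(scores),
    "Std Dev": statistics.stdev(scores),       # sample formula (n - 1)
    "Variance": statistics.variance(scores),   # sample formula (n - 1)
    "Range": max(scores) - min(scores),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```
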
Lecture Exercise: Compute Descriptive Statistics
Given the Excel file desc-stats.xls, compute statistics for the test scores: Count (N), Mean, Median, Range, Std Dev, Variance.

Or, Use QE Tools
Excel file: desc-stats.xls
Sample results from QETools:

Score      Sample
N          16
Mean       82.78
Median     83.50
StDev      9.17
Variance   84.07
Min        63.00
Max        95.00
Range      32.00