Topic 2 - Descriptive Statistics
STAT 511
Professor Bruce Craig

Background Reading
- Devore: Sections 1.2-1.4

Types of Information
Variables classified as
- Categorical (qualitative) - variable classifies individual into one of several groups or categories
  - Ordinal - Letter grade, Fitness level
  - Not ordinal - Eye color, Types of damage
- Quantitative - variable takes on numerical values for which arithmetic operations make sense
  - Continuous - Height, Serum creatinine level
  - Discrete - Number of defects or successes
Can often convert ordinal to quantitative and vice versa

Examples
- Will denote variables by upper case letters (X)
- Will denote observations by lower case letters (x)
- The number of accidents on Interstate 65 in the month of December was 10
  - Individual = Interstate 65
  - Variable Y = the number of accidents in Dec
  - Observation y = 10
- The following table (WSJ - 1997) summarizes a week's beer advertising on TV

  Advertiser    Show (Network)              Date (Time)          % Viewers < 21
  Coors Light   Hit List (BET)              Sept 2 (8:00 pm)     51
  Molson        Singled Out (MTV)           Sept 2 (7:00 pm)     52
  Molson Ice    Beavis and Butthead (MTV)   Sept 2 (11:30 pm)    48
  Foster's      Singled Out (MTV)           Sept 3 (11:00 pm)    46
  Molson        Real World (MTV)            Sept 3 (8:30 pm)     45
  Foster's      Melrose Place (E!)          Sept 2 (7:00 pm)     41
  Miller        Unreal (BET)                Sept 5 (8:00 pm)     65
  Schlitz       Yo MTV (MTV)                Sept 5 (10:00 pm)    50
  Molson        Beavis and Butthead (MTV)   Sept 6 (10:30 pm)    69
  Budweiser     Video Music Awards (MTV)    Sept 7 (8:30 pm)     46

- Describe variables from these examples

Sample and Sample Size
- If a data set is collected under identical conditions, or can be considered to be drawn from a population, the data set is called a sample
- The sample size is the number of individuals in the sample, usually denoted by n
- Example: Twenty-five rolls of a fair die
  - Individual - Roll of die
  - Variable - Face value of die
  - Observation - 1, 2, 3, 4, 5, or 6
  - Sample size - n = 25
Frequency Distributions
- To understand data set, must first be able to explore and summarize the information
- Frequency distribution of variable describes possible values of variable and how frequently each value is an observation
- Distribution can be summarized in tabular or graphical form

Graphical summaries
- Provide visual display of distribution
- Allows one to examine shape of data
- Allows one to compare data sets
- Can check assumptions of statistical tests
- Easier to read than table or text summary

Graphs for a Categorical Variable
- Displays categories and frequencies (counts)
- Bar Chart (vertical)
  - Categories listed along horizontal axis
  - Bar extends vertically to represent count
- Pareto Chart
  - Bar chart with categories ordered from most frequent to least frequent
- Pie Chart
  - Each pie section (wedge) represents count for specific category

Example
Complete the following table and construct graphs

  Type of Books Purchased at Bookstore
  Type          Count   Percent
  Textbooks     1200
  Non-Fiction   100
  Fiction
  Children's    100
  Total         2000
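One way to complete a frequency table like the bookstore example is sketched below. It assumes that the numbers shown in the table are counts and that the missing Fiction count can be inferred from the total; both are assumptions about the exercise, not something the slide states. The variable names are illustrative only.

```python
# Counts from the bookstore table. The Fiction count is not given in the
# table, so it is inferred from the total (an assumption about the exercise).
total = 2000
counts = {"Textbooks": 1200, "Non-Fiction": 100, "Children's": 100}
counts["Fiction"] = total - sum(counts.values())

# Percent for each category = 100 * count / total.
percents = {kind: 100 * count / total for kind, count in counts.items()}

# Pareto ordering: categories sorted from most frequent to least frequent.
pareto_order = sorted(counts, key=counts.get, reverse=True)

print(f"{'Type':<12} {'Count':>5} {'Percent':>8}")
for kind in pareto_order:
    print(f"{kind:<12} {counts[kind]:>5} {percents[kind]:>7.1f}%")
```

The same `counts` and `pareto_order` values could be passed to a plotting library to draw the bar, Pareto, and pie charts; plotting is left out here to keep the sketch dependency-free.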
Graphs for Quantitative Variable
- Could group obs into ordered categories
- Categories based on
  - Cut-points of interest
  - Scale of the data
- Categories often called classes
- Graph called a histogram
- Other descriptive graphs include
  - Dot plot
  - Stem and Leaf diagram

Constructing a Histogram
1 Need to first specify non-overlapping classes
  - Want all observations to appear in a class
  - Can specify number of classes or class width
  - Classes usually of equal size or width
  - Range = Largest obs - Smallest obs
  - Number of classes = Range / Class width
  - Number of classes $\approx \sqrt{n}$
  - Will specify class by $[a, b)$, that is, $a \le x < b$
2 Count the number of obs in each class

Generating the Histogram
Types of histograms
- Frequency
  - Height of bar = # of occurrences in $[a, b)$
- Relative Frequency or Percent
  - Height of bar = frequency / n
- Cumulative Frequency
  - Height of bar = # of occurrences $< b$
- Density Histogram
  - Area of bar = relative frequency
  - Height of bar = relative frequency / class width
  - Appropriate for unequal class widths
  - Allows comparison with specific distributions

Example
Collect daily number of bike accidents requiring urgent care over a three-week period

  Week  Mon  Tue  Wed  Thu  Fri  Sat  Sun
  1     4    3    2    5    6    8    2
  2     4    1    5    3    2    7    4
  3     5    2    3    3    1    4    7

- Decide the class width to be 2 accidents
- Smallest obs = 1 and Largest obs = 8
- (8 - 1)/2 = 3.5 so we need 4 classes

  Class   Freq   Rel. Freq   Cum. Freq   Density
  1-3     6      .286
  3-5     8      .381
  5-7     4      .190
  7-9     3      .143
  Total   21     100%
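The class table for the bike accident data can be filled in with a few lines of code. The sketch below follows the slide's conventions (classes of the form [a, b), class width 2, first class starting at the smallest observation); the variable names and the rule used to count classes are illustrative choices, not part of the slides.

```python
# Daily bike accident counts, three weeks (Mon..Sun), from the example table.
data = [4, 3, 2, 5, 6, 8, 2,
        4, 1, 5, 3, 2, 7, 4,
        5, 2, 3, 3, 1, 4, 7]

n = len(data)
width = 2                      # class width chosen in the example
start = min(data)              # first class starts at the smallest obs

# Enough classes of the form [a, b) to cover the largest observation.
num_classes = (max(data) - start) // width + 1

rows = []
cumulative = 0
for k in range(num_classes):
    a = start + k * width
    b = a + width
    freq = sum(1 for x in data if a <= x < b)   # count obs in [a, b)
    cumulative += freq
    rel_freq = freq / n
    density = rel_freq / width                  # bar area = relative frequency
    rows.append((a, b, freq, rel_freq, cumulative, density))

print("Class   Freq  Rel.Freq  Cum.Freq  Density")
for a, b, freq, rel_freq, cum, density in rows:
    print(f"{a}-{b:<5} {freq:>4}  {rel_freq:>8.3f}  {cum:>8}  {density:>7.3f}")
```

Because all classes here have the same width, the density histogram has the same shape as the relative frequency histogram; the two differ in shape only when class widths are unequal.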
Stem-and-Leaf Display
- Histogram which retains data values
- Breaks each obs into stem and leaf
  - Stem - ten's digit & Leaf - one's digit
  - Stem - one's digit & Leaf - tenth's digit
- Can break stem down more if desired
- Bike Accident Example

  Stem  Leaves
  0     11222233334444
  0     5556778

Stem-and-Leaf Display
Can easily identify
- Typical value
- Gaps in the data
- Number and locations of peaks
- Presence of outlying values
- Extent of symmetry

Shape of Graphical Summary
- Appropriate when variable ordered (quantitative)
- If only one mode or peak, defined as unimodal
- If two peaks, defined as bimodal
- If similar on each side of middle, termed symmetric
- Skewed - one tail stretched out more than other

Examples
[Figure: four example plots illustrating distribution shapes; not recoverable from the extracted text]
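A stem-and-leaf display is simple enough to build by hand in code. The following is a minimal sketch, assuming non-negative integer data with the ten's digit as stem and the one's digit as leaf; the function name `stem_and_leaf` and its `split` option are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values, split=False):
    """Return display rows; stem = ten's digit, leaf = one's digit.

    Assumes non-negative integers. If split is True, each stem is broken
    into two rows: leaves 0-4 first, then leaves 5-9.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}"
            for (stem, _), leaves in sorted(rows.items())]

# Bike accident data from the histogram example.
bike = [4, 3, 2, 5, 6, 8, 2,
        4, 1, 5, 3, 2, 7, 4,
        5, 2, 3, 3, 1, 4, 7]

display = stem_and_leaf(bike, split=True)
for row in display:
    print(row)
```

With `split=True` this reproduces the two rows shown for the bike accident example. Note that this sketch only prints stems that contain data, so it would hide gaps; a fuller version would also print the empty stems.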
Numerical Summaries
- Describe characteristics of distribution's shape
- Each summary known as a statistic
- Each statistic is also a variable
  - Observed value depends on the sample
- Measures of center or location
  - Examples: Mean and Median
- Measures of spread or dispersion
  - Examples: Standard Deviation and Range

Measures of Location
- Mean
  - Arithmetic average: $\bar{x} = \sum_{i=1}^{n} x_i / n$
  - Center of gravity or point of balance: $\sum (x_i - \bar{x}) = 0$
- Median
  - middle observation
  - 50% of observations at or above median
  - 50% of observations at or below median
  - if n odd, median is the $0.5(n+1)$-th largest obs
  - else, median is average of the $0.5n$-th and $(0.5n+1)$-th largest obs

Measures of Location
- Mode
  - most frequent observation(s)
- Quantiles/Quartiles
  - Divide data into groups using percentiles
  - Quartiles divide data into four equal parts
- Trimmed Mean
  - remove certain percentage of smallest/largest obs
  - compute mean of remaining observations
  - decide on percentage prior to inspection of data

Bicycle Accident Data
- Ordered data set (smallest to largest), n = 21 observations

  1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 6 7 7 8

- What are the measures of location?
  - $\bar{x} = (1 + 1 + 2 + \dots + 7 + 8)/21 = 81/21 = 3.86$
  - median: since n odd, middle obs (11th) = 4
  - mode: 2, 3 and 4 each occur 4 times
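The location measures for the bicycle accident data can be checked with Python's standard `statistics` module. The `trimmed_mean` helper below is a hypothetical function written for this sketch (it removes the same fraction of observations from each end), and the 10% trimming level is an arbitrary illustrative choice, not a value from the slides.

```python
import statistics

# Ordered bicycle accident data, n = 21.
data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
        4, 4, 4, 5, 5, 5, 6, 7, 7, 8]

mean = statistics.mean(data)          # 81 / 21
median = statistics.median(data)      # 11th ordered observation
modes = statistics.multimode(data)    # all values tied for most frequent

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the obs from EACH end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))    # number dropped per end
    return statistics.mean(ordered[k:len(ordered) - k])

trimmed = trimmed_mean(data, 0.10)    # drops the 2 smallest and 2 largest obs

print(f"mean = {mean:.2f}, median = {median}, modes = {modes}")
print(f"10% trimmed mean = {trimmed:.2f}")
```

The trimming proportion is passed in as an argument rather than tuned after looking at the output, in line with the advice to fix the percentage before inspecting the data.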
Visualizing Numerical Statistics
- If distribution symmetric, median = mean
- If distribution skewed, mean and median will be different
  - Mean pulled more towards the longer tail
- Comparison of Measures of Center
  - Resistance - insensitivity to changes in data set
    - Mean more sensitive to extreme observations compared to median and trimmed mean
  - Efficiency - ability to use all the information
  - Median more resistant than mean
  - Mean is more efficient

Bicycle Accident Data
- What if largest obs accidentally recorded as 18?
  - $\bar{x} = (1 + 1 + 2 + \dots + 7 + 18)/21 = 91/21 = 4.33$
  - median: since n odd, middle obs (11th) = 4
  - mode: 2, 3 and 4

Measures of Spread
- Range
  - diff between the largest and smallest obs
  - maximum - minimum
- Interquartile Range
  - diff between the third and first quartiles
  - spread of middle 50% of obs

Measures of Spread
- Variance
  - deviation is defined as $x - \bar{x}$
  - mean of deviations is always zero
  - compute average of squared deviations
  - commonly divide by $n - 1$ instead of $n$
- Standard deviation
  - the square root of the variance
  - measured in same units as observations

  $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{S_{xx}}{n - 1}}$
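Returning to the mis-recorded observation, the difference in resistance between the mean and the median can be seen directly by recomputing both after the largest value is changed from 8 to 18. This is a small sketch using the standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Ordered bicycle accident data, n = 21.
clean = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
         4, 4, 4, 5, 5, 5, 6, 7, 7, 8]

# Same data with the largest observation mis-recorded as 18.
recorded = clean[:-1] + [18]

mean_shift = statistics.mean(recorded) - statistics.mean(clean)
median_shift = statistics.median(recorded) - statistics.median(clean)

print(f"mean:   {statistics.mean(clean):.2f} -> {statistics.mean(recorded):.2f}")
print(f"median: {statistics.median(clean)} -> {statistics.median(recorded)}")
print(f"shift in mean = {mean_shift:.2f}, shift in median = {median_shift}")
```

A single bad value moves the mean by 10/21 of an accident while the median does not move at all, which is what "resistant" means here.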
Bicycle Accident Data
- Ordered data set (smallest to largest), n = 21 observations

  1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 6 7 7 8

- What are the measures of spread?
  - range = 8 - 1 = 7
  - interquartile range = 5 - 2 = 3
    - $0.25(21 + 1) = 5.5$, so $Q_1 = (2 + 2)/2 = 2$
    - $0.75(21 + 1) = 16.5$, so $Q_3 = (5 + 5)/2 = 5$

Bicycle Accident Data
- Ordered data set (smallest to largest), n = 21 observations

  1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 6 7 7 8

- What are the measures of spread?
  - standard deviation: use computational formula

    $s = \sqrt{\dfrac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$

  - $\sum x^2 = 1 + 1 + 4 + \dots + 64 = 391$
  - $s = \sqrt{(391 - (81)^2/21)/20} = 1.98$

Another Example
Compute the following numerical summaries
- Mean
- Median
- Range
- Standard Deviation
- IQR

  Life Expectancy in Specific Countries (source and year unknown)
  Country       Life Expectancy
  Kenya         59
  Japan         80
  Singapore     76
  Fiji          72
  Germany       76
  France        77
  Switzerland   78
  Taiwan        74
  Canada        78
  U.S.          77
  New Zealand   76
  Brunei        74
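The spread calculations for the bicycle accident data can be verified as follows. This sketch uses `statistics.quantiles` with `method="exclusive"`, which places the quartiles at positions based on n + 1 and, for this data set, gives the same values as the 0.25(n + 1) and 0.75(n + 1) rule above; other software may use different quartile conventions. The standard deviation is computed with the computational formula and compared against `statistics.stdev`.

```python
import math
import statistics

# Ordered bicycle accident data, n = 21.
data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
        4, 4, 4, 5, 5, 5, 6, 7, 7, 8]
n = len(data)

# Range: maximum - minimum.
data_range = max(data) - min(data)

# Quartiles at positions 0.25(n+1), 0.50(n+1), 0.75(n+1) in the ordered data.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Computational formula: s = sqrt((sum x^2 - (sum x)^2 / n) / (n - 1)).
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"sum x = {sum_x}, sum x^2 = {sum_x2}, s = {s:.2f}")
print(f"statistics.stdev agrees: {math.isclose(s, statistics.stdev(data))}")
```

The same code can be pointed at the life expectancy data, with the caveat that for n = 12 the quartile positions fall at 3.25 and 9.75, where the interpolation rule matters and hand calculations may differ depending on the convention used.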
Box Plot
- Graphical summary of 5 statistics
  - Minimum
  - 1st quartile
  - Median
  - 3rd quartile
  - Maximum
- Box defined by quartiles (first and third)
- Median represented as line in box
- Whiskers extended to min and max

Modified Box Plot
- Lower whisker extends no further than $Q_1 - 1.5\,\mathrm{IQR}$
- Upper whisker extends no further than $Q_3 + 1.5\,\mathrm{IQR}$
- Observations outside fence displayed as dots

Visualizing Statistics
- Range roughly the width of histogram
- Common to look at # of obs within $\bar{x} \pm ks$
- For fairly symmetric unimodal distribution
  - About 68% of observations within $\pm s$ of the mean
  - About 95% of observations within $\pm 2s$ of the mean
  - About 99.7% of observations within $\pm 3s$ of the mean
- Comparison of spread statistics
  - Range and standard deviation very non-resistant
  - Interquartile range resistant
  - Standard deviation most efficient

Examples
[Figure: four example plots; not recoverable from the extracted text]
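The fences of a modified box plot are easy to compute once the quartiles are known. The sketch below applies the 1.5 IQR rule to the bicycle accident data with the largest observation mis-recorded as 18, so that there is something to flag. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which is one of several quartile conventions and is used here as an assumption.

```python
import statistics

# Bicycle accident data with the largest obs mis-recorded as 18.
data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
        4, 4, 4, 5, 5, 5, 6, 7, 7, 18]

q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Fences: whiskers may not extend beyond these values.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are drawn individually as dots.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers stop at the most extreme observations still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence})")
print(f"whiskers = ({whisker_low}, {whisker_high}), outliers = {outliers}")
```

Note that the whiskers end at actual data values (1 and 7 here), not at the fences themselves; the fences only mark the cutoff for flagging observations.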
Measure of Spread Relative to Location: Coefficient of Variation
- Often spread increases with mean
  - Spending relative to income
  - Weight gain relative to initial weight
  - Response to dosage level
- Coefficient of Variation: $CV = s / \bar{x}$
  - $s$ and $\bar{x}$ measured in same units
  - CV is unit-less: ratio of spread to center
  - Expresses std dev as percentage of mean
  - Allows comparison of data sets with diff means

Linear Transformation of a Variable
- Consider linear transformation of X: $aX + b$
- How do the numerical and graphical summaries change?

Examples

  x    x-x̄   (x-x̄)²  |  y    y-ȳ   (y-ȳ)²  |  z    z-z̄   (z-z̄)²
  12   -4    16      |  6    -2    4       |  10   -4    16
  14   -2    4       |  7    -1    1       |  12   -2    4
  18   2     4       |  9    1     1       |  16   2     4
  20   4     16      |  10   2     4       |  18   4     16
  64   0     40      |  32   0     10      |  56   0     40    (column totals)

- $Y = 0.5X$ and $Z = X - 2$
- $\bar{x} = 16$, $\bar{y} = 8$, $\bar{z} = 14$
- $s_x = \sqrt{40/3}$, $s_y = \sqrt{10/3}$, $s_z = \sqrt{40/3}$

Changes?
- Measures of location
  - Transform changes statistic just like variable: $\bar{y} = a\bar{x} + b$
  - Quartiles also change in similar manner
- Measures of spread
  - Measure spread between observations
  - Only multiplicative transform affects spread: $s_Y = |a|\, s_X$

Linear Transformation of a Variable
- Additive transformation ($X \to Z$)
  - shifts each obs equal distance to left or right
  - distance between observations remains the same
- Multiplicative transformation ($X \to Y$)
  - distance between values increases or decreases
  - change of scale
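The effect of a linear transformation on the mean, standard deviation, and coefficient of variation can be checked numerically with the four-observation example above (Y = 0.5X and Z = X - 2). This is a minimal sketch using the standard `statistics` module; the helper name `summarize` is hypothetical.

```python
import math
import statistics

x = [12, 14, 18, 20]
y = [0.5 * v for v in x]     # multiplicative transformation: a = 0.5, b = 0
z = [v - 2 for v in x]       # additive transformation: a = 1, b = -2

def summarize(values):
    """Return (mean, sample standard deviation, coefficient of variation)."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)      # divides by n - 1
    return mean, s, s / mean

mean_x, s_x, cv_x = summarize(x)
mean_y, s_y, cv_y = summarize(y)
mean_z, s_z, cv_z = summarize(z)

for name, (m, s, cv) in {"x": (mean_x, s_x, cv_x),
                         "y": (mean_y, s_y, cv_y),
                         "z": (mean_z, s_z, cv_z)}.items():
    print(f"{name}: mean = {m:g}, s = {s:.3f}, CV = {cv:.3f}")
```

The output illustrates the rules on the slides: the mean follows the transformation in both cases, the standard deviation changes only under the multiplicative transformation, and the CV is unchanged by a pure rescaling but does change under a shift, because the shift moves the mean without changing the spread.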