Chapter 2. Description of samples and populations.

2.1 Introduction.

Statistics = the science of analyzing data. Information collected (data) is gathered in terms of variables (characteristics of a subject that can be assigned a numerical value or a nonnumerical category). The data themselves and their transformed forms are also called statistics.

Types of variables:

1. Categorical variable: records the category a subject belongs to, like Blood Type (O, A, B, AB) or Gender (Female, Male). Usually the categories do not have a meaningful order. Some categorical data are ordinal, meaning a natural order exists, for example response to a treatment: none, partial, complete.

2. Quantitative (numeric) variable: records an amount of something or a count of something. It can be continuous, with values on a continuous scale (weight of a newborn, cholesterol content in a blood specimen), or discrete, where the values can be listed and are often integers (number of eggs in a nest, number of bacteria in a petri dish). The distinction between discrete and continuous variables is not rigid, since we often round measurements to the nearest integer.

Sample = collection of persons or things on which we measure one or more variables. Sometimes the same word is used in a different context (for example a sample of blood taken from a subject). To avoid confusion we will say a specimen of blood in that case.

Some other vocabulary and notation:

Example. Twenty students reported their gender, blood type and weight to a researcher. The students are the observational units. The variables are Gender and Blood Type (both categorical) and Weight (numerical). The sample size is n = 20.

We will use capital letters like X and Y for the names of variables and lowercase letters (x or y) for particular observations. For example, we may use Y = weight of a student and $y_1 = 150$ lb as the weight of one such student (John).

2.2. Frequency distributions.
When data are collected, it helps to summarize them in tables and/or graphs. We will use some example data sets to examine different ways data can be displayed.

Ex1: Sample of Blood Type for 21 people:

A O A AB O B AB A O A O AB O A O B A AB A O A

We can summarize it using a frequency and relative frequency table.

Frequency = count in a particular class.
Relative frequency = frequency / n
% frequency = relative frequency * 100%
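The three definitions above can be sketched in a few lines of Python. This is only an illustration applied to the Ex1 data; the helper name frequency_table is a hypothetical choice, not something from the course materials.

```python
from collections import Counter

# Ex1: blood types of 21 people
blood_types = "A O A AB O B AB A O A O AB O A O B A AB A O A".split()

def frequency_table(values):
    """Map each class to (frequency, relative frequency, % frequency)."""
    n = len(values)
    counts = Counter(values)
    return {
        cls: (counts[cls], counts[cls] / n, 100 * counts[cls] / n)
        for cls in sorted(counts)  # alphabetical; any order is fine for categories
    }

table = frequency_table(blood_types)
for cls, (freq, rel, pct) in table.items():
    print(f"{cls:<3}{freq:>3}{rel:>10.4f}{pct:>8.1f}%")
```

The relative frequencies always add up to 1, which is a quick check on the counting.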
Frequency table results for Blood Type:

Blood Type   Frequency   Relative Frequency
A            8           0.3809524
AB           4           0.1904762
B            2           0.0952381
O            7           0.3333333

A suitable graphical display for such data is a bar chart. Notice that the classes do not have to be placed in any particular order.

Ex2: 40 couples, number of children in each family:

3 3 3 1 4 3 0 0 2 0 4 2 4 3 2 2 3 2 5 1
1 0 1 1 2 1 0 0 1 2 1 1 0 3 2 1 2 1 2 3

These data can be grouped using single values as classes, since there are relatively few different data values. Our classes will be, in order, 0, 1, 2, 3, 4, 5; the frequencies are computed exactly as in Ex1.
Frequency table results for Number of children:

Number of children   Frequency   Relative Frequency
0                    7           0.175
1                    11          0.275
2                    10          0.25
3                    8           0.2
4                    3           0.075
5                    1           0.025

The graphical display of such data is called a histogram; a bar is raised over each class, with the class value placed in the middle of its bar.

Another way to display such data is a dotplot. You place a dot over each data value. If a value is repeated, you stack multiple dots, equally spaced, above that value.

A grouped frequency distribution is appropriate for a data set with a lot of different values, as in the following example.

Ex3: Age at onset of diabetes (35 people):

48 41 57 83 41 55 59 61 38 48 79 75 77 7 54 23 47 56
79 68 61 64 45 53 82 68 38 70 10 60 83 76 21 65 47

If we decide to start at 0 and use groups of width 10, we get the following classes: [0,10), [10,20), [20,30) and so on. Read these as interval notation: the left endpoint is included and the right endpoint is excluded. A histogram for these data can also be obtained; a bar is raised over each class. The vertical axis can represent either frequency or relative frequency.
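The grouping into classes [0,10), [10,20), ... can be sketched as follows. This is a minimal illustration on the Ex3 data; the function name grouped_frequencies and its parameters start and width are hypothetical, and the sketch assumes every value is at least start.

```python
# Ex3: age at onset of diabetes (35 people)
ages = [48, 41, 57, 83, 41, 55, 59, 61, 38, 48, 79, 75, 77, 7, 54, 23, 47, 56,
        79, 68, 61, 64, 45, 53, 82, 68, 38, 70, 10, 60, 83, 76, 21, 65, 47]

def grouped_frequencies(values, start, width):
    """Count values in the classes [start, start+width), [start+width, start+2*width), ..."""
    n_classes = int((max(values) - start) // width) + 1
    counts = [0] * n_classes
    for v in values:
        # Left endpoint included, right endpoint excluded
        counts[int((v - start) // width)] += 1
    return [(start + k * width, start + (k + 1) * width, c)
            for k, c in enumerate(counts)]

classes = grouped_frequencies(ages, start=0, width=10)
for lo, hi, freq in classes:
    print(f"[{lo},{hi})  {freq:2d}  {freq / len(ages):.3f}")
```

Each printed row gives a class, its frequency and its relative frequency; either column can serve as the bar heights of the histogram.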
We can also obtain a quick histogram, otherwise called a stem-and-leaf diagram (or stemplot). Each data point is divided into a stem and a leaf; all possible stems are listed vertically and the leaves are added to them in order. Our stemplot for Ex3 is given below (stems are tens, leaves are ones); notice that the leaves are ordered.

0 | 7
1 | 0
2 | 1 3
3 | 8 8
4 | 1 1 5 7 7 8 8
5 | 3 4 5 6 7 9
6 | 0 1 1 4 5 8 8
7 | 0 5 6 7 9 9
8 | 2 3 3

Ex4: Radish growth (mm in 3 days). A (in the dark), B (12 hours of light / 12 hours of dark)

A: 15 20 22 20 29 37 11 35 15 30 8 25 33 10
B: 10 11 15 15 20 4 22 21 10 25 27 20 9 20

Side-by-side stemplots (with each stem split into 2 rows) let us compare both sets. In both, stems are tens and leaves are ones; the leaves of A are read outward from the stem.

      A |   | B
        | 0 | 4
      8 | 0 | 9
    1 0 | 1 | 0 0 1
    5 5 | 1 | 5 5
  2 0 0 | 2 | 0 0 0 1 2
    9 5 | 2 | 5 7
    3 0 | 3 |
    7 5 | 3 |

Interpreting areas of the histogram: the area of each bar of the histogram is proportional to the corresponding frequency. In Ex3 the area between 10 and 30 (2 bars) equals 3/35 ≈ 8.6% of the total area of the histogram.

Ex5: The amounts of iron intake, in milligrams, during a 24-hour period for a sample of 30 females under the age of 51:

15.0 18.1 14.4 14.6 10.9 18.1 18.2 18.3 15.0 16.0
12.6 16.6 20.7 19.8 11.6 12.8 15.6 11.0 15.3 9.4
19.5 18.3 14.5 16.6 11.5 16.4 12.5 14.6 11.9 12.5
In Ex5 we may select groups of width 2, namely [9,11), [11,13), [13,15) and so on. We will get 6 classes, an appropriate number for a data set of 30 observations.

Shapes of distributions: right-skewed distribution, left-skewed distribution, symmetric distribution.

2.3 Descriptive Measures of Center

Let Y be our variable, numerical.

Median = middle of the ordered data. The position (location) of the median is $(n+1)/2$, where n = sample size.

Ex: Weight gain in pounds for 6 young lambs: 1 2 10 11 13 19

0.5(6+1) = 3.5 (the median is between observations #3 and #4), so median = (10+11)/2 = 10.5 lb.

If we add one more observation, 10 lb, the data become: 1 2 10 10 11 13 19

0.5(7+1) = 4 (the median is observation #4), so median = 10 lb.

The median is a robust (resistant) measure of center: it is relatively unaffected by changes in a small portion of the data.

Mean (arithmetic mean): $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$, where the $y_i$ are the observations in the sample. In our example (the original 6 lambs) $\bar{y} = 56/6 \approx 9.33$ lb.
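The position rule for the median and the formula for the mean can be sketched in Python as below. This is a minimal illustration using the lamb data; the function names are hypothetical, and in practice the standard library's statistics module offers equivalent functions.

```python
def median(values):
    """Median located at position (n + 1) / 2 of the ordered data (1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if position.is_integer():
        # Odd n: the median is a single observation
        return ordered[int(position) - 1]
    # Even n: average the two observations on either side of the position
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

def mean(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)

lambs = [1, 2, 10, 11, 13, 19]
print(median(lambs))           # 10.5
print(median(lambs + [10]))    # 10
print(round(mean(lambs), 2))   # 9.33
```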
The differences between each data point and the mean, $(y_i - \bar{y})$, are called deviations from the mean. Their sum is zero for any data set: $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$.

In our example the sum of all deviations = -8.33 + (-7.33) + 0.67 + 1.67 + 3.67 + 9.67 = 0 (up to rounding).

The mean can be visualized as the point of balance of a weightless seesaw with the data points (like children) sitting on it. Unlike the median, the mean is not robust: it is influenced by any change in the data, and very much by extremes. If the data have some extreme values, then the median is a better measure of center for those data.

Mean vs Median: For symmetric distributions the mean and median are equal; if the distribution is left skewed, mean < median; if the distribution is right skewed, mean > median.

2.4 Boxplots.

Single-variable data may be summarized by 5 numbers: Minimum, Maximum, Median and the 2 Quartiles, referred to as the five-number summary. These values are also used to make a boxplot. The lower quartile, denoted by $Q_1$, is the median of the lower half of the data; the upper quartile, denoted by $Q_3$, is the median of the upper half of the data.

Ex1: Data represent systolic blood pressure (in mmHg) of 7 adult males:

151 124 132 170 146 124 113

We order the data first: 113 124 124 132 146 151 170

Min = 113, Max = 170, Median = 132, $Q_1$ = 124, $Q_3$ = 151 (the median is excluded when we compute the quartiles).

The boxplot connects all 5 numbers; the box represents the middle half of the data.

[Figure: boxplot of systolic blood pressure, horizontal axis from 110 to 170]
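The five-number summary, with quartiles computed as medians of the lower and upper halves, can be sketched as follows. This is an illustration under the convention stated above (the median is left out of both halves when n is odd); the function names are hypothetical, and library routines often use other quartile conventions that may give slightly different values.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # lower half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # skip the median if n is odd
    return (ordered[0], _median(lower), _median(ordered), _median(upper), ordered[-1])

pressure = [151, 124, 132, 170, 146, 124, 113]
print(five_number_summary(pressure))   # (113, 124, 132, 151, 170)
```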
Another measure we can compute is the interquartile range, IQR = $Q_3 - Q_1$. This measure gives the spread of the middle half of the data values. We can use it to find unusual data points (outliers). The procedure is as follows: compute lower fence = $Q_1 - 1.5 \cdot IQR$ and upper fence = $Q_3 + 1.5 \cdot IQR$. An outlier is a data point that falls outside of the fences.

In our example: IQR = 151 - 124 = 27, 1.5(IQR) = 1.5*27 = 40.5, lower fence = 124 - 40.5 = 83.5, upper fence = 151 + 40.5 = 191.5. All observations are within the fences, so there are no outliers in our data set.

Ex2: Radish growth (in mm) in the light:

4 5 5 7 7 8 9 10 10 10 10 14 20 21

Min = 4, Max = 21, $Q_1$ = 7, Median = (9+10)/2 = 9.5, $Q_3$ = 10, IQR = 3, lower fence = 2.5, upper fence = 14.5, so 20 and 21 are outliers. A modified boxplot exposes outliers.

[Figure: modified boxplot of radish growth, outliers marked with *, horizontal axis from 5 to 25]

2.5 Relationship between variables.

This section discusses various ways to compare two or more variables. Some methods include:

a) Two-way frequency and relative frequency tables, to examine the relationship between two categorical variables. They are useful for determining whether the variables are associated.
b) Scatter plots for numerical variables, to decide whether a linear trend is present, so that we can fit a regression line to the data.
c) Side-by-side boxplots, dotplots and stemplots, which are useful for observing whether there are differences between two or more treatments.
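Returning to the fence procedure of Section 2.4, a minimal Python sketch is given here, applied to the blood pressure and radish examples. The function names fences and outliers are hypothetical, and the quartiles are passed in as already computed by hand in the notes.

```python
def fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(values, q1, q3):
    """Data points that fall outside of the fences."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

pressure = [113, 124, 124, 132, 146, 151, 170]
radishes = [4, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21]

print(fences(124, 151))               # (83.5, 191.5)
print(outliers(pressure, 124, 151))   # []
print(fences(7, 10))                  # (2.5, 14.5)
print(outliers(radishes, 7, 10))      # [20, 21]
```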
2.6 Measures of dispersion (variability)

Range = Maximum - Minimum gives the overall spread of the data. It is easy to calculate, but very sensitive to extreme data values.

IQR, as we stated before, gives the range of the middle half of the data and is a robust measure, not sensitive to extreme data values.

Sample standard deviation:

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$

It averages the squared deviations from the mean (dividing by n-1). The square root is taken at the end, so the units of s are the same as the units of the data. $s \ge 0$, and s = 0 if all data points are the same. $s^2$ is the sample variance. We will abbreviate standard deviation as SD; s will be used in the formulas.

Ex: Experiment on chrysanthemums; a botanist measured stem elongation in 7 days (in mm): 76, 72, 65, 70, 82

n = 5, $\bar{y}$ = 365/5 = 73, deviations from the mean are: 3, -1, -8, -3, 9, squared deviations are: 9, 1, 64, 9, 81

$s = \sqrt{(9+1+64+9+81)/4} = \sqrt{164/4} = \sqrt{41} \approx 6.40$ mm, variance $s^2$ = 41 mm²

s gives the typical distance of the observations from the mean; larger s means more variability. Like the mean, s is influenced by extreme data values (it is not a robust measure).

n-1 = degrees of freedom of s. As an intuitive justification for why we use (n-1) and not n, consider n = 1: the variability of 1 observation can't be computed, since one data point gives no information about variability.

The coefficient of variation = s expressed as a percentage of the mean:

coefficient of variation = $\frac{s}{\bar{y}} \cdot 100\%$

It has no units and can be used to compare data sets with different units, for example:

Ex: Weight and height are measured for girls at age 2. Which of the two measures has greater variability?
Weight: mean = 12.6 kg, SD = 1.4 kg
Height: mean = 86.6 cm, SD = 2.9 cm
Coefficient of variation: 11.1% for weight and 3.3% for height. We conclude that weight is more variable: its SD is a much larger percentage of the mean than for height.
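The SD and coefficient of variation calculations above can be checked with a short Python sketch. The function names are hypothetical; the data are the chrysanthemum measurements and the weight/height summaries from the examples.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

def coefficient_of_variation(sd, mean):
    """SD expressed as a percentage of the mean (unit-free)."""
    return 100 * sd / mean

stems = [76, 72, 65, 70, 82]
print(sample_variance(stems))                           # 41.0
print(round(sample_sd(stems), 2))                       # 6.4
print(round(coefficient_of_variation(1.4, 12.6), 1))    # 11.1 (weight)
print(round(coefficient_of_variation(2.9, 86.6), 1))    # 3.3 (height)
```

With a single observation the division by n - 1 = 0 fails, which matches the remark that one data point gives no information about variability.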
Typical Percentages: The Empirical Rule

For a nice distribution (fairly symmetric, unimodal, no very long or very short tails) we expect to find:

about 68% of all data points within the interval $(\bar{y} - SD, \bar{y} + SD)$
about 95% of all data points within the interval $(\bar{y} - 2SD, \bar{y} + 2SD)$
more than 99% of all data points within the interval $(\bar{y} - 3SD, \bar{y} + 3SD)$

2.8. Statistical Inference

Statistical inference is the process of drawing conclusions about the population based on the observations in the sample. We can, for example, estimate the percentage of all people in England with blood type A as 44% (the sample proportion of people with that blood type). The sample must be considered a random sample from the entire population; it must be representative of that population. 44% is a statistic (the sample proportion $\hat{p} = y/n$, "p hat") that estimates a parameter of the population (the population proportion p). There are also other statistics we can use to estimate a population proportion, namely $\tilde{p} = \frac{y+2}{n+4}$, "p tilde". In each case y = number of people in the sample that have blood type A, n = sample size. We will discuss these estimates in later chapters.

Other parameters of the population that we often estimate from samples are:
population mean, $\mu$, is estimated by the sample mean, $\bar{y}$.
population SD, $\sigma$, is estimated by the sample SD, s.
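As a small illustration of the two proportion estimates, the sketch below evaluates them on the Ex1 sample from Section 2.2 (8 of 21 people with blood type A), since the counts behind the 44% figure are not given in these notes. The function names p_hat and p_tilde are hypothetical.

```python
def p_hat(y, n):
    """Sample proportion: y / n."""
    return y / n

def p_tilde(y, n):
    """Adjusted estimate: (y + 2) / (n + 4)."""
    return (y + 2) / (n + 4)

y, n = 8, 21   # Ex1: 8 of the 21 sampled people have blood type A
print(round(p_hat(y, n), 3))     # 0.381
print(round(p_tilde(y, n), 3))   # 0.4
```

For a sample this small the two estimates differ noticeably; the difference shrinks as n grows.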