Module 1: Review of Basic Statistical Concepts
1.2 Plotting Data, Measures of Central Tendency and Dispersion, and Correlation

Constructing a Trend Plot

A trend plot graphs the data against a variable of interest, often time or space. It is used to examine whether there is a relationship between the variable being examined and time or space.

Examples:
- Is the level of contamination in a well increasing over time?
- Is there a relationship between the measured concentration of a contaminant in a series of wells and their distance from a suspected source?
[Figure: Trend Plot. Y axis: Level of Contamination; X axis: Time, with tick marks at January and July of each year]

Notes on the trend plot:
- Both axes are labeled.
- Sometimes it makes sense to connect the points, sometimes not. Use your judgement.
- Do not use "Add a Trendline" in Excel unless you have run a regression analysis and know that the slope of the line is significantly different from zero. We'll learn how to do that later in the course.
Constructing a Scatter Plot

A scatter plot graphs values of one variable against the values of another variable. It is used to see if there is a relationship between the two variables.

[Figure: Scatter Plot. X axis: Contaminant A (ppb); Y axis: Contaminant B (ppb)]
Constructing a Histogram

A histogram is a graph of a sample probability density function (pdf). It is constructed as follows:
1. Choose 5-10 non-overlapping intervals that cover the range of the data.
2. Count how many data points fall into each interval.
3. Divide the number of points in each interval by the total number of data points to get relative frequencies.
4. Plot the data range on the X axis and the relative frequency on the Y axis.
5. For each interval, draw a bar as wide as the interval and as tall as its relative frequency.

[Figure: Histogram. X axis: Contaminant C (ppb); Y axis: Relative Frequency]
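The counting and dividing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `relative_frequencies`, the choice of equal-width intervals, and the sample values are all hypothetical and do not come from the course data.

```python
def relative_frequencies(data, low, high, n_bins):
    """Return (bin_edges, relative_frequencies) for equal-width bins on [low, high].

    Assumes every value in `data` lies between `low` and `high`.
    """
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Map each value to its interval; a value equal to `high`
        # is placed in the last interval.
        idx = min(int((x - low) / width), n_bins - 1)
        counts[idx] += 1
    total = len(data)
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, [c / total for c in counts]


# Hypothetical concentrations (ppb), made up for illustration only.
sample = [0.4, 0.7, 1.1, 1.2, 1.6, 1.8, 2.1, 2.4, 3.3, 4.9]
edges, rel_freq = relative_frequencies(sample, low=0.0, high=5.0, n_bins=5)

for left, right, freq in zip(edges[:-1], edges[1:], rel_freq):
    print(f"{left:.1f} to {right:.1f} ppb: relative frequency {freq:.2f}")
```

The relative frequencies always sum to 1, which is what lets the bars be read as a sample version of a probability distribution.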
Measures of Central Tendency

Sample Mean: the same as the average.

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

Sample Median: the middle value of a data set sorted from largest to smallest. If there is an even number of data points, average the two middle values.

Sample Mode: the most commonly occurring value.

Example: Heights to the nearest inch

60 64 65 67 67 67 69 70 72 72

Mean = (60+64+65+67+67+67+69+70+72+72)/10 = 67.3
Median = (67 + 67)/2 = 67
Mode = 67
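As a quick check of the heights example, the three measures can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative only.

```python
import statistics

# Heights to the nearest inch, from the example above.
heights = [60, 64, 65, 67, 67, 67, 69, 70, 72, 72]

mean_height = statistics.mean(heights)      # sum of the values divided by n
median_height = statistics.median(heights)  # average of the two middle values
mode_height = statistics.mode(heights)      # most commonly occurring value

print(mean_height, median_height, mode_height)
```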
Example: Salaries in a start-up dot-com company (in thousands)

27 27 33 35 85 150

Mean = 59.5K
Median = 34K
Mode = 27K

So, for symmetric distributions (like the normal), the mean is a good measure of central tendency, but for skewed distributions (like income or environmental contamination) it is heavily influenced by a few unusual points.

Measures of Dispersion

Measures of dispersion measure how spread out the data are.

Sample Range = largest value - smallest value

The problem with the range is that it tells you nothing about the rest of the data, and it is strongly affected by a single odd point.
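The sensitivity of the mean and the range to one unusual point can be seen by recomputing the salary example with and without the largest salary. The sketch below uses Python's standard `statistics` module; dropping the top salary is an illustrative choice, not part of the original example.

```python
import statistics

# Salaries in thousands, from the example above.
salaries_k = [27, 27, 33, 35, 85, 150]

mean_all = statistics.mean(salaries_k)
median_all = statistics.median(salaries_k)
range_all = max(salaries_k) - min(salaries_k)

# Illustrative assumption: remove the single largest salary and recompute.
without_top = sorted(salaries_k)[:-1]
mean_trimmed = statistics.mean(without_top)
median_trimmed = statistics.median(without_top)
range_trimmed = max(without_top) - min(without_top)

print(f"all six:     mean={mean_all}, median={median_all}, range={range_all}")
print(f"without 150: mean={mean_trimmed}, median={median_trimmed}, range={range_trimmed}")
```

Removing one point moves the mean from 59.5 to 41.4 and the range from 123 to 58, while the median only shifts from 34 to 33.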
Intuitively, you can think of the Sample Standard Deviation as the average difference between the data points and the mean. Unlike the range, it's a function of all of the data points.

A deviation is a difference between two values. We can easily calculate the deviation of each data point from the mean. If we summed these, we would get zero, so we must either square them or take their absolute values. Absolute values are difficult to work with mathematically, so we'll square the deviations. Then we average them to get the variance. Since we squared the deviations, the units of the variance are the square of the units of the data, so we take the square root to get back to the original units.

Because of some mathematical properties of the statistic, we use n-1 rather than n in taking the average of the squared deviations. The Sample Variance is s^2. Take its square root to get the sample standard deviation s.

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$
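The steps described above (deviations, squaring, dividing by n-1, taking the square root) can be written out directly. This is a minimal sketch using the heights data from this module's example; the function name `sample_variance` is illustrative, and the result is compared against `statistics.variance` from the Python standard library, which also divides by n-1.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


# Heights to the nearest inch, from this module's example.
heights = [60, 64, 65, 67, 67, 67, 69, 70, 72, 72]

mean_height = sum(heights) / len(heights)
deviation_sum = sum(x - mean_height for x in heights)  # zero, up to rounding

s2 = sample_variance(heights)
s = math.sqrt(s2)

print(f"sum of deviations = {deviation_sum:.10f}")
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
print(f"statistics.variance gives {statistics.variance(heights):.2f}")
```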
Example: Heights to the nearest inch

X_i    X_i - Xbar    (X_i - Xbar)^2
60       -7.3           53.29
64       -3.3           10.89
65       -2.3            5.29
67       -0.3            0.09
67       -0.3            0.09
67       -0.3            0.09
69        1.7            2.89
70        2.7            7.29
72        4.7           22.09
72        4.7           22.09
Sum                    124.10

Sample Mean is Xbar = 67.3
Sample Variance is s^2 = (1/(10-1))*124.1 = (1/9)*124.1 = 13.79
Sample Standard Deviation is the square root of 13.79 = 3.71

Percentiles of a Distribution

The population median is the point that has 50% of the distribution above it and 50% below. The sample median has 50% of the data above it and 50% below.

The percentiles of the distribution (or sample) are similar. The Xth percentile has X percent of the distribution (or data) below it and 100-X percent above it. For example, the 95th percentile has 95% of the distribution below it and 5% above it.
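For a sample, a percentile can be computed in several slightly different ways. The sketch below uses the nearest-rank convention (the smallest data value with at least X percent of the data at or below it), which is an assumption made here for simplicity. Statistical packages often interpolate between data points instead and may return slightly different values. The function name `percentile_nearest_rank` is illustrative.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank convention."""
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(values)
    # Rank of the smallest value with at least p percent of the data at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Heights to the nearest inch, from this module's example.
heights = [60, 64, 65, 67, 67, 67, 69, 70, 72, 72]

p50 = percentile_nearest_rank(heights, 50)
p95 = percentile_nearest_rank(heights, 95)
print(p50, p95)
```

With only ten data points the upper percentiles are coarse: every percentile above the 80th lands on the largest value, 72.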
Correlation

The correlation coefficient measures the degree of linear association between two variables. It is denoted by r and ranges between -1 and 1.
- A perfect linear association gives points that plot on a straight line.
- No association gives points that plot as a cloud.
- A positive linear association means that high values of one variable are associated with high values of the other.
- A negative linear association means that high values of one variable are associated with low values of the other.

Examples of the Correlation Coefficient

[Figure: two scatter plots, one with r = -1 and one with r = 1]
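The text above does not spell out how r is calculated. The sketch below uses the standard Pearson formula, the sum of products of deviations divided by the square root of the product of the two sums of squared deviations, and applies it to two made-up data sets that fall exactly on straight lines. The function name `correlation` and the data are illustrative assumptions.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: points lying exactly on a rising and on a falling line.
xs = [1, 2, 3, 4, 5]
r_rising = correlation(xs, [2 * x + 1 for x in xs])
r_falling = correlation(xs, [9 - x for x in xs])
print(r_rising, r_falling)
```

Points on a rising straight line give r = 1 and points on a falling straight line give r = -1, whatever the steepness of the line.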
[Figure: scatter plots with r = .9, r = -.9, and r = -.1]
Correlation

Note that r measures the degree of linear, or straight-line, association. Variables can have strong associations and still have very small correlations.

[Figure: a plot captioned "This association is strong but r = 0"]
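A small numerical illustration of this point, under stated assumptions: the data below are hypothetical, with y determined exactly by x through y = x^2, and the helper `pearson_r` implements the standard Pearson formula. The relationship is as strong as it can be, yet it is not linear, so r comes out as zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y is completely determined by x, but along a curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(r_curve)
```

This is why it helps to look at a scatter plot before relying on r alone.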