Statistical Analysis of Chemical Data Chapter 4

Random errors arise from limitations on our ability to make physical measurements and from natural fluctuations.

Figure: Histogram (bar graph) and normal curve (line graph).

Data that vary only because of random errors will be normally distributed around a mean value. The distribution of random data around the mean is described by a Gaussian distribution. Characteristics: bell-shaped; centered where mean = median = mode; the standard deviation measures the width of the distribution.

SAMPLE vs. POPULATION
A population is the set of entities about which statistical inferences are to be drawn. A sample is a subset of the population of manageable size. Statistics calculated from the sample are used to infer or extrapolate properties of the population.

SAMPLE vs. POPULATION
Population mean ($\mu$): mean of the entire population. Sample mean ($\bar{x}$): mean of a given sample.
When N is large (usually 20-30 or more), $\bar{x}$ approaches $\mu$. When N is small, there is a larger deviation between $\bar{x}$ and $\mu$.

SAMPLE vs. POPULATION
Population standard deviation ($\sigma$): measures the width of the distribution of a population. Sample standard deviation ($s$): applicable to finite samples.
When N is large (usually 20-30 or more), $s$ approaches $\sigma$. When N is small, there is a larger deviation between $s$ and $\sigma$.
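The slide's formulas for these quantities did not survive, so the following is only a minimal sketch of how the sample statistics are usually computed. The data values are hypothetical and chosen purely for illustration. The sketch relies on the standard convention that the sample standard deviation divides by N - 1, while the population formula divides by N.

```python
import statistics

# Hypothetical replicate measurements (illustrative values, not from these notes)
sample = [10.1, 9.8, 10.3, 10.0, 9.9]

x_bar = statistics.mean(sample)      # sample mean, the estimate of mu
s = statistics.stdev(sample)         # sample standard deviation (divides by N - 1)
sigma_n = statistics.pstdev(sample)  # population formula (divides by N)

print(f"x_bar = {x_bar:.3f}, s = {s:.4f}, population formula = {sigma_n:.4f}")
```

For a small sample the two standard deviations differ noticeably; the gap shrinks as N grows.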


Standard Deviation and Probability

Confidence Interval
A confidence interval (CI) is a range of values within which there is a specified probability of finding the true mean.
If you take only a single measurement $x$ from a population, that measurement has a confidence interval of: $\mu = x \pm z\sigma$
If you take many measurements, their mean $\bar{x}$ has a confidence interval of: $\mu = \bar{x} \pm \dfrac{z\sigma}{\sqrt{n}}$
NOTE: This applies when the population standard deviation ($\sigma$) is known or a good estimate of it is available.
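As a rough illustration of the z-based interval, the sketch below computes the limits for a single measurement (n = 1) or for the mean of n measurements. The function name `z_confidence_interval` and the numbers in the usage line are hypothetical; z is obtained from the standard normal distribution for a two-sided interval.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n=1, confidence=0.95):
    """Return (lower, upper) for mu = x_bar +/- z*sigma/sqrt(n).

    Assumes sigma is the known (or well-estimated) population standard deviation.
    """
    # Two-sided interval: put (1 - confidence)/2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical numbers: mean of 4 measurements is 10.0, sigma is 2.0
lower, upper = z_confidence_interval(10.0, 2.0, n=4)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

With n = 1 the same function gives the wider single-measurement interval.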

Confidence Intervals

Confidence Intervals
A common but incorrect notion of the confidence interval: given the true value and a specified confidence interval, the measurements will fall within this interval with a certain probability.
The correct notion: given the sample mean and a specified confidence interval, the true mean will fall within this interval with a certain probability.


Confidence Intervals
Student's t is a statistical tool used to express confidence intervals. When we don't know the population standard deviation, the confidence interval can be estimated from the sample standard deviation $s$ as: $\mu = \bar{x} \pm \dfrac{ts}{\sqrt{n}}$
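A minimal sketch of the t-based interval follows, applied to the four % S results quoted in Example 1 of these notes. The t table used in the lecture is not reproduced here, so the dictionary below holds standard tabulated two-tailed 95% values of Student's t, indexed by degrees of freedom (n - 1); the names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Standard tabulated Student's t values, 95% confidence, two-tailed
# (assumption: the lecture's own table is not shown in these notes)
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

data = [0.112, 0.118, 0.115, 0.119]  # % S results from Example 1
n = len(data)
x_bar = mean(data)
s = stdev(data)                      # sample standard deviation (N - 1)
half_width = T_95[n - 1] * s / sqrt(n)

lower, upper = x_bar - half_width, x_bar + half_width
print(f"mu = {x_bar:.4f} +/- {half_width:.4f}  ({lower:.4f} to {upper:.4f})")
```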

Hypothesis testing employs Student's t statistics
Student's t can be used to compare two sets of measurements to decide whether they are the same or different.
CASE 1: Comparing measured value to theoretical value
CASE 2: Comparing replicate sets of measurements (with different means and standard deviations)
CASE 3: Comparing paired data

Hypothesis testing employs Student's t statistics
CASE 1: Comparing measured value to theoretical value
We measure a quantity several times, obtaining an average value and a standard deviation. We need to compare our answer with a known, accepted answer. The average does not agree exactly with the accepted answer. Does our measured answer agree or disagree with the known value within experimental error?
Null Hypothesis ($H_0$): $\bar{x} = \mu_0$
Use the t-statistic: $t_{calc} = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
Alternative Hypothesis ($H_a$):
$\bar{x} \neq \mu_0$ (two-tailed) if $t_{calc} \geq t_{table}$ or $t_{calc} \leq -t_{table}$
$\bar{x} < \mu_0$ (one-tailed) if $t_{calc} \leq -t_{table}$
$\bar{x} > \mu_0$ (one-tailed) if $t_{calc} \geq t_{table}$

Hypothesis testing employs Student's t statistics
CASE 1: Comparing measured value to theoretical value
EXAMPLE 1. A new procedure for the rapid determination of the percentage of sulfur in kerosene was tested on a sample known from its method of preparation to contain 0.123% sulfur. The results were % S = 0.112, 0.118, 0.115 and 0.119. Do the data indicate that there is a bias in the method at the 95% confidence level?
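One way to work Example 1 is sketched below, using the Case 1 statistic $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$ with a two-tailed test. The critical values are standard tabulated 95% two-tailed Student's t values (an assumption, since the lecture's table is not reproduced), and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Standard tabulated Student's t values, 95% confidence, two-tailed (assumed table)
T_95_TWO_TAILED = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

data = [0.112, 0.118, 0.115, 0.119]  # measured % S
mu_0 = 0.123                         # accepted % S of the prepared sample
n = len(data)

t_calc = (mean(data) - mu_0) / (stdev(data) / sqrt(n))
t_table = T_95_TWO_TAILED[n - 1]

# Two-tailed test: reject H0 when |t_calc| >= t_table
biased = abs(t_calc) >= t_table
print(f"t_calc = {t_calc:.2f}, t_table = {t_table}, bias indicated: {biased}")
```

Under these assumptions |t_calc| is about 4.43, which exceeds 3.182, so the data indicate a bias at the 95% confidence level.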

Hypothesis testing employs Student's t statistics
CASE 1: Comparing measured value to theoretical value
EXAMPLE 2. Sewage and industrial pollutants dumped into a body of water can reduce the dissolved oxygen concentration and adversely affect aquatic species. In one study, weekly readings were taken from the same location in a river over a 2-month period (see table). Some scientists think that 5.0 ppm is a marginal dissolved O2 level for fish survival. Conduct a statistical test to determine whether the mean dissolved O2 concentration is less than 5.0 ppm at the 95% confidence level.

Week   Dissolved O2 (ppm)
1      4.9
2      5.1
3      5.6
4      4.3
5      4.7
6      4.9
7      4.5
8      5.1
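Because Example 2 asks whether the mean is less than 5.0 ppm, a one-tailed test applies. The sketch below is one possible working; the critical value 1.895 is the standard tabulated one-tailed 95% Student's t for 7 degrees of freedom (an assumption, as the lecture's table is not shown), and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

o2_ppm = [4.9, 5.1, 5.6, 4.3, 4.7, 4.9, 4.5, 5.1]  # weekly readings from the table
mu_0 = 5.0                                          # marginal level, ppm
n = len(o2_ppm)

t_calc = (mean(o2_ppm) - mu_0) / (stdev(o2_ppm) / sqrt(n))

# One-tailed 95% critical value for n - 1 = 7 degrees of freedom (assumed table value)
t_table = 1.895

# H_a: mean < mu_0 is accepted only when t_calc <= -t_table
below_marginal = t_calc <= -t_table
print(f"mean = {mean(o2_ppm):.4f}, t_calc = {t_calc:.3f}, mean < 5.0 ppm: {below_marginal}")
```

Here t_calc is about -0.79, which is not below -1.895, so under these assumptions the data do not show that the mean is less than 5.0 ppm.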

Hypothesis testing employs Student's t statistics
CASE 2: Comparing replicate measurements
We measure a quantity multiple times by two different methods that give two different answers, each with its own standard deviation. Do the two results agree with each other within experimental error, or do they disagree?
Null Hypothesis ($H_0$): $\bar{x}_1 = \bar{x}_2$
Use the t-statistic: $t_{calc} = \dfrac{|\bar{x}_1 - \bar{x}_2|}{s_{pooled}} \sqrt{\dfrac{n_1 n_2}{n_1 + n_2}}$
where $s_{pooled} = \sqrt{\dfrac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}$
Alternative Hypothesis ($H_a$): $\bar{x}_1 - \bar{x}_2 \neq 0$ if $t_{calc} \geq t_{table}$, with $n_1 + n_2 - 2$ degrees of freedom

Hypothesis testing employs Student's t statistics
CASE 2: Comparing replicate measurements
EXAMPLE 3. Lord Rayleigh measured the mass of dry air (O2-free) and chemically generated N2 of the same volume. Is dry air the same as chemically generated N2?

                     From air (g)   From chemical decomposition (g)
                     2.31017        2.30143
                     2.30986        2.29890
                     2.31010        2.29816
                     2.31001        2.30182
                     2.31024        2.29869
                     2.31010        2.29940
                     2.31028        2.29849
                                    2.29889
Average              2.31011        2.29947
Standard deviation   0.000143       0.00138
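Example 3 can be worked from the raw masses with the pooled-standard-deviation form of the Case 2 test. The sketch below assumes the column assignment of the flattened table (seven masses from air, eight from the chemical source) and uses 2.160, the standard tabulated two-tailed 95% Student's t for 13 degrees of freedom; the function name `pooled_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

from_air = [2.31017, 2.30986, 2.31010, 2.31001, 2.31024, 2.31010, 2.31028]
from_chem = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869, 2.29940, 2.29849, 2.29889]


def pooled_t(a, b):
    """Return (t_calc, degrees of freedom) using the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    s1, s2 = stdev(a), stdev(b)
    s_pooled = sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
    t = abs(mean(a) - mean(b)) / s_pooled * sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2


t_calc, dof = pooled_t(from_air, from_chem)
t_table = 2.160  # two-tailed 95%, 13 degrees of freedom (assumed table value)
different = t_calc >= t_table
print(f"t_calc = {t_calc:.1f}, dof = {dof}, different at 95%: {different}")
```

The result, t_calc of roughly 20.2, is far above the critical value, so under these assumptions the two gases differ. The two standard deviations are very unequal here, so the pooled form is a simplification; it is used only because it is the formula given in Case 2.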

Hypothesis testing employs Student's t statistics
CASE 2: Comparing replicate measurements
EXAMPLE 4. A reliable assay of ATP in a certain type of cell gives a value of 111.0 µmol/100 mL, with a standard deviation of 2.8 in four replicate measurements. You have developed a new assay, which gave the following values in replicate analyses: 117, 119, 111, 115, 120 µmol/100 mL.
a) Find the mean and standard deviation of your new analysis.
b) Can you be 95% confident that your method produces a result different from the reliable value?
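A possible working of Example 4 is sketched below. It treats the reliable assay as a set of four replicates with mean 111.0 and s = 2.8, so that the Case 2 pooled test applies; that reading of the problem is an assumption. The critical value 2.365 is the standard tabulated two-tailed 95% Student's t for 7 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Reliable assay, summarized as mean, standard deviation, number of replicates
x1, s1, n1 = 111.0, 2.8, 4

# New assay: raw replicate values
new_assay = [117, 119, 111, 115, 120]
x2, s2, n2 = mean(new_assay), stdev(new_assay), len(new_assay)

# Pooled standard deviation and t statistic (Case 2)
s_pooled = sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
t_calc = abs(x1 - x2) / s_pooled * sqrt(n1 * n2 / (n1 + n2))

t_table = 2.365  # two-tailed 95%, n1 + n2 - 2 = 7 degrees of freedom (assumed table value)
different = t_calc >= t_table
print(f"(a) mean = {x2:.1f}, s = {s2:.2f}")
print(f"(b) t_calc = {t_calc:.2f}, t_table = {t_table}, different: {different}")
```

Under this reading t_calc is about 2.46, slightly above 2.365, so the difference is significant at 95% but only narrowly.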

Hypothesis testing employs Student's t statistics
CASE 3: Comparing paired data
Sample 1 is measured once by Method A and once by Method B, which do not give exactly the same result. Then a different sample, designated as sample 2, is also measured once by Method A and once by Method B, and again the results are not exactly equal. The procedure is repeated for n different samples. Do the two methods agree with each other within experimental error, or is one systematically different from the other?
Null Hypothesis ($H_0$): $\bar{d} = \Delta_0$ (often $\Delta_0 = 0$)
Use the t-statistic: $t_{calc} = \dfrac{\bar{d} - \Delta_0}{s_d/\sqrt{n}}$, where $\bar{d}$ is the mean of the paired differences and $s_d$ is their standard deviation
Alternative Hypothesis ($H_a$): $\bar{d} \neq \Delta_0$ (two-tailed) if $t_{calc} \geq t_{table}$ or $t_{calc} \leq -t_{table}$

Hypothesis testing employs Student's t statistics
CASE 3: Comparing paired data
EXAMPLE 5. A new automated procedure for determining glucose in serum (Method A) is to be compared with an established method (Method B). Both methods are performed on the serum from six patients to eliminate patient-to-patient variability. Do the following results confirm a difference in the two methods at the 95% confidence level?

Patient            1      2     3     4     5     6
Method A, mg/L     1044   720   845   800   957   650
Method B, mg/L     1028   711   820   795   935   639
Difference, mg/L   16     9     25    5     22    11
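Example 5 can be worked with the paired (Case 3) statistic, taking the hypothesized difference as zero. The sketch below is one way to do it; 2.571 is the standard tabulated two-tailed 95% Student's t for 5 degrees of freedom (an assumption, since the lecture's table is not shown), and the function name `paired_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

method_a = [1044, 720, 845, 800, 957, 650]  # mg/L
method_b = [1028, 711, 820, 795, 935, 639]  # mg/L


def paired_t(a, b, delta_0=0.0):
    """Return t_calc for paired data: (d_bar - delta_0) / (s_d / sqrt(n))."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return (mean(diffs) - delta_0) / (stdev(diffs) / sqrt(n))


t_calc = paired_t(method_a, method_b)
t_table = 2.571  # two-tailed 95%, n - 1 = 5 degrees of freedom (assumed table value)
different = abs(t_calc) >= t_table
print(f"t_calc = {t_calc:.2f}, t_table = {t_table}, methods differ: {different}")
```

With a mean difference of about 14.7 mg/L and $s_d$ of about 7.76 mg/L, t_calc is roughly 4.63, which exceeds 2.571.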

Hypothesis testing employs Student's t statistics
CASE 3: Comparing paired data
EXAMPLE 6. Two different analytical methods were used to determine residual chlorine in sewage effluents. Both methods were used on the same samples, but the samples came from various locations, with differing amounts of contact time. The concentrations of Cl in mg/L are given in the table. Do the two methods give different results at the 90%, 95% and 99% confidence levels?

Sample   Method A   Method B
1        0.39       0.36
2        0.84       1.35
3        1.76       2.56
4        3.35       3.92
5        4.69       5.35
6        7.70       8.33
7        10.52      10.70
8        10.92      10.91
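For Example 6 the same paired statistic is compared against three critical values. The sketch below assumes the standard tabulated two-tailed Student's t values for 7 degrees of freedom (1.895, 2.365 and 3.499 for 90%, 95% and 99%); variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

method_a = [0.39, 0.84, 1.76, 3.35, 4.69, 7.70, 10.52, 10.92]  # mg/L Cl
method_b = [0.36, 1.35, 2.56, 3.92, 5.35, 8.33, 10.70, 10.91]  # mg/L Cl

diffs = [a - b for a, b in zip(method_a, method_b)]
n = len(diffs)
t_calc = mean(diffs) / (stdev(diffs) / sqrt(n))  # hypothesized difference is 0

# Two-tailed critical values for n - 1 = 7 degrees of freedom (assumed table values)
t_table = {90: 1.895, 95: 2.365, 99: 3.499}

results = {level: abs(t_calc) >= t_crit for level, t_crit in t_table.items()}
for level, differs in results.items():
    print(f"{level}%: |t_calc| = {abs(t_calc):.3f} vs {t_table[level]} -> differ: {differs}")
```

Under these assumptions |t_calc| is about 3.65, so the methods differ at all three levels, although the margin at 99% is small.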

Dealing with BAD DATA
Bad data are due to GROSS ERRORS and result in outliers. We use the Q test to decide whether to reject or retain an outlier.
$Q_{calc} = \dfrac{|x_{\text{questionable}} - x_{\text{nearest neighbor}}|}{\text{spread}}$, where the spread is the range of the data (largest value minus smallest value).
If $Q_{calc} > Q_{table}$, discard the questionable value.
EXAMPLE 7. The analysis of a calcite sample yielded % CaO values of 55.95, 56.00, 56.04, 56.08 and 56.23. The last value appears anomalous; should it be retained or discarded at the 95% confidence level?
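Example 7 can be sketched as follows. The Q table from the lecture is not reproduced in these notes, so the critical values below are commonly tabulated 95% Q-test values and should be treated as an assumption; the function name `q_calc` is hypothetical. The sketch only tests the largest or smallest value, whichever is questioned.

```python
# Commonly tabulated critical Q values at 95% confidence, by number of observations
# (assumption: the lecture's own Q table may round these differently)
Q_TABLE_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568,
              8: 0.526, 9: 0.493, 10: 0.466}


def q_calc(values, questionable):
    """Return Q = gap / spread for a questionable extreme value."""
    ordered = sorted(values)
    spread = ordered[-1] - ordered[0]
    if questionable == ordered[-1]:
        gap = ordered[-1] - ordered[-2]  # distance to nearest neighbor below
    elif questionable == ordered[0]:
        gap = ordered[1] - ordered[0]    # distance to nearest neighbor above
    else:
        raise ValueError("the Q test applies to the largest or smallest value")
    return gap / spread


cao = [55.95, 56.00, 56.04, 56.08, 56.23]  # % CaO
q = q_calc(cao, 56.23)
discard = q > Q_TABLE_95[len(cao)]
print(f"Q_calc = {q:.3f}, Q_table = {Q_TABLE_95[len(cao)]}, discard: {discard}")
```

Here the gap is 0.15 and the spread is 0.28, giving Q_calc of about 0.54, which is below 0.710, so under these assumptions the value 56.23 is retained.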