Chemometrics. Matti Hotokka Physical chemistry Åbo Akademi University


Hypothesis testing. Inference methods: confidence levels, descriptive statistics, hypothesis testing, predictive statistics.

Hypothesis testing: The steps involved
1. Formulate a null hypothesis. This is what you want to claim, e.g., the sample is within tolerances.
2. Formulate an alternative hypothesis. This is the complement of the null hypothesis, e.g., the sample is not within tolerances.
3. Calculate a characteristic number (the test statistic).
4. Compare it with tabulated values.
5. Accept or reject the null hypothesis.

Hypothesis testing. A huge number of tests exist: tests for the mean, tests for the distribution, tests for spread, tests for outliers, etc.

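Hypothesis testing: Test for the mean. Double-sided t-test, null hypothesis x̄ = μ. [Figure: distribution P(X) against x, with an acceptable region around μ and a "no-no" rejection region in each tail.]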

Hypothesis testing: Mean at nominal value (double-sided). The ibuprofen concentration must be 400 mg per pill. Therefore μ = 400 mg. Take 5 pills and measure the ibuprofen content. The results are 396, 388, 398, 382, 373 mg. Mean x̄ = 387.4 mg, s = 10.3 mg. Calculate the test statistic, $t = |\bar{x} - \mu| / (s/\sqrt{n}) = 12.6/4.60 = 2.74$. Degrees of freedom = n − 1 = 4. Choose risk level: 5 % (95 % confidence). Read the table for Student's t-test at risk level 0.025, because a risk of 2.5 % at the low end and 2.5 % at the high end gives a total risk of 5 %. The value in the table, 2.776, is slightly larger than the calculated one, so the null hypothesis cannot be rejected at this risk level. The result is borderline, however: if the mean is rounded to 387 mg before t is calculated, t = 2.82 exceeds 2.776 and the null hypothesis would be rejected. The data do not show, at the 95 % confidence level, that the ibuprofen content deviates from the prescribed amount.
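The calculation for the ibuprofen example can be sketched in Python with the standard library only. This is a minimal illustration, not part of the original slides: the variable names are made up here, and the critical value is copied from the Student's t table in this document rather than computed. Note that the decision is sensitive to rounding: the unrounded data give t close to 2.74, rounding the mean to 387 mg gives about 2.82, and the tabulated 2.776 lies between the two.

```python
import math
import statistics

# Ibuprofen content of the five pills (mg), from the example above.
pills = [396, 388, 398, 382, 373]
mu = 400.0  # nominal value (mg)

n = len(pills)
mean = statistics.mean(pills)
s = statistics.stdev(pills)  # sample standard deviation (n - 1 in the denominator)
t = abs(mean - mu) / (s / math.sqrt(n))

# Tabulated Student's t for 4 degrees of freedom at one-sided risk 0.025
# (total two-sided risk 5 %), copied from the table in this document.
t_crit = 2.776

print(f"mean = {mean:.1f} mg, s = {s:.1f} mg, t = {t:.2f}, d.f. = {n - 1}")
print("reject H0" if t > t_crit else "cannot reject H0")
```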

Student's distribution (reminder)

D.f.   Risk 0.05   Risk 0.025   Risk 0.0125
1      6.314       12.706       25.452
2      2.920       4.303        6.205
3      2.353       3.182        4.176
4      2.132       2.776        3.495
5      2.015       2.571        3.163
10     1.812       2.228        2.634
15     1.753       2.131        2.490
20     1.725       2.086        2.423
∞      1.6448      1.9600       2.2414

N = number of measurements. D.f. = degrees of freedom = N − 1. This table is one-sided. Therefore, in a two-sided test, the total risk at level 0.025 is 2.5 % + 2.5 % = 5 % and the confidence probability is 95 %.

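Hypothesis testing: Test for the mean. One-sided t-test, x̄ = μ. [Figure: distribution P(X) against x, with an acceptable region that includes μ and a single "no-no" rejection region in one tail.]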

Hypothesis testing: Mean below a nominal value (one-sided). The EU regulatory limit for nitrate in drinking water is 50 mg/l. Determinations from 4 parallel samples gave the results 51.0, 51.3, 51.6, 50.9 mg/l. Is this just random variation or is the observed level systematically above the prescribed limit? Mean 51.2 mg/l and st.dev. 0.316 mg/l. Null hypothesis: the limit is not exceeded, x̄ ≤ μ; alternative hypothesis: the level is too high. Calculate $t = (\bar{x} - \mu)/(s/\sqrt{n}) = 1.2/0.158 = 7.59$. Choose risk level: 5 %. D.f. = 4 − 1 = 3. The test is one-sided, so the column for risk 0.05 is read directly. The tabulated value of t, 2.353, is smaller than the calculated one. The null hypothesis must be rejected. The concentration is too high.
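A possible Python sketch of the one-sided test for the nitrate example follows. The helper `one_sided_t` is a hypothetical name introduced here for illustration, and the critical value is taken from the Student's t table in this document.

```python
import math
import statistics

def one_sided_t(values, limit):
    """Return (t, degrees of freedom) for the null hypothesis: true mean <= limit."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return (mean - limit) / (s / math.sqrt(n)), n - 1

nitrate = [51.0, 51.3, 51.6, 50.9]  # mg/l, from the example above
t, dof = one_sided_t(nitrate, limit=50.0)

# Table value for 3 degrees of freedom, one-sided risk 0.05.
t_crit = 2.353
exceeds = t > t_crit

print(f"t = {t:.2f}, d.f. = {dof}, limit exceeded: {exceeds}")
```

Because the test is one-sided, the sign of t matters: a mean below the limit would give a negative t and could never lead to rejection.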

Hypothesis testing: Compare two means. Compare two sets of parallel measurements from different samples. Do the two samples differ significantly? This is a two-sided test.

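Hypothesis testing: Do two production batches differ? Quality control tests the day and night shifts at a refinery. The octane numbers of parallel measurements are (1: day) 94.92, 95.07, 94.96, 95.02, 94.99, 94.93; (2: night) 95.03, 95.08, 94.98, 95.03, 95.01, 94.99. Means: (1) 94.98; (2) 95.02. St.dev.: (1) 0.057; (2) 0.036. Weighted (pooled) st.dev.: $s_p = \sqrt{[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2]/(n_1 + n_2 - 2)} = 0.048$. Student's t: $t = |\bar{x}_1 - \bar{x}_2| / (s_p \sqrt{1/n_1 + 1/n_2}) = 1.39$ with unrounded values (1.44 if the rounded means and standard deviation are inserted). D.f. = n_1 + n_2 − 2 = 10. Choose risk level 2.5 %, read column 0.0125: t = 2.634. Comparison: the calculated t is smaller than the tabulated one, so no, we cannot say that the two results differ. Only random variations are observed.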
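The comparison of the two shifts can be reproduced with a short Python sketch. It assumes the usual pooled-variance form of the two-sample t statistic; variable names are illustrative, and the critical value comes from the Student's t table in this document.

```python
import math
import statistics

# Octane numbers from the example above.
day = [94.92, 95.07, 94.96, 95.02, 94.99, 94.93]
night = [95.03, 95.08, 94.98, 95.03, 95.01, 94.99]

n1, n2 = len(day), len(night)
dof = n1 + n2 - 2

# Pooled (weighted) variance of the two series.
pooled_var = ((n1 - 1) * statistics.variance(day)
              + (n2 - 1) * statistics.variance(night)) / dof
s_pooled = math.sqrt(pooled_var)

t = abs(statistics.mean(day) - statistics.mean(night)) / (
    s_pooled * math.sqrt(1 / n1 + 1 / n2))

# Table value for 10 degrees of freedom, column 0.0125 (two-sided risk 2.5 %).
t_crit = 2.634

print(f"s_pooled = {s_pooled:.3f}, t = {t:.2f}, d.f. = {dof}")
print("batches differ" if t > t_crit else "no significant difference")
```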

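Hypothesis testing: Dixon's Q test for outliers. Can be applied also for very few observations. Arrange your n observations in ascending order, x_1 ≤ x_2 ≤ ... ≤ x_n. Calculate the numbers $Q_1 = (x_2 - x_1)/(x_n - x_1)$ for the smallest observation and $Q_n = (x_n - x_{n-1})/(x_n - x_1)$ for the largest one. Null hypothesis: not an outlier. It is accepted if the calculated Q is less than the tabulated one.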

Hypothesis testing: Dixon's Q test for outliers. Critical values of the Q test at the 1 % risk level. Number of observations = n.

n     Q
3     0.99
4     0.89
5     0.76
6     0.70
7     0.64
8     0.59
9     0.56
10    0.53
11    0.50
12    0.48
13    0.47
14    0.45
15    0.44
20    0.39
30    0.34

Hypothesis testing: Dixon's Q test for outliers. Persons of the following ages participate in a bus tour to see a theater performance in Helsinki: 6, 7, 5, 6, 7, 6, 103, 8, 7, 5. Order them: 5, 5, 6, 6, 6, 7, 7, 7, 8, 103. Q_1 = (5 − 5)/(103 − 5) = 0, so 5 is not an outlier; Q_n = (103 − 8)/(103 − 5) = 0.969, which exceeds the tabulated value 0.53 for n = 10, so 103 certainly is an outlier.
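The bus-tour example can be checked with a few lines of Python. The function name `dixon_q` is a hypothetical helper; it uses the gap between an extreme value and its nearest neighbour divided by the total range, which reproduces the Q values quoted above.

```python
def dixon_q(values):
    """Return (Q for the smallest value, Q for the largest value)."""
    x = sorted(values)
    spread = x[-1] - x[0]            # total range of the data
    q_low = (x[1] - x[0]) / spread   # gap below the second-smallest value
    q_high = (x[-1] - x[-2]) / spread  # gap above the second-largest value
    return q_low, q_high

ages = [6, 7, 5, 6, 7, 6, 103, 8, 7, 5]
q_low, q_high = dixon_q(ages)

# Critical value for n = 10 at the 1 % risk level, from the table above.
q_crit = 0.53

print(f"Q_1 = {q_low:.3f}, Q_n = {q_high:.3f}")
print("largest value is an outlier" if q_high > q_crit else "largest value is kept")
```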

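Hypothesis testing: Grubbs' test for outliers. Observation x* is not an outlier in a series if $T = |x^* - \bar{x}|/s$ is smaller than the tabulated critical value, where x̄ and s are the mean and the standard deviation of the whole series.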

Hypothesis testing: Grubbs' test for outliers. Critical values for Grubbs' outlier test at the 95 % and 99 % levels. Number of observations = n.

n     T(95%)   T(99%)
3     1.15     1.16
4     1.46     1.49
5     1.67     1.75
6     1.82     1.94
7     1.94     2.10
8     2.03     2.22
9     2.11     2.32
10    2.18     2.41
12    2.29     2.55
15    2.41     2.71
20    2.56     2.88
30    2.75     3.10
40    2.87     3.24
50    2.96     3.34
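As a sketch of how the table is used, the following Python snippet applies Grubbs' test to the bus-tour ages from the Dixon example, reused here purely for illustration. It assumes the usual form of the Grubbs statistic, T = |x* − x̄| / s, with the mean and standard deviation taken over the whole series including the suspect value; the helper name `grubbs_t` is hypothetical.

```python
import statistics

def grubbs_t(values, suspect):
    """Grubbs statistic for one suspect value (assumed form: |x* - mean| / s)."""
    return abs(suspect - statistics.mean(values)) / statistics.stdev(values)

ages = [6, 7, 5, 6, 7, 6, 103, 8, 7, 5]
T = grubbs_t(ages, suspect=103)

# Critical values for n = 10, copied from the table above.
t_crit_95, t_crit_99 = 2.18, 2.41

print(f"T = {T:.2f}")
print("outlier at the 99 % level" if T > t_crit_99 else "not an outlier at the 99 % level")
```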

Hypothesis testing: Outliers in linear regression. To find out whether or not observation k (value y_k) is an outlier:
1) Calculate a new regression with observation k removed.
2) Calculate $e_k = y_k^{\text{obs}} - y_k^{\text{calc}}$.
3) Reject the observation if this distance exceeds a set limit; the limit is often two or three standard deviations.
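The three steps above can be sketched as a leave-one-out loop. Everything specific in this snippet is an assumption made for illustration: the data are invented (roughly y = 2x with one disturbed point), the function names are hypothetical, and "standard deviation" is interpreted as the residual standard deviation of the regression fitted without observation k.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def leave_one_out_outliers(xs, ys, k_sigma=3.0):
    """Indices k whose residual from the fit without k exceeds k_sigma * s."""
    flagged = []
    for k in range(len(xs)):
        rx, ry = xs[:k] + xs[k + 1:], ys[:k] + ys[k + 1:]
        slope, intercept = fit_line(rx, ry)          # step 1: refit without k
        sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(rx, ry))
        s = math.sqrt(sse / (len(rx) - 2))           # two fitted parameters
        e_k = ys[k] - (slope * xs[k] + intercept)    # step 2: y_obs - y_calc
        if abs(e_k) > k_sigma * s:                   # step 3: compare with limit
            flagged.append(k)
    return flagged

# Invented illustration data, not from the lecture.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.0, 14.0, 12.1]
outliers = leave_one_out_outliers(x, y)
print("flagged indices:", outliers)
```

Removing observation k before fitting matters: a strong outlier left in the fit pulls the line towards itself and inflates the residual spread, which can hide it.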

ANOVA: Analysis of variance. Used to test interdependences between batches. Used as an analysis tool for designed experiments. Requires several parallel measurements (replicates) in each batch (or experiment).

ANOVA: One-way analysis. Assume that samples are taken at four different times from the waste water of a factory to study the potassium concentration (mg/l). Each sample is analysed by a different crew. Three parallel measurements are made to determine the concentration of each sample.

Replicate   Batch 1   Batch 2   Batch 3   Batch 4
1           10.2      10.6      10.3      10.5
2           10.4      10.8      10.4      10.7
3           10.0      10.9      10.7      10.4
Mean        10.20     10.77     10.47     10.53

Do the true concentrations μ_1, μ_2, μ_3 and μ_4 differ?

ANOVA: Variation between samples. For the table of potassium concentrations, the overall mean is ȳ_total = 10.49. The variation between the batches (the factor) is

$\mathrm{SSQ}_{\text{fact}} = n \sum_{j} (\bar{y}_j - \bar{y}_{\text{total}})^2 = 0.489$,

where the sum runs over the batches (number of batches = 4) and n = 3 is the number of replicates. Inserting the rounded means gives 0.494 instead.

Yeah, well, the sum of squares is 0.489. So what? The variation of the batch means must be related to the general fluctuations in the whole set of measurements. Calculate the spread within batch 1, 2, etc. and combine. Then there must be a test to see whether or not a critical value at a given risk level is exceeded.

The variation within the batches (the residual) is

$\mathrm{SSQ}_R = \sum_{j} \sum_{i} (y_{ij} - \bar{y}_j)^2 = 0.260$.

ANOVA: Variation between samples. Calculate the variance estimate based on the batch means (α is the number of batches, n the number of replicates), $s_X^2 = \mathrm{SSQ}_{\text{fact}}/(\alpha - 1)$. Calculate the pooled variance of the whole set, $s_p^2 = \mathrm{SSQ}_R/[\alpha(n - 1)]$. Form the test quantity $F = s_X^2/s_p^2$.

Compare the calculated F value with the tabulated values to find out how probable it is that the two variances differ so much. Degrees of freedom: numerator α − 1 = 4 − 1 = 3; denominator α(n − 1) = 4(3 − 1) = 8. For this example s_X² = 0.163, s_p² = 0.0325 and F ≈ 5.0. This is beyond the critical value F_0.05 = 4.07 but not beyond F_0.01 = 7.59, so the probability that the concentrations are the same is between 1 % and 5 %. At the 5 % risk level: blow the whistle.
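A minimal Python sketch of the one-way analysis for the potassium data follows. It assumes the standard one-way ANOVA definitions, in which each sum of squares is divided by its degrees of freedom before the ratio F is formed; variable names are illustrative. Computed this way, F comes out near 5.0, which can be compared with the tabulated values for 3 and 8 degrees of freedom (F 0.05 = 4.07, F 0.01 = 7.59).

```python
# Potassium concentrations (mg/l): one inner list per batch, three replicates each.
batches = [
    [10.2, 10.4, 10.0],
    [10.6, 10.8, 10.9],
    [10.3, 10.4, 10.7],
    [10.5, 10.7, 10.4],
]

a = len(batches)      # number of batches
n = len(batches[0])   # number of replicates per batch

grand_mean = sum(sum(b) for b in batches) / (a * n)
batch_means = [sum(b) / n for b in batches]

# Sum of squares between batches (factor) and within batches (residual).
ss_fact = n * sum((m - grand_mean) ** 2 for m in batch_means)
ss_res = sum((y - m) ** 2 for b, m in zip(batches, batch_means) for y in b)

df_fact = a - 1
df_res = a * (n - 1)
f_value = (ss_fact / df_fact) / (ss_res / df_res)

print(f"SSQ_fact = {ss_fact:.3f}, SSQ_R = {ss_res:.3f}")
print(f"F = {f_value:.2f} with {df_fact} and {df_res} degrees of freedom")
```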

F distribution. Critical values of F; the columns give the degrees of freedom for the numerator, the row blocks the degrees of freedom for the denominator.

Df denom.   Level     Df num. 1   Df num. 2   Df num. 3
4           F 0.25    1.81        2.00        2.05
4           F 0.10    4.54        4.32        4.19
4           F 0.05    7.71        6.94        6.59
4           F 0.01    21.2        18.0        16.7
4           F 0.001   74.1        61.3        56.2
6           F 0.25    1.62        1.76        1.78
6           F 0.10    3.78        3.46        3.29
6           F 0.05    5.99        5.14        4.76
6           F 0.01    13.7        10.9        9.78
6           F 0.001   35.5        27.0        23.7
8           F 0.25    1.54        1.66        1.67
8           F 0.10    3.46        3.11        2.92
8           F 0.05    5.32        4.46        4.07
8           F 0.01    11.3        8.65        7.59
8           F 0.001   25.4        18.5        15.8

E.g., F 0.05 means that there is a 5 % probability that the variances differ this much by chance. Choose the correct d.f.'s and read down the column until the tabulated critical value matches your calculated F; the row gives the probability level.

ANOVA: One-way analysis. Another view on F: Of all possible sources of difference, only the difference of the batches is considered, hence one-way. The model is the average of all batches, 10.49. The spread explained by the model is the fluctuation of individual batch averages from the overall average. This is s_X². The unexplained part is the fluctuation of the individual observations from the corresponding batch averages. This is measured by s_p².

ANOVA: Two-way analysis. There is another factor influencing the fluctuations besides the difference in batches, namely the difference in determinations in individual replications. In a two-way analysis both are considered.

ANOVA: Two-way analysis.

Replicate   Batch 1   Batch 2   Batch 3   Batch 4   Replicate mean
1           10.2      10.6      10.3      10.5      10.40
2           10.4      10.8      10.4      10.7      10.57
3           10.0      10.9      10.7      10.4      10.50
Mean        10.20     10.77     10.47     10.53     10.49

The one-way model predicts the overall mean, 10.49. The two-way analysis predicts for the observation batch = 2, replicate = 1 the overall mean plus a correction for the batch mean (10.77 − 10.49 = 0.28) plus a correction for the replicate mean (10.40 − 10.49 = −0.09). Thus the model predicts the value 10.49 + 0.28 − 0.09 = 10.68. The observed value is 10.6, so the residual is −0.075 when unrounded means are used.

ANOVA: Two-way analysis. Here X̄_j is the mean of replicate j, X̄_i the mean of batch i and X̄ the overall mean.

Replicate     Batch 1   Batch 2   Batch 3   Batch 4   X̄_j     (X̄_j − X̄)²
1             10.2      10.6      10.3      10.5      10.40   0.0084
2             10.4      10.8      10.4      10.7      10.57   0.0069
3             10.0      10.9      10.7      10.4      10.50   0.0001
X̄_i           10.20     10.77     10.47     10.53     10.49   Sum: 0.0154
(X̄_i − X̄)²    0.0851    0.0756    0.0006    0.0017    Sum: 0.1631

SS_A = 3 × 0.1631 = 0.4892 (3 = number of replicates). Differences between batches, Df = 4 − 1 = 3.
SS_B = 4 × 0.0154 = 0.0617 (4 = number of batches). Differences between replicates, Df = 3 − 1 = 2.

ANOVA: Two-way analysis. Residuals (residual = observed value − predicted value).

Replicate   Batch 1   Batch 2   Batch 3   Batch 4
1            0.092    −0.075    −0.075     0.058
2            0.117    −0.050    −0.150     0.083
3           −0.208     0.125     0.225    −0.142

SS_R = 0.1983, Df = 3 × 2 = 6.

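ANOVA: Two-way analysis. Mean squares (each sum of squares divided by its degrees of freedom):
For the batches: MS_A = SS_A/3 = 0.1631
For the replications: MS_B = SS_B/2 = 0.0308
For the residuals: MS_R = SS_R/6 = 0.0331
Test quantities:
For the batches: F_A = MS_A/MS_R = 4.93
For the replications: F_B = MS_B/MS_R = 0.93

ANOVA: Two-way analysis. Compare with the critical values in the table. F_A, with 3 and 6 degrees of freedom, lies between F_0.05 = 4.76 and F_0.01 = 9.78, so the probability that the batches are similar is between 1 % and 5 %. F_B, with 2 and 6 degrees of freedom, is below F_0.25 = 1.76, so the probability that the replicates are similar is larger than 25 %.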
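The two-way decomposition can be sketched in Python as follows. The sketch assumes the standard two-way ANOVA without replication: each observation is predicted as batch mean + replicate mean − overall mean, and each sum of squares is divided by its degrees of freedom before the F ratios are formed. Variable names are illustrative. With these definitions F for the batches comes out near 4.9 and F for the replicates near 0.9, to be compared with the tabulated values for (3, 6) and (2, 6) degrees of freedom.

```python
# Potassium concentrations (mg/l): rows are replicates, columns are batches.
data = [
    [10.2, 10.6, 10.3, 10.5],
    [10.4, 10.8, 10.4, 10.7],
    [10.0, 10.9, 10.7, 10.4],
]

n_rep = len(data)      # number of replicates (rows)
n_bat = len(data[0])   # number of batches (columns)

grand = sum(map(sum, data)) / (n_rep * n_bat)
rep_means = [sum(row) / n_bat for row in data]
bat_means = [sum(row[j] for row in data) / n_rep for j in range(n_bat)]

# Sums of squares for batches (A), replicates (B) and residuals (R).
ss_a = n_rep * sum((m - grand) ** 2 for m in bat_means)
ss_b = n_bat * sum((m - grand) ** 2 for m in rep_means)
ss_r = sum(
    (data[i][j] - (bat_means[j] + rep_means[i] - grand)) ** 2
    for i in range(n_rep)
    for j in range(n_bat)
)

# Mean squares and F ratios.
ms_r = ss_r / ((n_rep - 1) * (n_bat - 1))
f_a = (ss_a / (n_bat - 1)) / ms_r
f_b = (ss_b / (n_rep - 1)) / ms_r

print(f"SS_A = {ss_a:.4f}, SS_B = {ss_b:.4f}, SS_R = {ss_r:.4f}")
print(f"F_A = {f_a:.2f}, F_B = {f_b:.2f}")
```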