Experimental design
Matti Hotokka, Department of Physical Chemistry, Åbo Akademi University
Contents: Elementary concepts; Regression; Validation; Hypothesis testing; ANOVA; PCA, PCR, PLS; Clusters, SIMCA; Design of Experiments.
[1] Wonnacott & Wonnacott: Introductory Statistics, Wiley
[2] Snedecor & Cochran: Statistical Methods, Iowa State Univ. Press
[3] Otto: Chemometrics, Wiley
Hypothesis testing. Inference methods: descriptive statistics, hypothesis testing, predictive statistics. Confidence levels.
Hypothesis testing. Steps involved: Formulate a null hypothesis (this is what you want to claim, e.g., the sample is within tolerances). Formulate an alternative hypothesis (the complement of the null hypothesis, e.g., the sample is not within tolerances). Calculate a characteristic number. Compare it with tabulated values. Accept or reject the null hypothesis.
Hypothesis testing. A huge number of tests exist: tests for the mean, for the distribution, for the spread, for outliers, etc.
Hypothesis testing. Test for the mean: two-sided t-test,
t = |x̄ − µ| / (s/√n)
[Figure: distribution P(x̄) with the acceptance region around µ and rejection ("no-no") regions in both tails.]
Hypothesis testing. Mean at a nominal value. The ibuprofen concentration must be 400 mg per pill; therefore the null hypothesis is µ = 400 mg. Take 5 pills and measure the ibuprofen content. The results are 396, 388, 398, 382, 373 mg. Mean x̄ = 387.4 mg, s = 10.3 mg. Calculate the characteristic number: t = |x̄ − µ| / (s/√n) = 2.74. Degrees of freedom = n − 1 = 4. Choose the risk level: 5 % (95 % confidence). Read the table for Student's t-test at risk level 0.025, because 2.5 % at the low end plus 2.5 % at the high end gives a total risk of 5 %. The value in the table, 2.776, is slightly larger than the calculated one, so the null hypothesis is narrowly retained: at the 95 % confidence level the data do not prove a deviation from the prescribed 400 mg. (Note how close the decision is; rounding the mean to 387 mg gives t = 2.82 and the opposite conclusion.)
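The calculation above can be sketched in Python (standard library only). Computed without intermediate rounding, the statistic is t ≈ 2.74; the often-quoted 2.82 follows from rounding the mean to 387 mg first.

```python
import math
from statistics import mean, stdev

def t_two_sided(data, mu0):
    """Two-sided one-sample t statistic: t = |x_bar - mu0| / (s / sqrt(n))."""
    n = len(data)
    return abs(mean(data) - mu0) * math.sqrt(n) / stdev(data)

pills = [396, 388, 398, 382, 373]   # ibuprofen content, mg
t = t_two_sided(pills, 400.0)
print(round(t, 2))                  # 2.74 with the unrounded mean
# Critical value for 4 d.f. at total risk 5 % (table column 0.025): 2.776
print(t > 2.776)                    # False: the decision is borderline
```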
Student's distribution.
D.f.  Risk 0.05  0.025   0.0125
1     6.314      12.706  25.452
2     2.920      4.303   6.205
3     2.353      3.182   4.176
4     2.132      2.776   3.495
5     2.015      2.571   3.163
10    1.812      2.228   2.634
15    1.753      2.131   2.490
20    1.725      2.086   2.423
∞     1.6448     1.9600  2.2414
N = number of samples; D.f. = degrees of freedom = N − 1. This table is one-sided: at level 0.025 the total two-sided risk is 2.5 % + 2.5 % and the confidence probability is 95 %.
Hypothesis testing. Test for the mean: one-sided t-test,
t = (x̄ − µ) / (s/√n)
[Figure: distribution P(x̄) with the acceptance region and a single rejection ("no-no") region in the upper tail.]
Hypothesis testing. Mean below a nominal value. The EU regulatory limit for nitrate in drinking water is 50 mg/l. Determinations from 4 parallel samples gave the results 51.0, 51.3, 51.6, 50.9 mg/l. Is this just random variation, or is the observed level systematically above the prescribed limit? Mean 51.2 mg/l, st.dev. 0.316 mg/l. Null hypothesis: the limit is not exceeded. Alternative hypothesis: the level is too high. Calculate t = 7.59. Choose risk level: 5 %. D.f. = 4 − 1 = 3. The tabulated value of t, 2.353, is smaller than the calculated one. The null hypothesis must be rejected: the concentration is too high.
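The nitrate example follows the same pattern with a one-sided statistic; a minimal sketch:

```python
import math
from statistics import mean, stdev

def t_one_sided(data, mu0):
    """One-sided t statistic: t = (x_bar - mu0) / (s / sqrt(n))."""
    n = len(data)
    return (mean(data) - mu0) * math.sqrt(n) / stdev(data)

nitrate = [51.0, 51.3, 51.6, 50.9]  # mg/l
t = t_one_sided(nitrate, 50.0)
print(round(t, 2))                  # 7.59
# One-sided critical value for 3 d.f. at 5 % risk: 2.353
print(t > 2.353)                    # True: reject the null hypothesis
```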
Hypothesis testing. Compare two means: compare two sets of parallel measurements from different samples. Do the two samples differ significantly? A two-sided test.
t = (|x̄1 − x̄2| / s_d) · √(n1·n2 / (n1 + n2))
s_d = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]
D.f. = n1 + n2 − 2
Hypothesis testing. Do two production batches differ? Quality control tests the day and night shifts at a refinery. The octane numbers of parallel measurements are (1: day) 94.92, 95.07, 94.96, 95.02, 94.99, 94.93; (2: night) 95.03, 95.08, 94.98, 95.03, 95.01, 94.99. Means: (1) 94.98; (2) 95.02. St.dev.: (1) 0.057; (2) 0.036. Weighted st.dev. = 0.048. Student's t = 1.39 (the rounded intermediate values above give 1.44), d.f. = 10. Choose total risk level 2.5 % and read column 0.0125: t = 2.634. Comparison: no, we cannot say that the two results differ. Therefore only random variation is observed.
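A sketch of the pooled two-sample t-test above (standard library only); computing with full precision gives t ≈ 1.39 rather than 1.44, but the conclusion is the same either way:

```python
import math
from statistics import mean, stdev

def t_two_sample(a, b):
    """Pooled two-sample t statistic with n1 + n2 - 2 degrees of freedom."""
    n1, n2 = len(a), len(b)
    s1, s2 = stdev(a), stdev(b)
    # Weighted (pooled) standard deviation s_d:
    sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return abs(mean(a) - mean(b)) / sd * math.sqrt(n1 * n2 / (n1 + n2))

day   = [94.92, 95.07, 94.96, 95.02, 94.99, 94.93]
night = [95.03, 95.08, 94.98, 95.03, 95.01, 94.99]
t = t_two_sample(day, night)
print(round(t, 2))   # 1.39 without intermediate rounding
print(t > 2.634)     # False: no significant difference between the shifts
```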
Hypothesis testing. Dixon's Q test for outliers. Can be applied also for very few observations. Arrange your n observations in ascending order and calculate the numbers Q1 and Qn:
Q1 = (x2 − x1) / (xn − x1);  Qn = (xn − x(n−1)) / (xn − x1)
Null hypothesis: not an outlier. Accepted if the calculated Q is less than the tabulated one.
Hypothesis testing. Dixon's Q test for outliers. Critical values of the Q test at the 1 % risk level; n = number of observations.
n   Q      n   Q
3   0.99   11  0.50
4   0.89   12  0.48
5   0.76   13  0.47
6   0.70   14  0.45
7   0.64   15  0.44
8   0.59   20  0.39
9   0.56   30  0.34
10  0.53
Hypothesis testing. Dixon's test for outliers. People of the following ages take part in a bus trip to the theatre in Helsinki: 6, 7, 5, 6, 7, 6, 103, 8, 7, 5. Order them: 5, 5, 6, 6, 6, 7, 7, 7, 8, 103. Q1 = 0, so 5 is not an outlier; Qn = 0.969, so 103 certainly is an outlier.
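The Dixon test on the bus-trip ages can be reproduced in a few lines:

```python
def dixon_q(data):
    """Return (Q1, Qn) for the smallest and the largest observation."""
    x = sorted(data)
    span = x[-1] - x[0]
    q1 = (x[1] - x[0]) / span     # tests the smallest value
    qn = (x[-1] - x[-2]) / span   # tests the largest value
    return q1, qn

ages = [6, 7, 5, 6, 7, 6, 103, 8, 7, 5]
q1, qn = dixon_q(ages)
print(q1, round(qn, 3))   # 0.0 0.969
# Critical Q for n = 10 at the 1 % risk level: 0.53
print(qn > 0.53)          # True: 103 is an outlier
```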
Hypothesis testing. Grubbs' test for outliers. Observation x* is not an outlier in a series if
T = |x* − x̄| / s < T(tabulated)
Hypothesis testing. Grubbs' outlier test. Critical values for Grubbs' outlier test at the 95 % and 99 % levels; n = number of observations.
n  T(95%)  T(99%)    n   T(95%)  T(99%)
3  1.15    1.16      10  2.18    2.41
4  1.46    1.49      12  2.29    2.55
5  1.67    1.75      15  2.41    2.71
6  1.82    1.94      20  2.56    2.88
7  1.94    2.10      30  2.75    3.10
8  2.03    2.22      40  2.87    3.24
9  2.11    2.32      50  2.96    3.34
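Applying Grubbs' test to the same bus-trip ages used in the Dixon example confirms the conclusion (for n = 10 the tabulated 99 % value is 2.41):

```python
from statistics import mean, stdev

def grubbs_T(data, x_star):
    """Grubbs' statistic T = |x* - x_bar| / s for a suspected outlier x*."""
    return abs(x_star - mean(data)) / stdev(data)

ages = [6, 7, 5, 6, 7, 6, 103, 8, 7, 5]
T = grubbs_T(ages, max(ages))
print(round(T, 2))   # 2.84
print(T > 2.41)      # True: 103 is an outlier at the 99 % level
```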
Hypothesis testing. Outliers in linear regression. To find out whether or not observation k (value y_k) is an outlier: 1) Calculate a new regression with observation k removed. 2) Calculate e_k = y_k(obs) − y_k(calc). 3) Compare e_k with the residual spread of the new regression.
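A sketch of this leave-one-out procedure; the data points are invented for illustration (values near y = 2x, with the observation at x = 5 as the suspect):

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    xb, yb = mean(xs), mean(ys)
    b = (sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
         / sum((x - xb) ** 2 for x in xs))
    return yb - b * xb, b

# Hypothetical data: the last observation looks suspicious.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.0, 15.0]
k = 4                                  # index of the suspected outlier
a, b = fit_line(xs[:k], ys[:k])        # new regression with observation k removed
e_k = ys[k] - (a + b * xs[k])          # residual of the left-out observation
print(round(e_k, 2))                   # 5.05, far above the fit's residual spread
```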
ANOVA. Analysis of variance. Used to test differences between batches. Used as an analysis tool for designed experiments. Requires several parallel measurements (replicates) of each batch (or experiment).
Anova. One-way analysis. Assume that four different samples are taken from the waste water of a factory to study the potassium concentration (mg/l). Each sample is analysed by a different crew. Three parallel measurements are made to determine the concentration of each sample.
Replicate  Batch 1  Batch 2  Batch 3  Batch 4
1          10.2     10.6     10.3     10.5
2          10.4     10.8     10.4     10.7
3          10.0     10.9     10.7     10.4
Mean       10.20    10.77    10.47    10.53
Anova. Variation between samples. With the batch means from the table, ȳ_total = 10.49 and
SSQ_fact = Σ_{j=1..4} n_j (ȳ_j − ȳ_total)² = 0.489
Anova. Variation within samples.
SSQ_R = Σ_{j=1..4} Σ_{i=1..n_j} (y_ij − ȳ_j)² = 0.260
Anova. Total variation.
SSQ_corr = Σ_{j=1..q} Σ_{i=1..n_j} (y_ij − ȳ_total)² = 0.749
Anova. The total variation broken down into contributions:
SSQ_corr = SSQ_fact + SSQ_R  (0.749 = 0.489 + 0.260)
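The three sums of squares for the potassium data can be checked directly:

```python
from statistics import mean

batches = [
    [10.2, 10.4, 10.0],   # batch 1
    [10.6, 10.8, 10.9],   # batch 2
    [10.3, 10.4, 10.7],   # batch 3
    [10.5, 10.7, 10.4],   # batch 4
]
y_all = [y for batch in batches for y in batch]
y_total = mean(y_all)

# Between-batch, within-batch and total sums of squares:
ssq_fact = sum(len(b) * (mean(b) - y_total) ** 2 for b in batches)
ssq_r    = sum((y - mean(b)) ** 2 for b in batches for y in b)
ssq_corr = sum((y - y_total) ** 2 for y in y_all)

print(round(ssq_fact, 3), round(ssq_r, 3), round(ssq_corr, 3))   # 0.489 0.26 0.749
print(round(ssq_fact + ssq_r, 3) == round(ssq_corr, 3))          # True
```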
PCA. Principal component analysis. PCA finds a direction along which the points lie. Example data matrix (rows = objects, columns = features Ca and pH):
X = [2 1; 3 2; 4 3]
PCA. What does it mean? [Figure: the objects plotted in the (x = Ca, y = pH) plane; they lie along the principal component direction (1 1).]
PCA. What is it? PCA classifies the observations. It does not perform any regression. X = [2 1; 3 2; 4 3]; the three objects are labelled Low, Medium and High.
PCA. Next principal component. The next principal direction, with the next-largest spread, must be orthogonal to the first one.
PCA. The second principal component. [Figure: the same points with the first principal component (1 1) and the second, orthogonal principal component (1 −1).]
PCA. How is it done? The direction of largest spread needs to be found. The spread along the coordinate axes is given by the variance-covariance matrix. Its eigenvalues give the characteristic spreads, and the corresponding eigenvectors give the directions. Eigenvectors are automatically orthogonal. So, diagonalize the matrix. Only, you don't.
PCA. How is it done? Diagonalization gives ALL eigenvalues, but you only need the few largest ones. Use special mathematical techniques instead.
PCA. Eigenvalues. The spread of the first component is the largest, that of the second smaller, etc. Two or three components usually explain all the spread down to the experimental errors. [Figure: scree plot of eigenvalue (= spread) against component number 1–5; the later components do not differentiate the observations.]
PCA. How is it done, then? Break down the observations X into a product of a scores matrix T and a loadings matrix L:
X = T L^T  (compare: y = a x)
In the toy example, X = [2 1; 3 2; 4 3] with scores T = (2 3 4)^T and loadings L^T = (1 1). This example is mathematically inconsistent!
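A consistent version of the toy example can be checked numerically: after column-centering, the variance-covariance matrix of X is [[1, 1], [1, 1]], its largest eigenvalue is 2 with eigenvector (1 1)/√2 (the loading direction from the slides), and scores times loadings reproduce the centered data exactly. A sketch with a hand-rolled 2×2 eigensolver:

```python
import math
from statistics import mean

# Toy data from the slides: rows = objects, columns = features (Ca, pH).
X = [[2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]

# Column-center the data.
means = [mean(col) for col in zip(*X)]
Xc = [[x - m for x, m in zip(row, means)] for row in X]

# 2x2 variance-covariance matrix.
n = len(Xc)
def cov(i, j):
    return sum(row[i] * row[j] for row in Xc) / (n - 1)
a, b, d = cov(0, 0), cov(0, 1), cov(1, 1)

# Eigenvalues of [[a, b], [b, d]] from the characteristic polynomial.
disc = math.sqrt((a - d) ** 2 + 4 * b * b)
lam1 = (a + d + disc) / 2      # largest spread
lam2 = (a + d - disc) / 2
print(lam1, lam2)              # 2.0 0.0

# Normalized eigenvector for lam1: the loading vector L, direction (1 1)/sqrt(2).
v = (b, lam1 - a)
norm = math.hypot(*v)
L = (v[0] / norm, v[1] / norm)

# Scores T = Xc L; then the outer product T L^T reconstructs the centered data.
T = [row[0] * L[0] + row[1] * L[1] for row in Xc]
recon = [[t * L[0], t * L[1]] for t in T]
ok = all(math.isclose(r, x, abs_tol=1e-9)
         for rr, xr in zip(recon, Xc) for r, x in zip(rr, xr))
print(ok)                      # True
```

One principal component carries all the spread here (the second eigenvalue is zero), which is exactly why the points lie on a single line in the Ca-pH plane.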
PCA. Loadings. The loadings matrix tells the direction of the principal component. [Figure: the (x = Ca, y = pH) plane with the principal component direction, L^T = (1 1).]
PCA. Scores. The scores matrix tells where the points lie along the new coordinate axis. [Figure: the points' positions along the principal component axis in the (x = Ca, y = pH) plane.]
PCA. A real case. Hair samples from a crime site were analyzed. The following elemental compositions of the hairs of the suspects were detected.
Hair  Cu    Mn    Cl    Br    I
1     9.2   0.30  1730  12.0  3.6
2     12.4  0.39  930   50.0  2.3
3     7.2   0.32  2750  65.3  3.4
4     10.2  0.36  1500  3.4   5.3
5     10.1  0.50  1040  39.2  1.9
6     6.5   0.20  2490  90.0  4.6
7     5.6   0.29  2940  88.0  5.6
8     11.8  0.42  867   43.1  1.5
9     8.5   0.25  1620  5.2   6.2
PCA. Scores. Consider two principal components. [Figure: score plot of the nine hair samples in the PC1-PC2 plane.]
PCA. Loadings. Loadings tell how much the original variables contribute to the principal components. [Figure: loading plots showing the contributions of Cu, Mn, Cl, Br and I to PC1 and PC2.]
PCR. Principal component regression. (Multivariate) linear regression along the principal components. Only one (or a few) regressor variable(s). Maximal resolving power.
PLS. Partial least squares. Linear regression: y = x a. Multivariate regression: y = X a. PLS: Y = U Q^T, i.e. the responses are decomposed into scores U and loadings Q, analogously to the decomposition of X.
Cluster analysis Cluster analysis finds observations that are more similar to each other than to observations outside the cluster.
Cluster analysis. Distance. Cluster analysis is based on the distance (or similarity) between objects: city-block distance, Euclidean distance, Pearson distance, Mahalanobis distance, ...
Cluster analysis. City-block distance:
d12 = |x11 − x21| + |x12 − x22|
[Figure: two points in the feature plane; the distance is the sum of the coordinate differences along Feature 1 and Feature 2.]
Cluster analysis. Euclidean distance:
d12 = [(x11 − x21)² + (x12 − x22)²]^(1/2)
[Figure: the same two points; the distance is the length of the straight line between them.]
Cluster analysis. Pearson distance:
d_ij = [ Σ_{k=1..K} (x_ik − x_jk)² / s_k² ]^(1/2)
Cluster analysis. Example data. Concentrations of calcium and phosphate in six blood serum samples (mg per 100 ml).
Object  Calcium  Phosphate
1       8.0      5.5
2       8.25     5.75
3       8.7      6.3
4       10.0     3.0
5       10.25    4.0
6       9.75     3.5
d12 = [(8.0 − 8.25)² + (5.5 − 5.75)²]^(1/2) = 0.354
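The two basic distance measures, checked against the d12 value above:

```python
import math

def city_block(p, q):
    """City-block (Manhattan) distance."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

serum = {1: (8.0, 5.5), 2: (8.25, 5.75), 3: (8.7, 6.3),
         4: (10.0, 3.0), 5: (10.25, 4.0), 6: (9.75, 3.5)}
d12 = euclidean(serum[1], serum[2])
print(round(d12, 3))                          # 0.354
print(round(city_block(serum[1], serum[2]), 2))  # 0.5
```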
Cluster analysis. Distance matrix.
Object  1      2      3      4      5      6
1       0
2       0.354  0
3       1.063  0.711  0
4       3.201  3.260  3.547  0
5       2.704  2.658  2.774  1.031  0
6       2.658  2.704  2.990  0.559  0.707  0
The smallest distance, 0.354, joins objects 1 and 2 into cluster 1*. [Dendrogram: 1 and 2 merged.]
Cluster analysis. Second distance matrix. Distances to the merged cluster are the averages of its members' distances (average linkage).
Object  1*     3      4      5      6
1*      0
3       0.887  0
4       3.231  3.547  0
5       2.681  2.774  1.031  0
6       2.681  2.990  0.559  0.707  0
The smallest distance, 0.559, joins objects 4 and 6 into cluster 4*. [Dendrogram: 1+2 and 4+6 merged.]
Cluster analysis. Third distance matrix.
Object  1*     3      4*     5
1*      0
3       0.887  0
4*      2.956  3.269  0
5       2.681  2.774  0.869  0
The smallest distance, 0.869, joins 4* and 5 into cluster 5*. [Dendrogram: 4, 6 and 5 merged.]
Cluster analysis. Fourth distance matrix.
Object  1*     3      5*
1*      0
3       0.887  0
5*      2.819  3.021  0
The smallest distance, 0.887, joins 1* and 3 into cluster 3*; the final step merges 3* and 5* at 2.92. [Dendrogram: complete.]
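The whole sequence of distance matrices can be reproduced with a short average-linkage (WPGMA) clustering, where the distance from a merged cluster to any other cluster is the mean of the two previous distances:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

points = {1: (8.0, 5.5), 2: (8.25, 5.75), 3: (8.7, 6.3),
          4: (10.0, 3.0), 5: (10.25, 4.0), 6: (9.75, 3.5)}

# Start from single-object clusters and their pairwise distances.
clusters = [frozenset([k]) for k in points]
d = {}
for i, ca in enumerate(clusters):
    for cb in clusters[i + 1:]:
        (a,), (b,) = ca, cb
        d[frozenset([ca, cb])] = euclidean(points[a], points[b])

merge_heights = []
while len(clusters) > 1:
    pair = min(d, key=d.get)              # closest pair of clusters
    ca, cb = tuple(pair)
    merge_heights.append(round(d[pair], 3))
    new = ca | cb
    clusters = [c for c in clusters if c not in (ca, cb)]
    for c in clusters:                    # WPGMA update: average the old distances
        d[frozenset([new, c])] = (d[frozenset([ca, c])] + d[frozenset([cb, c])]) / 2
    d = {p: v for p, v in d.items() if ca not in p and cb not in p}
    clusters.append(new)

print(merge_heights)   # [0.354, 0.559, 0.869, 0.887, 2.92]
```

The merge heights match the smallest distances of the four matrices above, and the final height 2.92 joins the {1, 2, 3} and {4, 5, 6} branches of the dendrogram.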