Experimental design. Matti Hotokka Department of Physical Chemistry Åbo Akademi University


Contents: Elementary concepts; Regression; Validation; Hypothesis testing; ANOVA; PCA, PCR, PLS; Clusters, SIMCA; Design of Experiments. References: [1] Wonnacott & Wonnacott, Introductory Statistics, Wiley. [2] Snedecor & Cochran, Statistical Methods, Iowa State Univ. Press. [3] Otto, Chemometrics, Wiley.

Hypothesis testing. An inference method based on stated confidence levels; it complements descriptive statistics and predictive statistics.

Hypothesis testing. Steps involved: Formulate a null hypothesis; this is what you want to claim, e.g., the sample is within tolerances. Formulate an alternative hypothesis; this is the complement of the null hypothesis, e.g., the sample is not within tolerances. Calculate a characteristic number. Compare it with tabulated values. Accept or reject the null hypothesis.

Hypothesis testing. A huge number of tests exist: tests for the mean, tests for distribution, tests for spread, tests for outliers, etc.

Hypothesis testing. Test for the mean: two-sided t-test,

t = |x̄ − µ| / (s / √n)

[Figure: distribution P(X) with an acceptance region around the mean and rejection regions in both tails.]

Hypothesis testing. Mean at a nominal value. The ibuprofen concentration must be 400 mg per pill, so µ = 400 mg. Take 5 pills and measure the ibuprofen content. The results are 396, 388, 398, 382, 373 mg. Mean x̄ = 387 mg, s = 10.3 mg. Calculate the characteristic number: t = 2.82. Degrees of freedom = n − 1 = 4. Choose the risk level: 5 % (95 % confidence). Read the table for Student's t-test at risk level 0.025, because a risk of 2.5 % at the low end and 2.5 % at the high end gives a total risk of 5 %. The value in the table, 2.776, is smaller than the calculated one, so reject the null hypothesis and accept the alternative hypothesis: we cannot guarantee at the 95 % confidence level that the pills contain the prescribed amount of ibuprofen.
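As a check on the arithmetic, the statistic can be computed with a short standard-library script (an editorial sketch, not part of the original lecture). Note that with the unrounded mean 387.4 mg it comes out near 2.74; the slide's 2.82 follows from the rounded mean 387 mg.

```python
# Two-sided one-sample t-test for the ibuprofen example above
# (standard library only; function name is mine).
import math
import statistics

def t_statistic(sample, mu):
    """t = |mean - mu| / (s / sqrt(n)) for a two-sided test."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)      # sample st.dev., n - 1 in the denominator
    return abs(mean - mu) / (s / math.sqrt(n))

pills = [396, 388, 398, 382, 373]     # measured contents, mg
t = t_statistic(pills, 400.0)

# Critical value for d.f. = 4 at total risk 5 % (table column 0.025) is 2.776
print(round(t, 2))
```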

Student's distribution. N = number of samples; d.f. = degrees of freedom = N − 1. This table is one-sided; therefore the total risk at level 0.025 is 2.5 % + 2.5 % and the confidence probability is 95 %.

D.f.   Risk 0.05   0.025    0.0125
1      6.314      12.706   25.452
2      2.920       4.303    6.205
3      2.353       3.182    4.176
4      2.132       2.776    3.495
5      2.015       2.571    3.163
10     1.812       2.228    2.634
15     1.753       2.131    2.490
20     1.725       2.086    2.423
∞      1.6448      1.9600   2.2414

Hypothesis testing. Test for the mean: one-sided t-test,

t = (x̄ − µ) / (s / √n)

[Figure: distribution P(X) with an acceptance region and a single rejection region in one tail.]

Hypothesis testing. Mean above a nominal value. The EU regulatory limit for nitrate in drinking water is 50 mg/l. Determinations from 4 parallel samples gave the results 51.0, 51.3, 51.6, 50.9 mg/l. Is this just random variation, or is the observed level systematically above the prescribed limit? Mean x̄ = 51.2 mg/l and st.dev. s = 0.316 mg/l. Null hypothesis: the limit is not exceeded; alternative hypothesis: the level is too high. Calculate t = 7.59. Choose the risk level: 5 %. D.f. = 4 − 1 = 3. The tabulated value of t, 2.353, is smaller than the calculated one, so the null hypothesis must be rejected: the concentration is too high.

Hypothesis testing. Compare two means: compare two sets of parallel measurements from different samples. Do the two samples differ significantly? A two-sided test:

t = (|x̄1 − x̄2| / s_d) · √( n1·n2 / (n1 + n2) )

s_d² = [ (n1 − 1)·s1² + (n2 − 1)·s2² ] / (n1 + n2 − 2)

D.f. = n1 + n2 − 2

Hypothesis testing. Do two production batches differ? Quality control tests the day and night shifts at a refinery. The octane numbers of parallel measurements are (1: day) 94.92, 95.07, 94.96, 95.02, 94.99, 94.93; (2: night) 95.03, 95.08, 94.98, 95.03, 95.01, 94.99. Means: (1) 94.98; (2) 95.02. St.dev.: (1) 0.057; (2) 0.036. Weighted (pooled) st.dev. = 0.048. Student's t = 1.443, d.f. = 10. Choose risk level 2.5 % and read column 0.0125: t = 2.634. Comparison: the calculated value is smaller, so we cannot say that the two results differ; only random variation is observed.
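The two-batch comparison can be sketched the same way (again a standard-library sketch, not the lecturer's code). With unrounded means the statistic is about 1.39 rather than the slide's 1.443, which used the rounded means 94.98 and 95.02; the conclusion is the same either way.

```python
# Two-sided two-sample t-test with pooled (weighted) standard deviation.
import math
import statistics

def pooled_t(sample1, sample2):
    """Return (t, d.f.) for the pooled two-sample t-test."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    t = abs(m1 - m2) / sd * math.sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2

day   = [94.92, 95.07, 94.96, 95.02, 94.99, 94.93]
night = [95.03, 95.08, 94.98, 95.03, 95.01, 94.99]
t, df = pooled_t(day, night)
print(round(t, 3), df)   # compare with t(risk 0.0125, d.f. 10) = 2.634
```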

Hypothesis testing. Dixon's Q test for outliers. Can be applied also to very few observations. Arrange your n observations in ascending order and calculate the numbers

Q1 = (x2 − x1) / (xn − x1);   Qn = (xn − x(n−1)) / (xn − x1)

Null hypothesis: the value is not an outlier. Accepted if the calculated Q is less than the tabulated one.

Hypothesis testing. Dixon's Q test for outliers: critical values of the Q test at the 1 % risk level. Number of observations = n.

n    Q       n    Q
3    0.99    11   0.50
4    0.89    12   0.48
5    0.76    13   0.47
6    0.70    14   0.45
7    0.64    15   0.44
8    0.59    20   0.39
9    0.56    30   0.34
10   0.53

Hypothesis testing. Dixon's test for outliers. People of the following ages take part in a bus trip to the theatre in Helsinki: 6, 7, 5, 6, 7, 6, 103, 8, 7, 5. Order them: 5, 5, 6, 6, 6, 7, 7, 7, 8, 103. Q1 = 0, so 5 is not an outlier; Q2 = 0.969, so 103 certainly is an outlier.
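A short sketch of Dixon's Q test applied to the bus-trip ages (standard library only; the function name is mine):

```python
def dixon_q(values):
    """Return (Q_low, Q_high) for the smallest and largest value."""
    x = sorted(values)
    spread = x[-1] - x[0]
    q_low  = (x[1] - x[0]) / spread      # tests the smallest value
    q_high = (x[-1] - x[-2]) / spread    # tests the largest value
    return q_low, q_high

ages = [6, 7, 5, 6, 7, 6, 103, 8, 7, 5]
q1, q2 = dixon_q(ages)
# n = 10, tabulated Q(1 %) = 0.53: 5 is kept, 103 is rejected
print(round(q1, 3), round(q2, 3))
```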

Hypothesis testing. Grubbs' test for outliers. An observation x* is not an outlier in a series if

T = |x* − x̄| / s < T(tabulated)

Hypothesis testing. Grubbs' outlier test: critical values at the 95 % and 99 % levels. Number of observations = n.

n    T(95%)   T(99%)
3    1.15     1.16
4    1.46     1.49
5    1.67     1.75
6    1.82     1.94
7    1.94     2.10
8    2.03     2.22
9    2.11     2.32
10   2.18     2.41
12   2.29     2.55
15   2.41     2.71
20   2.56     2.88
30   2.75     3.10
40   2.87     3.24
50   2.96     3.34
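Grubbs' statistic for the same age series can be computed directly (a sketch; with x̄ = 16 and s near 30.6, the suspect value 103 gives T of about 2.84, well above the tabulated 2.18 for n = 10):

```python
# Grubbs' test statistic T = |x* - mean| / s for a suspected outlier x*.
import statistics

def grubbs_statistic(values, candidate):
    """Return T for the candidate outlier in the series."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return abs(candidate - mean) / s

ages = [5, 5, 6, 6, 6, 7, 7, 7, 8, 103]
t = grubbs_statistic(ages, max(ages))
print(round(t, 2))   # compare with T(95 %) = 2.18 for n = 10
```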

Hypothesis testing. Outliers in linear regression. To find out whether observation k (value y_k) is an outlier: 1) calculate a new regression with observation k removed; 2) calculate the residual e_k = y_k(obs) − y_k(calc).

ANOVA, analysis of variance. Used to test differences between batches. Used as an analysis tool for designed experiments. Requires several parallel measurements (replicates) of each batch (or experiment).

Anova. One-way analysis. Assume that four different samples are taken from the waste water of a factory to study the potassium concentration (mg/l). Each sample is analysed by a different crew. Three parallel measurements are made to determine the concentration of each sample.

Replicate   Batch 1   Batch 2   Batch 3   Batch 4
1           10.2      10.6      10.3      10.5
2           10.4      10.8      10.4      10.7
3           10.0      10.9      10.7      10.4
Mean        10.20     10.77     10.47     10.53

Anova. Variation between samples. With the grand mean ȳ_total = 10.49, the between-sample (factor) sum of squares over the four batch means is

SSQ_fact = Σ_{j=1..4} n_j (ȳ_j − ȳ_total)² = 0.489

Anova. Variation within samples. The residual (within-sample) sum of squares is

SSQ_R = Σ_{j=1..4} Σ_{i=1..n_j} (y_ij − ȳ_j)² = 0.260

Anova. Total variation. The total (corrected) sum of squares is

SSQ_corr = Σ_{j=1..q} Σ_{i=1..n_j} (y_ij − ȳ_total)² = 0.749

Anova. The total variation breaks down into contributions: SSQ_corr = SSQ_fact + SSQ_R, i.e. 0.749 = 0.489 + 0.260.
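The three sums of squares can be reproduced from the potassium table in a few lines of Python (a sketch, not the lecturer's code):

```python
# One-way ANOVA sums of squares for the waste-water potassium data.
batches = [
    [10.2, 10.4, 10.0],   # batch 1
    [10.6, 10.8, 10.9],   # batch 2
    [10.3, 10.4, 10.7],   # batch 3
    [10.5, 10.7, 10.4],   # batch 4
]

all_values = [y for batch in batches for y in batch]
y_total = sum(all_values) / len(all_values)          # grand mean, 10.49
batch_means = [sum(b) / len(b) for b in batches]

# Between-batch (factor) and within-batch (residual) sums of squares
ssq_fact = sum(len(b) * (m - y_total) ** 2
               for b, m in zip(batches, batch_means))
ssq_r = sum((y - m) ** 2
            for b, m in zip(batches, batch_means) for y in b)
ssq_corr = sum((y - y_total) ** 2 for y in all_values)

print(round(ssq_fact, 3), round(ssq_r, 3), round(ssq_corr, 3))
```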

PCA, principal component analysis. PCA finds a direction along which the points lie. The data matrix X has objects as rows and features as columns; here the features are Ca (column values 2, 3, 4) and pH (column values 1, 2, 3):

X = [ 2  1
      3  2
      4  3 ]

PCA. What does it mean? [Figure: the objects plotted in the (x = Ca, y = pH) plane; the first principal component points along the direction (1 1).]

PCA. What is it? PCA classifies the observations; it does not perform any regression. The rows of

X = [ 2  1
      3  2
      4  3 ]

are labelled Low, Medium and High.
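For this three-point toy data the variance-covariance matrix is only 2 × 2, so the eigenproblem can be solved in closed form (an editorial sketch; the points are exactly collinear, so the second eigenvalue is zero and the first eigenvector points along (1 1)):

```python
# PCA on the toy data via the 2x2 variance-covariance matrix.
import math

# Objects as rows, features (Ca, pH) as columns
X = [[2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]

n = len(X)
means = [sum(row[j] for row in X) / n for j in range(2)]
centered = [[row[j] - means[j] for j in range(2)] for row in X]

# 2x2 variance-covariance matrix
c = [[sum(r[i] * r[j] for r in centered) / (n - 1)
      for j in range(2)] for i in range(2)]

# Closed-form eigenvalues of a symmetric 2x2 matrix
tr = c[0][0] + c[1][1]
det = c[0][0] * c[1][1] - c[0][1] ** 2
lam1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2   # largest spread
lam2 = (tr - math.sqrt(tr * tr - 4 * det)) / 2

# Eigenvector of lam1 (valid when the covariance c[0][1] != 0):
# this is the direction of the first principal component
v = (c[0][1], lam1 - c[0][0])
norm = math.hypot(*v)
v = (v[0] / norm, v[1] / norm)
print(lam1, lam2, v)
```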


PCA Next principal component The next principal direction with the next largest spread must be orthogonal to the first one.

PCA. The second principal component. [Figure: the same data with the first principal component along (1 1) and the second, orthogonal component along (1 −1).]

PCA. How is it done? The direction of largest spread needs to be found. The spread along the coordinate axes is given by the variance-covariance matrix: its eigenvalues give the characteristic spreads, and the corresponding eigenvectors give the directions. Eigenvectors are automatically orthogonal. So, diagonalize the matrix. Only, you don't.

PCA. How is it done? Diagonalization gives ALL eigenvalues, but you only need the few largest ones. Use special mathematical techniques instead.

PCA. Eigenvalues. The spread of the first component is the largest, that of the second smaller, and so on. Two or three components usually explain all the spread down to experimental errors. [Figure: eigenvalue (spread) plotted against component number 1-5; the later, small eigenvalues do not differentiate the observations.]

PCA. How is it done, then? Break down the observation matrix X into a product of a scores matrix T and a loadings matrix L:

X = [ 2  1      T = [ 2        L^T = ( 1  1 )
      3  2            3
      4  3 ]          4 ]      (T: scores, L: loadings)

X = T L^T. Compare: y = a x. (This example is mathematically inconsistent!)

PCA. Loadings. The loadings matrix tells the direction of the principal component. [Figure: the (x = Ca, y = pH) plane with the principal component drawn along L^T = (1 1).]

PCA. Scores. The scores matrix tells where the points lie along the new coordinate axis. [Figure: the points projected onto the principal component in the (x = Ca, y = pH) plane.]

PCA. A real case. Hair samples from a crime site were analyzed. The following elemental compositions were detected in the hairs of the suspects:

Hair   Cu     Mn     Cl     Br     I
1      9.2    0.30   1730   12.0   3.6
2      12.4   0.39   930    50.0   2.3
3      7.2    0.32   2750   65.3   3.4
4      10.2   0.36   1500   3.4    5.3
5      10.1   0.50   1040   39.2   1.9
6      6.5    0.20   2490   90.0   4.6
7      5.6    0.29   2940   88.0   5.6
8      11.8   0.42   867    43.1   1.5
9      8.5    0.25   1620   5.2    6.2
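For real data a full eigensolver is not needed for the first component; plain power iteration on the correlation matrix finds the largest eigenvalue and its eigenvector, as suggested above. A standard-library sketch (the autoscaling and the iteration count are my choices, not the lecturer's):

```python
# First principal component of the hair data by power iteration.
import math
import statistics

# Rows = hairs, columns = Cu, Mn, Cl, Br, I
data = [
    [9.2,  0.30, 1730, 12.0, 3.6],
    [12.4, 0.39,  930, 50.0, 2.3],
    [7.2,  0.32, 2750, 65.3, 3.4],
    [10.2, 0.36, 1500,  3.4, 5.3],
    [10.1, 0.50, 1040, 39.2, 1.9],
    [6.5,  0.20, 2490, 90.0, 4.6],
    [5.6,  0.29, 2940, 88.0, 5.6],
    [11.8, 0.42,  867, 43.1, 1.5],
    [8.5,  0.25, 1620,  5.2, 6.2],
]

n, nvar = len(data), len(data[0])
# Autoscale each column (mean 0, st.dev. 1) so the units do not dominate
cols = list(zip(*data))
scaled = [[(x - statistics.mean(c)) / statistics.stdev(c) for x in c]
          for c in cols]

# Correlation matrix of the scaled variables
corr = [[sum(scaled[i][k] * scaled[j][k] for k in range(n)) / (n - 1)
         for j in range(nvar)] for i in range(nvar)]

# Power iteration converges to the dominant eigenvector (PC1 loadings)
v = [1.0] * nvar
for _ in range(200):
    w = [sum(corr[i][j] * v[j] for j in range(nvar)) for i in range(nvar)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
lam = sum(v[i] * sum(corr[i][j] * v[j] for j in range(nvar))
          for i in range(nvar))

print(round(lam, 2), [round(x, 2) for x in v])
```

Because Cu and Cl are strongly anticorrelated in this data, their loadings on the first component have opposite signs, matching the loading plot below.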

PCA. Scores. Consider two principal components. [Figure: score plot of the nine hair samples, labelled 1-9, in the PC1-PC2 plane.]

PCA. Loadings. The loadings tell how much the original variables contribute to each principal component. [Figure: loadings of Cu, Mn, Cl, Br and I on PC1 and PC2; Cu and Cl lie at opposite ends of PC1.]

PCR, principal component regression: (multivariate) linear regression along the principal components. Only one (or a few) variable(s) are needed, giving maximal resolving power.

PLS, partial least squares. Linear regression: y = x a. Multivariate regression: y = X a. PLS: Y = U Q.

Cluster analysis Cluster analysis finds observations that are more similar to each other than to observations outside the cluster.

Cluster analysis. Distance. Cluster analysis is based on the distance (or similarity) between objects: city-block distance, Euclidean distance, Pearson distance, Mahalanobis distance, ...

Cluster analysis. City-block distance:

d12 = |x11 − x21| + |x12 − x22|

[Figure: two points in the (Feature 1, Feature 2) plane with the distance measured along the coordinate axes.]

Cluster analysis. Euclidean distance:

d12 = [ (x11 − x21)² + (x12 − x22)² ]^(1/2)

[Figure: the same two points with the straight-line distance.]

Cluster analysis. Pearson distance:

d_ij = [ Σ_{k=1..K} (x_ik − x_jk)² / s_k² ]^(1/2)

Cluster analysis. Example data: concentrations of calcium and phosphate in six blood serum samples (mg per 100 ml).

Object   Calcium   Phosphate
1        8.0       5.5
2        8.25      5.75
3        8.7       6.3
4        10.0      3.0
5        10.25     4.0
6        9.75      3.5

For example, d12 = [ (8.0 − 8.25)² + (5.5 − 5.75)² ]^(1/2) = 0.354.

Cluster analysis. Distance matrix:

Object   1       2       3       4       5       6
1        0
2        0.354   0
3        1.063   0.711   0
4        3.201   3.260   3.347   0
5        2.704   2.658   2.774   1.031   0
6        2.658   2.704   2.990   0.559   0.707   0

The smallest distance is d12 = 0.354, so objects 1 and 2 are joined into cluster 1*. [Dendrogram: objects 1-6 with 1 and 2 linked.]

Cluster analysis. Second distance matrix:

Object   1*      3       4       5       6
1*       0
3        1.774   0
4        3.231   3.347   0
5        2.681   2.774   1.031   0
6        2.681   2.990   0.559   0.707   0

The smallest distance is now d46 = 0.559, so objects 4 and 6 are joined into cluster 4*. [Dendrogram: 1-2 linked and 4-6 linked.]

Cluster analysis. Third distance matrix:

Object   1*      3       4*      5
1*       0
3        1.774   0
4*       2.956   3.169   0
5        2.681   2.774   0.869   0

The smallest distance is now 0.869, so cluster 4* and object 5 are joined into cluster 5*. [Dendrogram: 1-2 linked; 4, 6 and 5 linked.]

Cluster analysis. Fourth distance matrix:

Object   1*      3       5*
1*       0
3        1.774   0
5*       2.819   2.972   0

[Dendrogram: objects 1-6 with the merges so far: (1, 2), (4, 6), (4*, 5).]
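The whole agglomeration can be sketched in Python (an editorial sketch, not the lecturer's code). Most of the updated distances on the slides follow from averaging the two merged rows, i.e. the WPGMA rule; the sketch recomputes every distance from the raw coordinates and applies that average throughout, so individual entries can differ slightly from the slides, but the merge order is the same.

```python
# Agglomerative clustering of the blood serum data with WPGMA updates.
import math

# (calcium, phosphate) per object
points = {1: (8.0, 5.5), 2: (8.25, 5.75), 3: (8.7, 6.3),
          4: (10.0, 3.0), 5: (10.25, 4.0), 6: (9.75, 3.5)}

# Initial Euclidean distance matrix (upper triangle only)
dist = {(i, j): math.dist(points[i], points[j])
        for i in points for j in points if i < j}

clusters = {i: (i,) for i in points}
merges = []
while len(clusters) > 1:
    # Find and record the smallest remaining distance
    (a, b), d = min(dist.items(), key=lambda kv: kv[1])
    merges.append((clusters[a], clusters[b], round(d, 3)))
    for c in clusters:
        if c not in (a, b):
            pa, pb = tuple(sorted((a, c))), tuple(sorted((b, c)))
            # WPGMA update: plain average of the distances to the two members
            dist[pa] = (dist[pa] + dist[pb]) / 2
    clusters[a] += clusters[b]        # keep the merged cluster under label a
    del clusters[b]
    dist = {k: v for k, v in dist.items() if b not in k}

for m in merges:
    print(m)
```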