Chemometrics. Matti Hotokka Physical chemistry Åbo Akademi University


Linear regression. Experiment. Consider spectrophotometry as an example. Beer-Lambert law: A = εc. Experiment: make three known references with concentrations c_1, c_2, c_3 and measure the absorbances A_1, A_2 and A_3. Place a straight line through the points: A = a + bc. Measure the absorbance A of the unknown sample and read its concentration from the calibration curve.

Linear regression. Calibration curve. A_calc(c) = a + bc. We have good theoretical grounds for saying that the calibration model is linear. The intercept a is determined by additional disturbing components in the sample and can be ignored.

Linear regression. How to solve. One measured value, one regressor. Equation: y_calc(x) = b_0 + b_1 x. To be solved: b_0, b_1. How to:
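A minimal numpy sketch of this calibration fit; the reference concentrations, absorbances and the unknown sample value below are hypothetical, not taken from the slides:

```python
import numpy as np

# Hypothetical calibration data: three reference concentrations and absorbances
c = np.array([0.1, 0.2, 0.3])        # known concentrations c_1, c_2, c_3
A = np.array([0.12, 0.23, 0.33])     # measured absorbances A_1, A_2, A_3

# Fit the straight line A = b0 + b1*c by ordinary least squares
X = np.column_stack([np.ones_like(c), c])
b, *_ = np.linalg.lstsq(X, A, rcond=None)
b0, b1 = b

# Read the concentration of an unknown sample from the calibration line
A_unknown = 0.20
c_unknown = (A_unknown - b0) / b1
print(b0, b1, c_unknown)
```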

Multiregression. Experiment. Two components a and b. Measure at two wavelengths, λ_1 and λ_2. Beer-Lambert law, but different for each component and wavelength; the contributions are additive. At λ_1: A_1 = ε_a1 c_a + ε_b1 c_b. At λ_2: A_2 = ε_a2 c_a + ε_b2 c_b. In matrix form: A = E c, where the vector A collects the absorbances, c the concentrations and E the molar absorption coefficients.

Multiregression. Calibrate the model. Obtain the unknown coefficients ε. In this case, use two solutions, each containing only a or only b at a known concentration. The molar absorption coefficients are calculated from A_1' = ε_a1 c_a at λ_1, A_2' = ε_a2 c_a at λ_2, A_1" = ε_b1 c_b at λ_1 and A_2" = ε_b2 c_b at λ_2.
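A short numpy sketch of this two-component calibration and of reading an unknown mixture; all numerical values are hypothetical illustrations:

```python
import numpy as np

# Hypothetical single-component references (concentrations are assumptions)
c_a, c_b = 0.10, 0.20                      # known concentrations of pure a and pure b
A_prime  = np.array([0.50, 0.10])          # A_1', A_2' measured for the pure-a solution
A_second = np.array([0.15, 0.80])          # A_1", A_2" measured for the pure-b solution

# Molar absorption coefficients from the Beer-Lambert law, per wavelength
eps_a = A_prime / c_a                      # eps_a1, eps_a2
eps_b = A_second / c_b                     # eps_b1, eps_b2
E = np.column_stack([eps_a, eps_b])        # 2x2 matrix of coefficients

# Unknown mixture: solve A = E c for the concentrations
A_mix = np.array([0.40, 0.55])
c_mix = np.linalg.solve(E, A_mix)
print(c_mix)                               # estimated c_a, c_b of the mixture
```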

Linear regression again. Standard method. One dimension: y = b_0 + b_1 x = b_0·1 + b_1·x. Generalization of the linear model:
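The generalized formula on the original slide is not reproduced here; as a reconstruction, the standard matrix form of the model and its least-squares solution read:

```latex
\mathbf{y} = \mathbf{X}\,\mathbf{b} + \mathbf{e},
\qquad
\mathbf{X} =
\begin{pmatrix}
1 & x_1 \\
\vdots & \vdots \\
1 & x_n
\end{pmatrix},
\qquad
\hat{\mathbf{b}} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}
```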

ANOVA. In linear regression. Definitions. Sums of squares (each with its matrix expression, calculation and degrees of freedom):
SS_T, total
SS_M, mean
SS_corr, corrected for the mean
SS_fact, factors
SS_R, residuals
SS_lof, lack of fit
SS_pe, pure experimental error
n = number of observations; p = number of coefficients b; f = number of replications.

ANOVA. Example.
No.   x   y
 1    0   0.3
 2    1   2.2
 3    2   3
 4    2   4

ANOVA. Sums of squares.

ANOVA. Quality of the fit. Correlation. Mean sum of squares: MSS = SS divided by its degrees of freedom. F-test for goodness of fit: if this F-value exceeds the critical value in the table, the fit is significant. F-test for lack of fit: this F-value should not exceed the critical value if the model is appropriate.

ANOVA. The example. Goodness of fit: the F-value exceeds the critical value at 5 % risk, 18.51, so the fit is statistically significant. Lack of fit: this value is below the critical value at 5 % risk, 161. The model is appropriate because the lack of fit is not significant.
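The sums of squares and both F-tests for this example can be checked numerically. The sketch below uses the standard regression-ANOVA decomposition (not copied from the slides) with the data from the example table:

```python
import numpy as np
from scipy import stats

# Worked example from the slides: model y = b0 + b1*x
x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])
n, p = len(y), 2

X = np.column_stack([np.ones_like(x), x])
b = np.linalg.solve(X.T @ X, X.T @ y)

# Sums of squares (standard regression ANOVA decomposition)
SS_T    = y @ y                              # total
SS_M    = n * y.mean() ** 2                  # mean
SS_fact = b @ X.T @ y - SS_M                 # factors, corrected for the mean
SS_R    = SS_T - b @ X.T @ y                 # residuals
rep     = y[x == 2.0]                        # replicated point x = 2
SS_pe   = np.sum((rep - rep.mean()) ** 2)    # pure experimental error
SS_lof  = SS_R - SS_pe                       # lack of fit

# F-tests: goodness of fit and lack of fit
F_fit = (SS_fact / (p - 1)) / (SS_R / (n - p))
F_lof = (SS_lof / (n - p - 1)) / (SS_pe / (len(rep) - 1))
print(F_fit, stats.f.ppf(0.95, p - 1, n - p))              # compare with F(0.05; 1, 2) = 18.51
print(F_lof, stats.f.ppf(0.95, n - p - 1, len(rep) - 1))   # compare with F(0.05; 1, 1) = 161
```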

ANOVA. Confidence intervals. Variance-covariance matrix. For an appropriate fit with a low value of SS_lof, MSS_R = s_R² can be used instead of SS_pe. The diagonal elements of the variance-covariance matrix are the variances of the factors b. The confidence limits of a factor b (either b_0 or b_1) are obtained from the corresponding variance, as is the prediction at a given point x_0 = (1 x_0).

ANOVA. The example. At the 5 % risk level, F(0.05; 1, 2) = 18.51. Thus we obtain the confidence limits of the coefficients.
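A numerical sketch of the variance-covariance matrix, the coefficient confidence limits and a prediction for the same example, using MSS_R in place of SS_pe as suggested above; the prediction point x_0 = 1.5 is an arbitrary illustration, and t(0.025; 2)² equals the F(0.05; 1, 2) value quoted on the slide:

```python
import numpy as np
from scipy import stats

x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])
n, p = len(y), 2

X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

# Residual mean square s_R^2 used in place of the pure-error estimate
resid = y - X @ b
s2 = resid @ resid / (n - p)

# Variance-covariance matrix of the coefficients; diagonal = var(b0), var(b1)
V = XtX_inv * s2
t_crit = stats.t.ppf(0.975, n - p)           # note: t_crit**2 = 18.51
ci = [(bi - t_crit * np.sqrt(vi), bi + t_crit * np.sqrt(vi))
      for bi, vi in zip(b, np.diag(V))]

# Prediction and its standard error at a new point x0 = (1, 1.5)
x0 = np.array([1.0, 1.5])
y0 = x0 @ b
se_y0 = np.sqrt(s2 * x0 @ XtX_inv @ x0)
print(b, ci, y0, se_y0)
```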

Multiple linear regression. Ordinary regression. Two-dimensional case. Measurements at three points (x_11, x_12), (x_21, x_22) and (x_31, x_32) are needed. In matrix form this system of equations is written as y = X b (the order has been changed to stress the similarity to one-dimensional regression).

Multiple linear regression. Ordinary regression. If there are several dependent variables y, each with a different equation:
Y (n×m) = X (n×p) · B (p×m) + Residuals (n×m)

Multiple linear regression. Ordinary regression. The equation is solved exactly as the one-dimensional equation. However, if there are linear dependencies between the x's, the system becomes singular and cannot be solved. Prediction: the y_0 vector (dimension 1×m) at a given point x_0 (dimension 1×p) is y_0 = x_0 B.
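A minimal numpy sketch of multi-response least squares and prediction; the data dimensions and values below are hypothetical:

```python
import numpy as np

# Hypothetical data: n = 5 samples, p = 3 regressors (incl. intercept), m = 2 responses
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.random((5, 2))])   # n x p
Y = rng.random((5, 2))                                   # n x m

# Solve Y = X B + residuals by least squares (ill-posed if X is rank-deficient)
B, *_ = np.linalg.lstsq(X, Y, rcond=None)                # p x m

# Prediction of the 1 x m response vector y0 at a new point x0 (1 x p)
x0 = np.array([1.0, 0.4, 0.7])
y0 = x0 @ B
print(y0)
```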

PCR. Principal component regression. In full multicomponent regression it often happens that some of the x's are interdependent, i.e., not linearly independent. To avoid this, only a few coordinate axes are used. They are chosen to be orthogonal. The selection of orthogonal coordinates resembles the PCA method.

PCR. PCA revisited. In PCA, the original data matrix X is written as a product of the scores and loadings, X = T Lᵀ. One method of solving the problem is to use the SVD (singular value decomposition) method. In that case the X matrix is written as X = U W Vᵀ. Here matrix U corresponds to T and V corresponds to L. They are joined by a diagonal matrix W whose diagonal elements are the singular values, w_ii = √λ_ii, where λ_ii are the eigenvalues of XᵀX. The smallest singular values can be forced to zero; the truncated matrix then removes the small eigenvalues that indicate dependencies.

PCR. The SVD method. The solution of the full linear equation is b = (XᵀX)⁻¹ Xᵀ y. Now the matrix X is written in the SVD approximation as X ≈ U W Vᵀ, with the smallest singular values set to zero. Then a pseudo-inverse matrix X⁺ = V W⁻¹ Uᵀ can be used, where W⁻¹ contains the reciprocals of the retained singular values and zeros elsewhere. The solution along the desired number of principal axes is then b = X⁺ y.
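A sketch of PCR through a truncated-SVD pseudo-inverse; the nearly collinear data are synthetic, and keeping k = 2 principal axes is an arbitrary choice for illustration:

```python
import numpy as np

# Synthetic, nearly rank-2 data: n samples, p correlated regressors, one response
rng = np.random.default_rng(1)
n, p = 20, 5
scores = rng.standard_normal((n, 2))
X = scores @ rng.standard_normal((2, p)) + 0.01 * rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.5, -0.2, 0.0, 0.3]) + 0.01 * rng.standard_normal(n)

# SVD of X; keep only the k largest singular values (set the rest to zero)
U, w, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
w_inv = np.zeros_like(w)
w_inv[:k] = 1.0 / w[:k]

# Pseudo-inverse X+ = V W^-1 U^T and the PCR solution b = X+ y
X_pinv = Vt.T @ np.diag(w_inv) @ U.T
b = X_pinv @ y
print(b)
```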

PLS. Partial Least Squares method. In PCA the matrix X is split into a product of the scores matrix and the loadings matrix:
X (n×p) = T (n×d) · Pᵀ (d×p) + E (n×p)

PLS. Partial Least Squares method. In PLS, also the matrix Y is split into a product of the scores matrix and the loadings matrix:
Y (n×m) = U (n×d) · Qᵀ (d×m) + F (n×m)

PLS. Partial Least Squares method. The solution can then be written in terms of the PLS weights; here W is a d×p matrix of PLS weights. Only a few of the eigenvalues are kept, the rest are set to zero.

PLS. Algorithm. Initialize: shift (centre) the columns of the matrices. Initialize: use the first column of the Y matrix as the first Y score vector u.

PLS. Algorithm. (1) Compute the X weights. (2) Scale the weights. (3) Estimate the scores of the X matrix.

PLS. Algorithm. (4) Compute the Y loadings. (5) Generate a new u vector. Repeat from step (1) until u is stationary.

PLS. Algorithm. (6) Determine the scalar coefficient b for this variable. (7) Compute the loadings of the X matrix. (8) Compute the residuals.

PLS. Algorithm. Stopping criterion: calculate the standard error of prediction from cross-validation. If SEP_CV no longer decreases when a further factor is added (i.e., it is greater than for the previous number of factors), the optimum number of dimensions has been reached and the final B coefficients can be calculated. Otherwise use the residuals from step (8) as the new X and Y matrices and continue from the initialization step with an additional dimension.
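A compact NIPALS-style sketch of steps (1)-(8) with deflation, assuming mean-centred data; the cross-validated stopping rule is not implemented here, and the number of factors is fixed by hand:

```python
import numpy as np

def pls_nipals(X, Y, n_factors, tol=1e-10, max_iter=500):
    """NIPALS-style PLS sketch: centre, iterate weights/scores, deflate with residuals."""
    X = X - X.mean(axis=0)                       # initialize: shift (centre) the columns
    Y = Y - Y.mean(axis=0)
    W, T, P, Q, B = [], [], [], [], []
    for _ in range(n_factors):
        u = Y[:, 0].copy()                       # initialize: first column of Y as Y score
        for _ in range(max_iter):
            w = X.T @ u / (u @ u)                # (1) X weights
            w /= np.linalg.norm(w)               # (2) scale the weights
            t = X @ w                            # (3) X scores
            q = Y.T @ t / (t @ t)                # (4) Y loadings
            q /= np.linalg.norm(q)
            u_new = Y @ q                        # (5) new u vector
            if np.linalg.norm(u_new - u) < tol:  # repeat until u is stationary
                u = u_new
                break
            u = u_new
        b = (u @ t) / (t @ t)                    # (6) scalar coefficient b
        p_load = X.T @ t / (t @ t)               # (7) X loadings
        X = X - np.outer(t, p_load)              # (8) residuals become the new X ...
        Y = Y - b * np.outer(t, q)               #     ... and the new Y
        W.append(w); T.append(t); P.append(p_load); Q.append(q); B.append(b)
    return [np.array(a) for a in (W, T, P, Q, B)]

# Hypothetical usage with random data
rng = np.random.default_rng(2)
X = rng.random((10, 4))
Y = rng.random((10, 2))
W, T, P, Q, B = pls_nipals(X, Y, n_factors=2)
print(B)
```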