
Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares

R. Gutierrez-Osuna
Computer Science Department, Wright State University, Dayton, OH 45435, USA

Abstract. The transient response of metal-oxide sensors exposed to mild odours is oftentimes highly correlated with the behaviour of the array during the preceding wash and reference cycles. Since wash/reference gases are virtually constant over time, variations in their transient response can be used to estimate the amount of sensor drift present in each experiment. We employ canonical correlation analysis and partial least squares to find a subset of latent variables that summarize the linear dependencies between the odour and wash/reference responses. Ordinary least squares regression is then used to subtract these latent variables from the odour response. Experimental results on an odour database of four cooking spices, collected on a ten-sensor array over a period of three months, show significant improvements in predictive accuracy.

1. Introduction

Besides selectivity and sensitivity constraints, sensor drift constitutes the main limitation of current gas sensor array instruments [1]. In order to improve baseline stability, it is customary to expose the sensor array to a sequence of wash and reference gases prior to sampling a target odour. Since wash/reference (W/R) gases can be assumed to remain stable over time, variations in sensor response during these stages may be utilized, at no extra cost, as measures of the drift associated with each experiment and, therefore, compensated for in software.

The instrument employed in this study is an array of ten commercial metal-oxide sensors. During operation, the array is initially washed for ten seconds with room air bubbled through a two-per-cent n-butanol dilution in order to flush residues from previous odour samples. This wash stage is followed by a three-minute reference stage with charcoal-filtered air, which provides a baseline for the subsequent ninety-second odour stage, in which the array is finally exposed to the target odour. Figure 1 illustrates a typical sensor transient response obtained with this procedure. To preserve information from the sensor dynamics, we apply the windowed time slicing (WTS) technique [2, 3] to compress each exponential-shaped transient down to four weighted integrals, as shown in the lower portion of Figure 1. With ten sensors, this compression technique yields 40-dimensional feature vectors x_W, x_R and y for the wash, reference and odour transients, respectively. x_W and x_R are subsequently combined to form a regression vector x with 80 dimensions.
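
The exact WTS kernels of [2, 3] are not reproduced here; the sketch below only illustrates the idea, assuming (purely for illustration) Gaussian windows spread along the transient, each producing one weighted integral, four per sensor and stage.

```python
# Illustrative sketch of windowed time slicing (WTS): one sensor transient is
# compressed into four weighted integrals. Gaussian windows are an assumption,
# not the kernels used in the original work (see refs. [2, 3]).
import numpy as np

def wts_features(transient, n_windows=4):
    """transient: 1-D array with one sensor's response over one stage."""
    transient = np.asarray(transient, dtype=float)
    t = np.arange(len(transient))
    centres = np.linspace(0, len(transient) - 1, n_windows + 2)[1:-1]
    width = len(transient) / (2.0 * n_windows)
    feats = []
    for c in centres:
        kernel = np.exp(-0.5 * ((t - c) / width) ** 2)
        kernel /= kernel.sum()                    # normalise the window
        feats.append(float(kernel @ transient))   # weighted integral
    return np.array(feats)                        # four features per sensor/stage
```

Applied to all ten sensors and the wash, reference and odour stages, this yields the 40-dimensional vectors x_W, x_R and y described above; concatenating x_W and x_R gives the 80-dimensional regression vector x.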

Figure 1. Extraction of windowed-time-slicing features from the wash, reference and odour stages

2. Drift reduction approach

Whereas variance in the W/R vector x is primarily due to drift, variance in y results from the combined effect of drift and the odour-specific interaction with the sensing material. Therefore, we seek to remove the variance in y that can be explained by (is correlated with) x while preserving odour-related variance. We propose a drift reduction algorithm that consists of three steps:

1. Find linear projections $\tilde{x} = Ax$ and $\tilde{y} = By$ that are maximally correlated:

   $\{A, B\} = \arg\max \left[ \rho(Ax, By) \right]$   (1)

   The vectors $\tilde{x}$ and $\tilde{y}$ are low-dimensional projections that summarize the linear dependencies between x and y.

2. Find a regression model $y_{pred} = W_{OLS}\,\tilde{y}$ using Ordinary Least Squares (OLS):

   $W_{OLS} = \arg\min_{W} \left\| y - W\tilde{y} \right\|$   (2)

   The OLS prediction vector $y_{pred}$ contains the variance in the odour vector y that can be explained by $\tilde{y}$ and, indirectly, by the W/R vector x.

3. Deflate y and use the residual z as a drift-reduced odour vector for classification purposes:

   $z = y - y_{pred} = y - W_{OLS} B y$   (3)

To find the projection matrices A and B for the first step of the algorithm we employ Canonical Correlation Analysis (CCA) and Partial Least Squares (PLS). CCA is a multivariate statistical technique closely related to Principal Components Analysis (PCA). Whereas PCA finds the directions of maximum variance in each separate x- or y-space, CCA identifies directions along which x and y co-vary.
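
A minimal sketch of the three steps above, assuming the W/R features X (one row per experiment, 80 columns, optionally augmented with a time stamp) and odour features Y (40 columns) are available as NumPy arrays. Scikit-learn's CCA stands in for step 1 (PLSCanonical could be substituted for the PLS variant); this is an illustration of the procedure, not the original implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA  # PLSCanonical gives the PLS variant

def drift_reduce(X, Y, n_latent=3):
    """Return drift-reduced odour vectors z = y - W_OLS * B * y (eqs. 1-3)."""
    # Step 1: latent variables summarizing linear dependencies between X and Y
    cca = CCA(n_components=n_latent).fit(X, Y)
    _, Y_tilde = cca.transform(X, Y)           # canonical variates of Y (y_tilde = B y)

    # Step 2: OLS regression of the odour vectors on their canonical variates
    design = np.hstack([Y_tilde, np.ones((Y_tilde.shape[0], 1))])  # intercept absorbs the mean
    W_ols, *_ = np.linalg.lstsq(design, Y, rcond=None)

    # Step 3: deflation -- keep only the residual as the drift-reduced odour vector
    Z = Y - design @ W_ols
    return Z
```

In the experiments of Section 4, the drift model is fitted on past training data plus the current test day, and the residuals Z are what the classifier sees.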

It can be shown [4] that the columns of the projection matrices A and B are the eigenvectors of:

   $(\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} - \lambda I)\, a = 0$
   $(\Sigma_{yy}^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy} - \lambda I)\, b = 0$   (4)

where $\{\Sigma_{xx}, \Sigma_{yy}\}$ are the within-space covariance matrices and $\{\Sigma_{xy}, \Sigma_{yx}\}$ are the between-space covariance matrices. The first pair of latent variables (canonical variates) $\tilde{x} = a^{T}x$ and $\tilde{y} = b^{T}y$ corresponds to the eigenvectors a and b associated with the largest eigenvalue of (4). This eigenvalue $\lambda$, identical for both equations, is the squared canonical correlation coefficient between $\tilde{x}$ and $\tilde{y}$. Subsequent canonical variates are associated with decreasing eigenvalues and are uncorrelated with previous variates. CCA will find as many pairs of canonical variates as the minimum of P and Q, the dimensionalities of x and y, respectively.

Alternatively, one may use PLS to find the canonical variates (scores) $\tilde{x}_k$ and $\tilde{y}_k$. It must be noted that, whereas CCA finds all the eigenvector pairs (loadings) $a_k$ and $b_k$ simultaneously, PLS operates sequentially, extracting one pair of loadings/scores at a time using the NIPALS algorithm [5]. Deflation of x and y in PLS is also performed sequentially, after the extraction of each pair of loadings/scores. Furthermore, CCA maximizes the correlation coefficient between each pair $(\tilde{x}_k, \tilde{y}_k)$, while PLS takes into consideration both the correlation and the variance of x and y [6]. Both the CCA and PLS versions of the drift-reduction algorithm have been implemented and are reported in this article.

3. Preliminary analysis

The proposed drift-reduction method is evaluated on an odour database of four spices (ginger, red pepper, cumin and cloves) containing 374 examples collected during 24 different days over a period of three months. The array consists of ten metal-oxide sensors from Capteur (models AA4, AA20, G5, AA25, G7, LG9 and LG0) and Figaro (models 2620, 260 and 26). We perform a preliminary analysis by feeding the entire database to the drift-reduction algorithm. The resulting correlation coefficients for each pair of latent variables $\rho(\tilde{x}_k, \tilde{y}_k)$ are shown in Figure 2. This result indicates that PLS is able to reduce correlations between x and y with fewer components than CCA, a reasonable result since PLS performs deflation sequentially, after the extraction of each latent variable. The figure also suggests that there are 10 major canonical variates (interestingly, the same as the number of sensors), although Bartlett's sequential test [4] indicates that there are 12 statistically significant variates.

Figure 2. Correlation coefficient $\rho(\tilde{x}_k, \tilde{y}_k)$ for each pair of latent variables
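
As an illustration of eq. (4), the sketch below computes the CCA loadings directly from the sample covariance matrices, which is essentially what the preliminary analysis above does when the entire database is fed to the algorithm. The small ridge term added to the within-space covariances is a numerical safeguard not present in the original formulation, and X and Y are assumed to be column-centred.

```python
import numpy as np

def cca_loadings(X, Y, reg=1e-8):
    """Columns of A, B solve eq. (4); eigenvalues are squared canonical correlations."""
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])   # within-space covariance of x
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])   # within-space covariance of y
    Sxy = X.T @ Y / n                              # between-space covariance
    Syx = Sxy.T

    # (Sxx^-1 Sxy Syy^-1 Syx - lambda I) a = 0
    Ma = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Syx)
    # (Syy^-1 Syx Sxx^-1 Sxy - lambda I) b = 0
    Mb = np.linalg.solve(Syy, Syx) @ np.linalg.solve(Sxx, Sxy)

    lam_a, A = np.linalg.eig(Ma)
    lam_b, B = np.linalg.eig(Mb)
    # Sort loadings by decreasing eigenvalue (squared canonical correlation)
    A = A[:, np.argsort(lam_a.real)[::-1]].real
    B = B[:, np.argsort(lam_b.real)[::-1]].real
    lam = np.sort(lam_a.real)[::-1]
    return A, B, lam
```

Only min(P, Q) of the eigenvalues are non-trivial, matching the statement above; the canonical correlation coefficients of Figure 2 are the square roots of these eigenvalues.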

To illustrate the performance of the algorithm, we select the first 10 latent variables and plot the trajectory of the four WTS features over time before and after drift reduction. Figure 3 shows these trajectories for the AA20 sensor. Examples are sorted by class (notice the four distinctive blocks) and, within each class, by date. Drift-reduced trajectories have been shifted vertically for visualization purposes. Both the CCA and PLS versions of the algorithm significantly reduce the linear component of the drift. In other words, the algorithm is able to reduce drift-related variance (noise) while preserving odour-related variance (information). It is also interesting to notice that the first WTS window is deflated almost entirely, a reasonable result since it is highly correlated with the last WTS window of the reference transient.

Figure 3. Trajectory of WTS features before and after drift reduction

4. Analysis of predictive accuracy

A more interesting and formal assessment of performance is predictive accuracy, that is, the ability of a subsequent pattern classifier to correctly classify previously unseen test data. In the context of drift reduction, predictive accuracy should be measured using test examples collected on different days than those used for training, as illustrated in Figure 4. Formally, given a database collected over M days, we seek to classify data from day k using data from days i through j, with i ≤ j < k. We denote by W (width) the number of days used for training (W = j - i + 1) and by D (distance) the number of days between the training and test days (D = k - j). These two parameters, also shown in Figure 4, can significantly affect the predictive accuracy of a classifier:

- With increasing values of W, the training set incorporates data from more days, which may allow a classifier to automatically average out the drift component.
- With increasing values of D, the effects of drift accumulate on the test set, which may result in lower predictive accuracy.

Figure 4. Illustration of W and D for predictive accuracy measurements

To analyze the effect of these two parameters, together with the proposed drift-reduction technique, on predictive accuracy, we use a k-nearest-neighbor (kNN) classifier [7]. This voting rule classifies an unlabeled example by choosing the majority class among the k closest examples in the training set. Our implementation uses the Euclidean metric, with the number of neighbors set to one-half the average number of examples per class in the training set. To limit the curse of dimensionality, the feature vectors y and z are pre-processed with PCA, and the largest eigenvalues containing 99% of the variance are retained. PCA eigenvectors are computed from the training set and then used to project the test data. Depending on the number of training examples (a function of W), this results in 6 to 10 principal components being retained for y and 3 to 6 principal components for z. As shown in the previous section, the drift-reduction algorithm significantly deflates the first WTS feature, which explains why fewer principal components are required to reach the same percentage of the variance.

It must be noted that this analysis is more challenging than the preliminary validation shown in Figure 3, in which the entire dataset (days 1 through 24) was passed to the algorithm. In the more realistic scenario described in this section, only past training data (days i through j) and current test data (day k) are used to perform drift reduction. Intuitively, with fewer training data, fewer latent variables should be used to deflate the odour vector y. In fact, using ten latent variables yields only modest improvements in predictive accuracy compared to raw data. After some experimentation, we determined that three was a suitable number of latent variables. We also noticed that, in order to eliminate the linear component of the drift (see Figure 3), it was necessary to incorporate timing information into the drift-reduction algorithm, especially for large values of D. For this reason, we decided to augment the regression vector x with the time stamp of each sample. This had not been necessary in the previous section, where the entire sequence of days was presented to the drift-reduction algorithm. With the selection of the first three latent variables and the addition of time stamps, we obtained significant improvements in predictive accuracy, which are reported below.

The effect of distance on predictive accuracy is shown in Figure 5 for different values of the width parameter. As expected, the performance on raw data decreases considerably with increasing distance (older training data). After the application of CCA- or PLS-based drift reduction, predictive accuracy does not decrease with distance, an indication that the algorithm is able to compensate for drift. In fact, CCA and PLS present a slight increase in predictive accuracy with distance. This counter-intuitive result could be attributed to the fact that both training data and unlabeled test data are used to estimate the drift model $y_{pred} = W_{OLS} B y$, making the slope $W_{OLS} B$ more accurate with increasing distances as a result of the time stamps. This conjecture, however, deserves further attention.
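
A hedged sketch of this evaluation protocol is given below, assuming arrays Z (drift-reduced or raw feature vectors), labels and days (collection day of each example) are available; scikit-learn's PCA, with the 99%-variance criterion fitted on the training data only, and KNeighborsClassifier stand in for the classifier described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def evaluate_split(Z, labels, days, i, j, k):
    """Train on days i..j (W = j - i + 1), test on day k (D = k - j)."""
    Z, labels, days = np.asarray(Z), np.asarray(labels), np.asarray(days)
    train = (days >= i) & (days <= j)
    test = days == k

    # PCA eigenvectors from the training set only, retaining 99% of the variance
    pca = PCA(n_components=0.99).fit(Z[train])
    Z_tr, Z_te = pca.transform(Z[train]), pca.transform(Z[test])

    # kNN with k set to one-half the average number of training examples per class
    n_classes = len(np.unique(labels[train]))
    k_nn = max(1, int(0.5 * train.sum() / n_classes))
    clf = KNeighborsClassifier(n_neighbors=k_nn, metric='euclidean')
    clf.fit(Z_tr, labels[train])
    return clf.score(Z_te, labels[test])   # predictive accuracy for this (W, D) pair
```

Sweeping i, j and k over the 24 collection days produces the accuracy-versus-Distance and accuracy-versus-Width curves of Figures 5 and 6.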

Figure 5. Effect of Distance on predictive accuracy for different Widths (raw, CCA and PLS; Width = 1, 5 and 10)

The effect of width is shown in Figure 6. The three datasets yield higher predictive accuracy with increasing values of width (larger training sets). The higher slopes of the CCA and PLS curves indicate that the drift-reduction technique allows the subsequent pattern classifier to make better use of additional training data, up to a saturation point of 95% predictive accuracy, which can be clearly observed in the rightmost plot of Figure 6. No significant differences in predictive accuracy between CCA and PLS can be found in these two figures.

Figure 6. Effect of Width on predictive accuracy for different Distances (raw, CCA and PLS; Distance = 1, 5 and 10)

5. Conclusions and future work

We have proposed a drift-reduction algorithm that takes advantage of the drift information contained in the wash/reference stages that precede the odour stage of a typical sampling cycle. The algorithm has been shown to reduce drift-related variance in the sensors and, therefore, to enhance class discrimination and the predictive accuracy of a subsequent pattern classifier. Our experience has shown that the predictive-accuracy performance of the algorithm is sensitive to the number of latent variables, which has been selected manually in the current implementation. Automated selection of latent variables by cross-validation constitutes the next natural step of this work. Further improvements could be obtained by augmenting the regression vector x with temperature, humidity and mass-flow information, as well as with transient information from appropriate calibration mixtures that could be added to the sampling cycle. Finally, the proposed algorithm is limited to linear dependencies between the W/R and odour stages. Non-linear extensions of CCA [8] and PLS [9] will be required for situations where these relationships are markedly non-linear.

6. Acknowledgements

This research was supported by awards WSU/OBR 0409 and NSF/CAREER 9984426. We would like to acknowledge Alex Perera (Universitat de Barcelona) for his valuable comments and suggestions on earlier drafts of this document.

7. References

[1] Gardner J W and Bartlett P N 1999 Electronic Noses: Principles and Applications (Oxford: Oxford University Press)
[2] Kermani B G 1996 On Using Artificial Neural Networks and Genetic Algorithms to Optimize Performance of an Electronic Nose Ph.D. Dissertation, North Carolina State University
[3] Gutierrez-Osuna R and Nagle H T 1999 IEEE Trans. Syst. Man Cybern. B 29(5) 626-632
[4] Dillon W R and Goldstein M 1984 Multivariate Analysis: Methods and Applications (New York: Wiley)
[5] Geladi P and Kowalski B R 1986 Analytica Chimica Acta 185 1-17
[6] Burnham A J et al. 1999 J. Chemometrics 13 49-65
[7] Duda R O and Hart P E 1973 Pattern Classification and Scene Analysis (New York: Wiley)
[8] Lai P L and Fyfe C 1999 Neural Networks 12(10) 1391-1397
[9] Malthouse E C et al. 1997 Computers Chem. Engng 21(8) 875-890