Chemometrics: Classification of spectra


1 Chemometrics: Classification of spectra. Vladimir Bochko, Jarmo Alander. University of Vaasa, November 1, 2010.

2 Contents
Terminology
Introduction
Big picture
Support Vector Machine: introduction, linear SVM classifier, nonlinear SVM classifier
KNN classifier
Cross-validation
Performance evaluation

3 Terminology The task of pattern recognition is to classify objects into a number of classes. Objects are called patterns or samples. Measurements of particular object parameters are called features, components, or variables. The classifier computes a decision curve that divides the feature space into regions corresponding to the classes. A class is a group of objects characterized by similar features. The decision may not be correct; in this case a misclassification occurs. The patterns used to design the classifier are called training (or calibration) patterns. The patterns used to test the classifier are called test patterns.

4 Introduction If training data are available, we speak of supervised pattern recognition. If training data are not available, we speak of unsupervised pattern recognition, or clustering. We consider only supervised pattern recognition. In this case the training set consists of data X and class labels Y. When we test the classifier using test data X, the classifier predicts class labels Y. Thus, classification requires training (calibration) and testing. The SOLO GUI [3] has the corresponding buttons: calibration and test/validation.

5 Abbreviations
KNN - K-nearest neighbor classifier. Requires labels.
SVM - Support Vector Machine. Requires labels.
PLS - Partial Least Squares. A mapping (compression), regression, and classification technique. Requires labels.
PCA - Principal Component Analysis. A mapping (compression) technique. Labels are not needed.
DA - Discriminant Analysis, e.g. PLSDA, SVMDA. DA means that classification is used.
MSC - Multiplicative Scatter Correction. A preprocessing technique.
SNV - Standard Normal Variate transformation. A preprocessing technique.

6 Big picture Figure: the overall classification/prediction pipeline. Measured spectra are preprocessed (MSC, SNV, smoothing, derivatives) and compressed (PCA, PLS); a classifier (KNN, SVM, PLSDA, SVMDA) is designed from training data X and training labels Y during training/validation to produce a model; the model is then applied to the tested data X to predict labels Y, and the system is evaluated.

7 Example We have green, yellow, orange, and red tomatoes. From the salesman's viewpoint the orange and red tomatoes are suitable for sale. Therefore the tomatoes are divided into two classes: green/yellow and orange/red.

8 Measurement The typical measurement system is shown in the figure. Important! Write a MEMO during the measurement. The MEMO includes the name of the file, the physical or chemical parameter of the object, e.g. cheese fatness, and the class labels. Figure: a computer, a light source with a light probe, a spectrometer with a sensor probe, and the measured object; the data file stores the file name, MEMO, parameters, and class labels for samples 1..N.

9 Spectral data Spectra measured by a spectrometer are usually arranged as follows: the first row contains the wavelengths and the first column contains the sample numbers. The measured spectrum values given in the table cells correspond to the wavelengths, given in nanometers, and the spectrum numbers. Some spectrum values corrupted by noise are negative. The beginning and the end of the spectra contain mostly noise.

10 Spectral data The spectral values are obtained at intervals of about 0.27 nm over the measured wavelength range. The number of measurement points is 3648, which is unnecessarily high. The example shows how data may be arranged in the data file after measurement with a spectrometer. a) Data includes wavelengths and sample numbers. b) Data without wavelengths and sample numbers; in this case a vector of wavelengths should be kept in a separate file. Figure: in a) the matrix has wavelengths in the first row and sample numbers in the first column; in b) the matrix entries (values 1, 2, ..., 3648 per row) are spectrum values only, giving data X and labels Y.

11 Preprocessing The spectra are usually limited to a useful wavelength range. Smoothing the spectral signals and then downsampling reduces the noise inside the useful interval. After smoothing, each spectrum is described by a smaller number of values; in our example, 50 values. Figure: a tomato spectrum (transmittance, %, vs. wavelength) and its smoothed version; the input data matrix is reduced to the smoothed data matrix with 50 values per sample.
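A minimal sketch of this step (not from the slides, and not the SOLO toolbox): Savitzky-Golay smoothing followed by downsampling to about 50 values, assuming NumPy/SciPy and a synthetic spectrum standing in for a real measurement.

```python
# Minimal sketch: smooth one spectrum and downsample it to ~50 values.
# `spectrum` is a stand-in for a measured 3648-point transmittance curve.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
spectrum = 50 + 10 * np.sin(np.linspace(0, 3, 3648)) + rng.normal(0, 1, 3648)

smoothed = savgol_filter(spectrum, window_length=101, polyorder=3)  # noise reduction
downsampled = smoothed[::len(smoothed) // 50][:50]                  # keep 50 values

print(downsampled.shape)  # (50,)
```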

12 Preprocessing We return to the tomatoes for a while. We may see that tomatoes of different colors have different spectra, while tomatoes of the same color have similar spectra. This is good for classification.

13 Preprocessing, PCA PCA generates new features, i.e. principal components, which are linear combinations of the input components or variables. One may see that the number of components is reduced from 50 to 2. During the exercises we will discuss how to select the number of principal components. Thus, PCA is a technique for data compression. a) SOLO GUI set up for compression. b) Illustration of PCA: the smoothed data (variables 1, 2, ..., 50) is mapped onto two principal components (PCs), PC 1 and PC 2.
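A minimal sketch of the same compression with scikit-learn rather than SOLO; X_smoothed is an assumed (n_samples, 50) array of smoothed spectra.

```python
# Minimal sketch: compress 50 smoothed values per spectrum to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

X_smoothed = np.random.rand(120, 50)      # placeholder for real smoothed spectra

pca = PCA(n_components=2)
scores = pca.fit_transform(X_smoothed)    # (120, 2) PC scores
print(scores.shape, pca.explained_variance_ratio_)
```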

14 PCA. Feature selection. How many features or principal components should we use for classification? One way is to analyze the plot of cumulative variances, i.e. the sum of variances, for the training set. The place where the curve changes sharply suggests the number of first principal components to be used. For example, in the figure generated by the SOLO toolbox (cumulative variance captured, %, vs. principal component number), the first two PCs should be selected. However, very frequently such a point is not clearly seen. In addition, these components are not necessarily useful for classification.
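A minimal sketch of computing the cumulative-variance curve behind such a plot (the SOLO figure itself is assumed, not reproduced; the training matrix is a placeholder).

```python
# Minimal sketch: cumulative variance captured by the first principal components.
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(120, 50)         # placeholder training spectra

pca = PCA().fit(X_train)
cumvar = np.cumsum(pca.explained_variance_ratio_) * 100
for k, v in enumerate(cumvar[:5], start=1):
    print(f"{k} PCs capture {v:.1f}% of the variance")
```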

15 PCA. Feature selection. The other way, shown in the SOLO demo for PCA, is to analyze a set of plots: PC1 vs. PC2, PC1 vs. PC3, etc. Then one may select only those features or components which are most efficient for discriminating the classes. This way may also be inefficient in some applications. One may use probabilistic PCA and cross-validation to find the maximum log likelihood for the given training set; the largest log likelihood defines the number of principal components. In the exercises we will use the approach based on cumulative variances.

16 PCA The 2-dimensional feature space is spanned by the first two principal (or embedded) components (Figure). One may see all the tomatoes mapped onto this space. The density of the tomato population is shown by gray levels. Remember that we use two classes: green/yellow and orange/red.

17 PCA and classification The 2-dimensional feature space and the curve separating the two classes (Figure). This curve is determined during the classifier design stage. When a new test point arrives, it is mapped into one of the two regions related to the classes.

18 SVM. Linear discriminant functions and decision hyperplanes. We consider N samples, two linearly separable classes ω1 and ω2, and linear discriminant functions. The decision hyperplane in the l-dimensional feature space is g(x) = w^T x + w_0, where w = (w_1, w_2, ..., w_l)^T is a weight vector and w_0 is a threshold. The distance of a point x from the hyperplane g(x) = 0 is z = |g(x)| / ||w||. The discriminant function g(x) takes positive values on one side of the plane and negative values on the other side.
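A minimal numeric sketch of these two quantities; the weight vector, threshold, and test point below are invented for illustration.

```python
# Minimal sketch: evaluate g(x) = w^T x + w_0 and the distance z = |g(x)| / ||w||.
import numpy as np

w = np.array([2.0, 1.0])    # weight vector (made up)
w0 = -1.0                   # threshold (made up)
x = np.array([1.5, 0.5])    # test point

g = w @ x + w0
z = abs(g) / np.linalg.norm(w)
print(g, z)                 # the sign of g gives the side of the hyperplane
```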

19 SVM. Linearly separable classes. We scale w and w_0 so that g(x) at the points x_1 and x_2 is equal to 1 for ω1 and -1 for ω2. The points x_1 and x_2 are the points closest to the hyperplane. The margin is 2/||w||. In the figure, b = w_0. Minimizing the norm of the weight vector w makes the margin maximum. Figure is taken from [2].

20 SVM. Linearly separable classes. Our task is to compute the hyperplane, which means that we have to compute w and w_0. We introduce class labels y_i, with y_i = 1 for ω1 and y_i = -1 for ω2. Thus the task is: minimize J(w) = (1/2) ||w||^2 subject to y_i (w^T x_i + w_0) >= 1, i = 1, 2, ..., N, where N is the number of samples. This is a quadratic optimization task, and the constraints are linear inequalities.
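A minimal sketch of fitting such a maximum-margin linear classifier with scikit-learn (which solves this quadratic program internally); the toy data and the very large C, used here to approximate the hard-margin case, are assumptions.

```python
# Minimal sketch: hard-margin-like linear SVM on a tiny separable data set.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [2, 2], [2, 3]], dtype=float)
y = np.array([-1, -1, 1, 1])                 # class labels y_i

clf = SVC(kernel="linear", C=1e6)            # very large C ~ hard margin
clf.fit(X, y)
print(clf.coef_, clf.intercept_)             # w and w_0
print(clf.support_vectors_)                  # the support vectors
```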

21 SVM From a computational viewpoint it is better to represent this optimization problem in a dual form. The solution leads to the discriminant function g(x) = Σ_{i=1}^{n} y_i α_i k(x, x_i) + w_0, where the α_i are the coefficients of the support vectors, i = 1, 2, ..., n, and k(., .) is a kernel function, or kernel. In our case, i.e. the linear case, k(x, x_i) = <x, x_i>, where <., .> denotes the dot product.

22 SVM. Linearly nonseparable classes For linearly nonseparable classes we have three groups of samples, and new variables ξ_i, called slack variables, are introduced. For samples which are outside the margin and correctly classified, ξ_i = 0. For samples which are inside the margin and correctly classified, 0 < ξ_i <= 1. For samples which are misclassified, ξ_i > 1.

23 SVM. Linearly nonseparable classes Then the optimization task is: minimize J(w, w_0, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^{N} ξ_i subject to y_i (w^T x_i + w_0) >= 1 - ξ_i and ξ_i >= 0, i = 1, 2, ..., N, where C is a positive constant called the penalization term. It is used in the SOLO toolbox. This problem is again reformulated and solved in a dual form.

24 SVM. Nonlinear case To obtain the solution for the nonlinear case, we take the solution for the linear case and replace the linear kernel function with a nonlinear kernel function. For example, in the discriminant function obtained earlier, g(x) = Σ_{i=1}^{n} y_i α_i k(x, x_i) + w_0, we use a nonlinear kernel k(x, x_i).

25 SVM. Nonlinear case The most widely used kernel functions are:
Linear: k(x_i, x_j) = <x_i, x_j> = x_i^T x_j
Polynomial: k(x_i, x_j) = (a x_i^T x_j + r)^d, a > 0
Radial basis function (RBF), or exponential function: k(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0
The radial basis function (RBF) is used in the SOLO toolbox.
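A minimal sketch of the same classifier with the RBF kernel; the data and the γ and C values are illustrative only, not taken from the slides or SOLO.

```python
# Minimal sketch: RBF-kernel SVM, k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # a nonlinearly separable labeling

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print(clf.score(X, y))                              # training accuracy
```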

26 SVM. Nonlinear case. The nonlinear discriminant function and nonseparable classes. In the figure you may see three groups of samples: outside the margin and correctly classified, inside the margin and correctly classified, and misclassified (shown with a cross). Figure is taken from [2]. We considered C-support vector classifiers. There is also ν-support vector classification, which uses a parameter ν ∈ (0, 1]. This parameter is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. C-support and ν-support vector classifiers give equivalent results.

27 SVM. Conclusions In many cases the SVM classifier demonstrates high performance in comparison with other classifiers. SVM classifiers are used in many applications, including digit recognition, face recognition, medical imaging, and others. The disadvantage of SVM is that there is no technique for selecting the best kernel. When the kernel is chosen, an optimization still has to be solved to find the parameter values, for example the RBF parameter γ and the penalization term C. This optimization is done in the SOLO toolbox: Build Model and Optimization Results.
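A minimal sketch of selecting γ and C by cross-validated grid search; this is only an assumed stand-in for what SOLO's Optimization Results step does, and the data and grids are made up.

```python
# Minimal sketch: choose gamma and C for an RBF SVM by 5-fold grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

param_grid = {"gamma": [0.01, 0.1, 1, 10], "C": [0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```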

28 K Nearest Neighbor Classifier The KNN algorithm is simple and performs very well, and it is frequently used as a benchmark method. The KNN classifier is a nonlinear classifier. The nearest neighbor (k = 1) algorithm has an asymptotic classification error rate no worse than twice the Bayesian optimal error rate.

29 K Nearest Neighbor Classifier The KNN algorithm is as follows. Given a test sample and a distance measure: out of the N training vectors, find the k nearest neighbors; out of the k nearest neighbors, count the number of vectors k_i belonging to each class ω_i, i = 1, 2, ..., M; assign the test sample to the class ω_i with the maximum number k_i of samples. Figure: example with classes ω1 and ω2, k = 3, k_1 = 2, k_2 = 1; the test sample x is assigned to ω1.
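A minimal sketch of this rule with scikit-learn on made-up 2-D features (k = 3, two classes).

```python
# Minimal sketch: KNN classification with k = 3.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y_train = np.array([1, 1, 1, 2, 2, 2])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[0.5, 0.5], [3.5, 3.5]]))   # -> [1 2]
```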

30 K Nearest Neighbor Classifier Distance measures. Euclidean distance: the test vector is assigned to the class i if its Euclidean distance from the class mean point μ_i is the minimum among all classes, d_e = ||x - μ_i||. Mahalanobis distance: it is assumed that all classes have the same covariance matrix Σ; again we compute the distance for each class and select the minimum distance for the test sample, d_m = ((x - μ_i)^T Σ^{-1} (x - μ_i))^{1/2}.
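A minimal sketch of computing both distances from a test vector to a class mean; the vectors and the covariance matrix Σ below are invented for illustration.

```python
# Minimal sketch: Euclidean and Mahalanobis distances to a class mean mu.
import numpy as np

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # assumed common covariance matrix

d_e = np.linalg.norm(x - mu)                                   # Euclidean
d_m = np.sqrt((x - mu) @ np.linalg.inv(Sigma) @ (x - mu))      # Mahalanobis
print(d_e, d_m)
```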

31 K Nearest Neighbor Classifier Remarks. One drawback of the algorithm is that all training vectors must be stored. The other drawback is the complexity of searching for the nearest neighbors among the N training samples.

32 Cross-validation Using the training set to evaluate the performance of a classification system may give a poor estimate of the predictive performance. To evaluate the performance of a classification system for different parameter settings, one may use three data sets: training, validation, and test. After training, the performance is evaluated using the validation set, and the final evaluation is made using the test set. However, when the number of samples in the training and test sets is small, this approach cannot be used.

33 Cross-validation Cross-validation: the available data is divided into several groups, where only one group is used for test/validation and the remaining groups are used for training. This is repeated so that all of the data are used, and the results are then averaged. Figure: five iterations, each holding out a different group. When the data set is very small, the leave-one-out method is used, so the test/validation set consists of only one sample.
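A minimal sketch of both variants with scikit-learn: 5-fold cross-validation and leave-one-out, on placeholder data with a KNN model standing in for any classifier.

```python
# Minimal sketch: 5-fold cross-validation and leave-one-out for a classifier.
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=3)
print(cross_val_score(knn, X, y, cv=5).mean())              # 5 groups, averaged
print(cross_val_score(knn, X, y, cv=LeaveOneOut()).mean())  # leave-one-out
```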

34 Confusion matrix The performance of a classification system can be assessed using the confusion matrix B(i, j). The entry (i, j) gives the number of samples belonging to class i which are assigned to class j. In the example, 48 samples of class 1 are correctly classified and 2 samples of class 1 are misclassified; for class 2, 45 samples are correctly classified and 5 samples are misclassified; for class 3, all samples are correctly classified.
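A minimal sketch of building B(i, j) from true and predicted labels; the label vectors below are invented, not the slide's data.

```python
# Minimal sketch: confusion matrix from true and predicted class labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 3, 3]

B = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(B)   # row i, column j: samples of class i assigned to class j
```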

35 References
[1] Bishop C. M., Pattern Recognition and Machine Learning. Springer, 2006.
[2] Schölkopf B. and Smola A. J., Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[3] SOLO toolbox, Eigenvector Research.
[4] Theodoridis S. and Koutroumbas K., Pattern Recognition. Academic Press.

36 Questions
