Elastic Net Grouping Variable Selection Combined with Partial Least Squares Regression (EN-PLSR) for the Analysis of Strongly Multi-collinear Spectroscopic Data

GUANG-HUI FU, QING-SONG XU,* HONG-DONG LI, DONG-SHENG CAO, and YI-ZENG LIANG
School of Mathematical Sciences and Computing Technology, Central South University, Changsha 410083, P.R. China (G.-H.F., Q.-S.X.); and Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, P.R. China (H.-D.L., D.-S.C., Y.-Z.L.)

Received 19 July 2010; accepted 12 January 2011. * Author to whom correspondence should be sent. E-mail: qsxu@mail.csu.edu.cn. DOI: 10.1366/10-06069

In this paper a novel wavelength region selection algorithm, called elastic net grouping variable selection combined with partial least squares regression (EN-PLSR), is proposed for multi-component spectral data analysis. The EN-PLSR algorithm automatically selects successive, strongly correlated groups of prediction variables related to the response variable in two steps. First, a portion of the correlated predictors are selected and divided into subgroups by means of the grouping effect of elastic net estimation. Then, a recursive leave-one-group-out strategy is employed to further shrink the variable groups according to the root mean square error of cross-validation (RMSECV) criterion. The performance of the algorithm on real near-infrared (NIR) spectroscopic data sets shows that EN-PLSR is competitive with full-spectrum PLS and moving window partial least squares (MWPLS) regression and is suitable for strongly correlated spectroscopic data.

Index Headings: Elastic net; Grouping variable selection; Partial least squares regression; PLSR; Multi-collinear analysis; Near-infrared spectroscopy; NIR spectroscopic data; EN-PLSR; Wavelength region selection algorithm.

INTRODUCTION

In chemistry, multivariate calibration methods are widely applied for investigating multi-component spectroscopic data. A number of popular methods have been used to this end: (1) principal component regression (PCR),1,2 (2) partial least squares (PLS),3-10 (3) ridge regression,11 (4) artificial neural networks (ANN),12,13 and (5) the least absolute shrinkage and selection operator (LASSO).14 In addition, many generalized methods and algorithms based on these models have been proposed recently, such as the elastic net estimator,15 moving window partial least squares regression (MWPLS),16 Bayesian linear regression with variable selection (BLR-VS),17 and the competitive adaptive reweighted sampling (CARS) method for multivariate calibration.18

The LASSO method finds a least-squares solution under a constraint on the L1-norm of the regression coefficient vector. The key variables for LASSO can be selected step by step with the LARS algorithm.19 The elastic net is a generalization of LASSO and can be used more widely than LASSO in multi-component analysis because of its advantages over LASSO, for example the grouping effect, which is introduced in the following section.

One of the challenges in analyzing near-infrared spectral data in chemometrics is the "large p, small n" (p >> n) problem, where n is the number of observations and p is the number of variables (i.e., factors or predictors). Generally speaking, grouped variables should be of concern in large p and small n problems.20 Chemists often take many factors into consideration in order to measure properties with precision, whereas only a few observations can be obtained because of experimental difficulties.
Statisticians advise employing methods such as cross-validation (CV) and group variable methods when the sample size is small. Techniques for grouping variables have been studied extensively for large numbers of predictors.20-24 Another challenge is multi-collinearity, i.e., correlation among the predictors, because traditional multiple linear regression (MLR) fails in such situations. Thus, novel data analysis methods should be explored for this special case.

In this paper we focus on near-infrared (NIR) spectroscopic data, which are generally both multi-component and multi-collinear. By combining the elastic net with partial least squares regression, we propose a new algorithm, called EN-PLSR, to cope with this case. Following the data-driven principle in statistics, the EN-PLSR algorithm automatically obtains a number of strongly correlated variable groups through elastic net estimation, and recursive PLS calibration models are then built on these variable groups. The performance on real data sets shows that the EN-PLSR algorithm has good prediction accuracy and is particularly suitable for strongly correlated data.

The remainder of the paper is organized as follows: the Theory section offers the basic theory, including variable selection and the grouping effect of the elastic net. The Algorithm section gives the details of the EN-PLSR algorithm. The Experimental and Results sections describe the data sets used and give the experimental results and discussion, respectively. Finally, we give a summary and further discussion.

THEORY

Variable Selection. Let y = (y_1, y_2, ..., y_n)^T be the response vector and X = (x_1, x_2, ..., x_p) be the predictor matrix, where x_j = (x_{1j}, x_{2j}, ..., x_{nj})^T (j = 1, 2, ..., p) are the p predictors. For simplicity, we also suppose that the data are standardized so that every variable has zero mean and unit length, namely

\sum_{i=1}^{n} y_i = 0, \quad \sum_{i=1}^{n} x_{ij} = 0, \quad \sum_{i=1}^{n} x_{ij}^2 = 1, \quad j = 1, 2, \ldots, p    (1)
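As a minimal numerical illustration of the pretreatment in Eq. 1 (not part of the original paper; the array names are placeholders), each column is centered and scaled to unit Euclidean length:

```python
import numpy as np

def standardize(X, y):
    """Center y, and center each column of X and scale it to unit length (Eq. 1)."""
    y_c = y - y.mean()
    X_c = X - X.mean(axis=0)                  # zero mean for every predictor
    X_s = X_c / np.linalg.norm(X_c, axis=0)   # unit Euclidean length for every predictor
    return X_s, y_c

# Toy stand-in for an n x p NIR matrix and a property vector
rng = np.random.default_rng(0)
X_s, y_c = standardize(rng.normal(size=(60, 401)), rng.normal(size=60))
```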

FIG. 1. Two-dimensional LASSO penalty (dashed) and elastic net penalty (solid). The LASSO penalty region is convex, whereas the elastic net penalty region is strictly convex.

In multivariate calibration, one usually considers the following linear regression:

y = Xb + e = x_1 b_1 + x_2 b_2 + \cdots + x_p b_p + e    (2)

where b = (b_1, b_2, ..., b_p)^T is the regression coefficient vector and e is usually Gaussian noise, namely e ~ N(0, \sigma^2 I), where I is the identity matrix. Generally, the number of predictors p is very large compared with the number of observations n. Chemists therefore pay particular attention to seeking a sparse solution, for the sake of scientific insight into the predictor-response relationship. Sparse means that a large proportion of the elements of b are zero. The traditional way of solving the model in Eq. 2 is to minimize the residual sum of squares by ordinary least squares (OLS). However, OLS solutions are generally not sparse. Tibshirani proposed the LASSO14 by imposing an L1-norm penalty on ordinary least squares. The LASSO estimator solves the following optimization problem:

\hat{b} = \arg\min_{b} \|y - Xb\|_2^2    (3a)

subject to

\sum_{j=1}^{p} |b_j| \le t    (3b)

where t is a non-negative tuning parameter. The LASSO can perform regression and variable selection simultaneously: it shrinks the regression coefficients of unimportant predictors to exactly zero as the parameter t is tightened. This is because the L1-norm constraint in Eq. 3b is convex and singular at the origin25 (see Fig. 1), so the optimal point of the LASSO often lies exactly on a coordinate axis of the variable space (for example, point A in Fig. 1).

The Grouping Effect of Elastic Net. The elastic net,15 an improved version of the LASSO for analyzing high-dimensional data, is defined as follows:

\hat{b} = (1 + \lambda_2) \left\{ \arg\min_{b} \left[ \|y - Xb\|_2^2 + \lambda_2 \|b\|_2^2 + \lambda_1 \|b\|_1 \right] \right\}    (4)

where \lambda_1 and \lambda_2 are two non-negative parameters. If \lambda_2 = 0, the elastic net is exactly equivalent to the LASSO. The scale factor (1 + \lambda_2) should be changed to [1 + (\lambda_2 / n)] when the variables are not standardized to zero mean and unit L2-norm. As Eq. 4 shows, elastic net estimation can be viewed as a penalized least squares method whose penalty, \lambda_2 \|b\|_2^2 + \lambda_1 \|b\|_1, combines the LASSO and ridge penalties. Because the elastic net generalizes the LASSO, it can automatically perform variable selection. Moreover, elastic net estimation can handle data with large p and small n, which often arise in multi-component spectral analysis; in the extreme it can select all p variables, whereas the LASSO can select at most n predictors.

Another important property of the elastic net is the grouping effect. Generally speaking, if the regression coefficients of strongly correlated variables tend to be equal (in absolute value), the regression model is said to exhibit a grouping effect. In particular, when two variables are exactly identical, their coefficients in the regression model should be identical. Zou and Hastie15 have pointed out that two identical predictors have identical coefficients in the model of Eq. 4 whenever \lambda_2 > 0, because the elastic net penalty \lambda_2 \|b\|_2^2 + \lambda_1 \|b\|_1 is then strictly convex (see Fig. 1).

Suppose that the predictors V_i and V_j are strongly correlated. By the grouping effect of the elastic net, the regression coefficients b_i and b_j tend to be similar; that is, they are zero or non-zero together. In other words, the predictors V_i and V_j are selected or eliminated together. Suppose further that V_j is the predictor most correlated with V_i among all the predictors, and that in the current step the predictor added to the model of Eq. 2 is V_i (i.e., its regression coefficient b_i moves from zero to non-zero); then, theoretically, the variable added to the model of Eq. 2 in the next step tends to be V_j. This is the basis of the grouping strategy of the EN-PLSR algorithm for spectroscopic data. Generally, over the full wavelength range of spectroscopic data, two successive channels (predictors) are strongly correlated. Thus, they should enter or leave the model (Eq. 2) together, and they remain successive in the solution sequence of the elastic net.
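The grouping effect is easy to verify numerically. The short sketch below is ours, not the authors': it uses scikit-learn, whose ElasticNet(alpha, l1_ratio) parameterization is a rescaled form of the (\lambda_1, \lambda_2) penalty in Eq. 4, on a toy problem with two nearly identical predictors. The elastic net assigns them nearly equal coefficients, whereas a plain lasso tends to concentrate the weight on only one of the pair.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 50
z = rng.normal(size=n)
x1 = z + 0.01 * rng.normal(size=n)   # x1 and x2 are almost identical (strongly correlated)
x2 = z + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)              # an unrelated predictor
X = np.column_stack([x1, x2, x3])
y = 3.0 * z + 0.1 * rng.normal(size=n)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mixed L1/L2 penalty (grouping effect)
lasso = Lasso(alpha=0.1).fit(X, y)                     # pure L1 penalty

print("elastic net:", enet.coef_)   # coefficients of x1 and x2 come out nearly equal
print("lasso:      ", lasso.coef_)  # weight tends to fall mostly on one of the pair
```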

FIG. 2. Two-step variable shrinkage for multi-component, strongly correlated spectroscopic data.

FIG. 3. RMSECV curves of the full-spectrum PLS, MWPLS, and EN-PLSR methods for m5spec combined with the four responses of the corn data set.

EN-PLSR ALGORITHM

Motivation and Aim. Prediction accuracy is usually the first criterion for judging whether a regression model is a good one. In the large p, small n case, however, parsimony should also be considered: simpler models are preferred for the sake of scientific insight into the data structure and to reduce the computational burden. In addition, shrinking the model can often improve prediction accuracy.26 Methods that reduce model complexity are therefore extremely important. Single-variable forward stepwise selection or backward stepwise elimination is not a good strategy when the number of predictors is ultra-high.27 Group variable selection arises naturally in this situation. In this work, we propose a group variable selection method that combines the elastic net and PLSR.

TABLE I. Results for m5spec with the four responses (moisture, oil, protein, and starch) of the corn data set. No.LVs and No.Var are the number of PLS components and the number of selected variables, respectively.

Method    Quantity   Moisture   Oil      Protein   Starch
PLS       RMSECV     0.0116     0.069    0.1289    0.4607
          No.LVs     15         8        10        5
          No.Var     700        700      700       700
MWPLS     RMSECV     0.0367     0.017    0.1143    0.1206
          No.LVs     10         11       9         14
          No.Var     122        119      191       247
EN-PLSR   RMSECV     0.0009     0.0129   0.0131    0.0891
          No.LVs     14         9        12        10
          No.Var     40 (a)     69 (b)   90 (c)    80 (d)

(a) lambda_2 = 0.5, s = 0.42. (b) lambda_2 = 3000, s = 0.05. (c) lambda_2 = 200, s = 0.38. (d) lambda_2 = 100, s = 0.3.

TABLE II. Selected key predictors and the corresponding wavelength regions for the corn data set.

Method    Response   Selected predictors                                   Selected wavelength regions (nm)
PLS       moisture   1-700                                                 1100-2498
          oil        1-700                                                 1100-2498
          protein    1-700                                                 1100-2498
          starch     1-700                                                 1100-2498
MWPLS     moisture   131-177, 354-384, 442-485                             1360-1452, 1806-1866, 1982-2068
          oil        282-331, 569-637                                      1662-1760, 2236-2372
          protein    5-59, 264-280, 304-337, 419-447, 458-486, 515-541     1108-1216, 1626-1658, 1706-1772, 1936-1992, 2014-2070, 2128-2180
          starch     1-93, 261-338, 420-446, 456-483, 515-535              1100-1284, 1620-1774, 1938-1990, 2010-2064, 2128-2168
EN-PLSR   moisture   398-407, 482-511                                      1894-1912, 2062-2120
          oil        306-314, 497-526, 575-604                             1710-1726, 2092-2150, 2248-2306
          protein    319-348, 499-558                                      1736-1794, 2096-2214
          starch     304-307, 321-350, 411-426, 664-693                    1706-1712, 1740-1798, 1920-1950, 2426-2484

First, the predictors are preliminarily shrunk, and the highly collinear predictors that are retained are automatically grouped according to the grouping effect of the elastic net. Then, a leave-one-group-out strategy is applied to shrink the variables further. More details are given below.

Algorithm Description. The EN-PLSR algorithm contains two main steps. First, elastic net estimation is employed to shrink the predictors: the elastic net ranks the predictors automatically, yielding a variable sequence, and a portion of the variables (e.g., 40%, which can be decided by cross-validation) are selected as key variables. We have observed that, for collinear spectroscopic data, the key variables selected by the elastic net are piecewise successive in most cases. Under the constraint that no variable group may contain more than m elements (e.g., m = 30), the EN-PLSR algorithm treats every continuous predictor segment as a subgroup whose predictors are highly correlated, owing to the grouping effect of the elastic net. EN-PLSR then builds a PLSR model on the selected variable groups and computes its root mean square error of cross-validation (RMSECV_0). Second, a recursive leave-one-group-out strategy is employed to shrink the predictors further: a series of PLSR models are built, each using all the variable groups but one. The corresponding RMSECV_h values are calculated, and if the minimum RMSECV_h is less than RMSECV_0, the corresponding variable group is left out and the procedure is repeated. It is worth noting that the three tuning parameters lambda_1, lambda_2, and A should be validated beforehand. Pseudo-code for the EN-PLSR algorithm is supplied in the Appendix.

The EN-PLSR algorithm can thus be seen as a two-step variable shrinkage (see Fig. 2). First, some uninformative variables are automatically eliminated by the elastic net. Second, the recursive leave-one-group-out strategy further shrinks the variables through PLSR in the RMSECV sense.

Tuning Parameters. Three tuning parameters, the number of latent variables A, the L1-norm penalty lambda_1, and the L2-norm penalty lambda_2 (see Eq. 4), must be tuned for prediction accuracy. However, lambda_1 can be replaced by the fraction of the L1-norm, s, which always lies in [0, 1]. In this paper we choose lambda_2, s, and A as the tuning parameters and estimate them by traditional ten-fold cross-validation. In practice we set a grid over the three tuning parameters; for example, lambda_2 can take the following ten values: lambda_2 in {0.01, 0.1, 0.5, 1, 2, 5, 10, 100, 1000, 10000}. The range of s, [0, 1], is divided equally into 100 intervals, and the number of latent variables A runs from 1 to 15. Thus, we use 10 x 100 x 15 = 1.5 x 10^4 grid points to search for the optimal combination of model parameters in terms of RMSECV, which is defined as

\mathrm{RMSECV} = \left\{ \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in F_k} \left[ y_i - \hat{y}_i^{(-k)} \right]^2 \right\}^{1/2}    (5)

where y_i is the measured value of the ith sample (i = 1, 2, ..., n), F_k is the set of samples in the kth fold, and \hat{y}_i^{(-k)} is the value predicted for sample i by the model fitted with the kth fold left out.
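As an illustration of how the RMSECV of Eq. 5 can be computed in practice (a sketch under our own assumptions, using scikit-learn rather than any code from the paper), ten-fold cross-validated predictions of a PLSR model give RMSECV directly; scanning A = 1, ..., 15 then reproduces the grid dimension used for the number of latent variables:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def rmsecv_pls(X, y, n_components, n_splits=10, seed=0):
    """RMSECV (Eq. 5) of a PLSR model with a given number of latent variables."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    pls = PLSRegression(n_components=n_components)
    y_hat = cross_val_predict(pls, X, y, cv=cv).ravel()   # y_i predicted with its fold left out
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Hypothetical usage: X is an n x p spectral matrix, y a property vector.
# scores = {A: rmsecv_pls(X, y, A) for A in range(1, 16)}   # scan A = 1..15 as in the text
```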
EXPERIMENTAL

Data Set A: Corn. Data set A is taken from Ref. 28 and consists of 80 corn samples measured on three different NIR spectrometers. The wavelength range is 1100-2498 nm at 2 nm intervals, so each spectrum contains 700 predictors (variables). The three instruments are called m5, mp5, and mp6, and the corresponding predictor matrices are m5spec, mp5spec, and mp6spec, respectively. The predictors are generally strongly correlated: taking m5spec as an example, 93.4% of the variables have correlation coefficients greater than 0.92, and 49.4% have correlation coefficients greater than 0.99. The moisture, oil, protein, and starch values of the samples are included as response variables and stored in the response matrix propvals. In this study, we use m5spec as the predictor matrix together with the four responses to investigate the performance of the EN-PLSR algorithm.

Data Set B: Gasoline. The second data set, from Ref. 29, is another NIR data set containing the NIR spectra and octane numbers of 60 gasoline samples. The NIR spectra were measured in diffuse reflection as log(1/R) from 900 nm to 1700 nm at 2 nm intervals, giving 401 wavelengths (variables).

RESULTS AND DISCUSSION

Data Set A: Corn. In this experiment, we used m5spec as the predictor matrix together with the four responses to build four models for investigating the EN-PLSR algorithm. Each predictor and response was first standardized to zero mean and unit length, and the maximal size of each variable subgroup was set to 30. The results of full-spectrum PLS and MWPLS16 are also shown for comparison.

Figure 3 and Table I summarize the RMSECV values for the four responses. We set the upper limit of the number of latent variables (LVs) to 15.

FIG. 4. The EN-PLSR algorithm on m5spec of corn data set A with the four responses. The finally selected variable groups (selected regions) are (a) 1894-1912 nm and 2062-2120 nm; (b) 1710-1726 nm, 2092-2150 nm, and 2248-2306 nm; (c) 1736-1794 nm and 2096-2214 nm; and (d) 1706-1712 nm, 1740-1798 nm, 1920-1950 nm, and 2426-2484 nm.

The RMSECV decreases as the number of LVs increases, and a clear minimum for choosing the number of LVs is not evident (or does not exist) for the four responses. Because a large number of LVs is not a wise choice, we use the criterion introduced in Ref. 8: a component is judged significant if the ratio RMSECV_a / RMSECV_{a-1} is smaller than 0.95, where RMSECV_a is the RMSECV obtained with a LVs (a = 2, ..., 15). According to this criterion, the numbers of LVs for the four responses are 14, 9, 12, and 10, respectively. The EN-PLSR algorithm clearly has the lowest RMSECV of the three methods (see Table I). For the responses moisture and protein, the RMSECV also decreases faster with increasing LVs than for the other two methods (see Fig. 3).

The selected wavelength regions and predictors for the four responses are shown in Table II and Fig. 4. Compared with MWPLS, EN-PLSR selects far fewer variables for each response. This matters in practice because the final regression model is more parsimonious. It is also interesting that some of the wavelength bands selected by the EN-PLSR algorithm are chemically meaningful. For example, the 2062-2120 nm band can be ascribed to water absorption30 and to a combination band of the O-H bond.16 As is well known, band assignment is difficult because, to the best of our knowledge, each wavelength of an NIR spectrum is influenced by a number of compounds and/or different vibration modes. Nevertheless, other wavelength bands selected by EN-PLSR are also chemically meaningful according to Table 1 in the appendix of Ref. 31.

TABLE III. RMSECV, selected key predictors, and corresponding wavelength regions for the gasoline data set.

Method       RMSECV   No. LVs   No. Var   Selected predictors                                           Selected wavelength regions (nm)
PLS          0.210    7         401       1-401                                                         900-1700
MWPLSR       0.198    5         180       107-179, 202-275, 342-374                                     1112-1256, 1302-1448, 1582-1646
EN-PLSR (a)  0.170    10        69        12-13, 115-129, 149-178, 228-234, 336-337, 361-362, 381-391   922-924, 1128-1156, 1196-1254, 1354-1366, 1570-1572, 1620-1622, 1660-1680

(a) lambda_2 = 200, s = 0.32.

FIG. 5. The EN-PLSR algorithm on gasoline data set B. The finally selected variable groups are 922-924, 1128-1156, 1196-1254, 1354-1366, 1570-1572, 1620-1622, and 1660-1680 nm.

FIG. 6. RMSECV curves of the full-spectrum PLS, MWPLS, and EN-PLSR methods for gasoline data set B.

Data Set B: Gasoline. We applied the same data pretreatment and criteria to data set B as to data set A, except that the optimal number of LVs was chosen as the one giving the minimum RMSECV. Seven variable groups with 69 predictors in total were finally selected as key variables (see Fig. 5). The RMSECV values of the full-spectrum PLS, MWPLS, and EN-PLSR methods are given in Fig. 6 and Table III. The RMSECV values of MWPLS and EN-PLSR are close; however, EN-PLSR selects far fewer key variables than MWPLS. We also note that the variables in each group are strongly correlated, which indirectly supports our algorithm.

CONCLUSIONS

Variable selection for multi-collinear, strongly correlated data with few observations and many predictors is an important issue in chemometrics, especially in spectroscopic analysis. In this paper we propose a new method, EN-PLSR, to deal with multi-collinear NIR spectroscopic data. The EN-PLSR algorithm is a variable grouping (or interval) selection method that automatically subgroups the candidate variables and yields a parsimonious model while retaining as much of the relevant information as possible. The predictors in every group selected by the EN-PLSR algorithm are successive and strongly correlated. In the Experimental section we evaluated the algorithm on NIR spectroscopic data, but EN-PLSR can be applied in a wide variety of fields, for example ultraviolet-visible (UV-Vis) and mid-infrared (MIR) spectroscopic data analysis. To obtain the optimal solution, three tuning parameters, the L1- and L2-norm penalty coefficients of the elastic net and the number of components in PLSR, should be validated carefully.

ACKNOWLEDGMENTS

I would like to express my gratitude to Prof. Rolf Manne, University of Bergen, Norway, for his advice on this paper. This work was financially supported by the National Natural Science Foundation of P.R. China (Grants No. 10771217 and No. 20875104) and the international cooperation project on traditional Chinese medicines of the Ministry of Science and Technology of China (Grant No. 2007DFA40680). The studies meet with the approval of the university's review board.

1. T. Naes and H. Martens, J. Chemom. 2, 155 (1988).
2. M. K. Hartnett, G. Lightbody, and G. W. Irwin, Chemom. Intell. Lab. Syst. 40, 215 (1998).
3. P. Geladi and B. R. Kowalski, Anal. Chim. Acta 185, 1 (1986).
4. A. Höskuldsson, J. Chemom. 2, 211 (1988).
5. F. Lindgren, P. Geladi, and S. Wold, J. Chemom. 7, 45 (1993).
6. S. de Jong and C. J. F. ter Braak, J. Chemom. 8, 169 (1994).
7. B. S. Dayal and J. F. MacGregor, J. Chemom. 11, 73 (1997).
8. S. Wold, M. Sjöström, and L. Eriksson, Chemom. Intell. Lab. Syst. 58, 109 (2001).
9. Q.-S. Xu, Y.-Z. Liang, and H.-L. Shen, J. Chemom. 15, 135 (2001).
10. H. Abdi, Wiley Interdiscip. Rev.: Comput. Stat. 2, 97 (2010).
11. E. Vigneau, M. F. Devaux, E. M. Qannari, and P. Robert, J. Chemom. 11, 239 (1997).

12. F. Despagne, D.-L. Massart, and P. Chabot, Anal. Chem. 72, 1657 (2000).
13. H. H. Thodberg, IEEE Trans. Neural Networks 7, 56 (1996).
14. R. Tibshirani, J. R. Statist. Soc. B 58, 267 (1996).
15. H. Zou and T. Hastie, J. R. Statist. Soc. B 67, 301 (2005).
16. J.-H. Jiang, R. J. Berry, H. W. Siesler, and Y. Ozaki, Anal. Chem. 74, 3555 (2002).
17. T. Chen and E. Martin, Anal. Chim. Acta 631, 13 (2009).
18. H.-D. Li, Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, Anal. Chim. Acta 648, 77 (2009).
19. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Ann. Statist. 32, 407 (2004).
20. M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, Jr., J. R. Marks, and J. R. Nevins, Proc. Natl. Acad. Sci. U.S.A. 98, 11462 (2001).
21. T. Hastie, R. Tibshirani, M. Eisen, A. Alizadeh, D. Ross, U. Scherf, J. Weinstein, R. Levy, L. Staudt, W. C. Chan, D. Botstein, and P. Brown, Genome Biol. 1, 1 (2000).
22. R. Díaz-Uriarte, Technical Report, Spanish National Cancer Center (2003).
23. M. Dettling and P. Bühlmann, J. Multivariate Anal. 90, 106 (2004).
24. M. R. Segal, K. D. Dahlquist, and B. R. Conklin, J. Comput. Biol. 10, 961 (2003).
25. J. Fan and R. Li, J. Am. Statist. Assoc. 96, 1348 (2001).
26. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning (Springer-Verlag, New York, 2001).
27. J. Fan and J. Lv, J. R. Statist. Soc. B (Statistical Methodology) 70, 849 (2008).
28. http://software.eigenvector.com/data/index.html
29. J. H. Kalivas, Chemom. Intell. Lab. Syst. 37, 255 (1997).
30. D. Jouan-Rimbaud, D.-L. Massart, R. Leardi, and O. E. de Noord, Anal. Chem. 67, 4295 (1995).
31. H. W. Siesler, Y. Ozaki, S. Kawata, and H. M. Heise, Eds., Near-Infrared Spectroscopy: Principles, Instruments, Applications (Wiley-VCH, Weinheim, Germany, 2002).

APPENDIX

The Pseudo-code of the EN-PLSR Algorithm.

Input: predictor matrix X = (x_1, x_2, ..., x_p), response vector y in R^(n x 1), maximal group size m = 30; set d = 1.

1. Use the elastic net to select a variable sequence; denote the indices of the selected variables, in order, by a_1, a_2, ..., a_{p1} (p1 <= p). Set G_d = {x_{a_1}}.
2. for i = 2 to p1 do
       if (a_i - a_{i-1} = 1 and card(G_d) < m) then G_d = {G_d, x_{a_i}};
       else d = d + 1, G_d = {x_{a_i}};
   end
3. Compute RMSECV_0 of the PLSR model built on all d groups.
4. for j = 1 to d do
       compute RMSECV_j of the PLSR model built on G_1, ..., G_{j-1}, G_{j+1}, ..., G_d;
   end
   h = arg min_j {RMSECV_1, ..., RMSECV_d}.
5. while RMSECV_h < RMSECV_0 do
       leave G_h out of the d groups; d = d - 1; RMSECV_0 = RMSECV_h;
       repeat step 4;
   end

Output: RMSECV_0 and the selected groups {G_1, G_2, ..., G_d}.

Note: card(G_d) denotes the number of elements of the set G_d.
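For readers who prefer executable code, the following Python sketch mirrors the pseudo-code above under our own assumptions: it uses scikit-learn's ElasticNet for step 1 and PLSRegression with ten-fold cross-validation for the RMSECV computations. The (alpha, l1_ratio) parameterization and the helper names are ours, not the authors', and the tuning of lambda_2, s, and A over the grid described under Tuning Parameters is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def rmsecv(X, y, n_components, n_splits=10, seed=0):
    """RMSECV (Eq. 5) of a PLSR model built on the columns of X."""
    n_comp = min(n_components, X.shape[1])   # PLS cannot use more LVs than variables
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    y_hat = cross_val_predict(PLSRegression(n_components=n_comp), X, y, cv=cv).ravel()
    return np.sqrt(np.mean((y - y_hat) ** 2))

def en_plsr(X, y, alpha=0.1, l1_ratio=0.5, m=30, n_components=10):
    """Two-step EN-PLSR sketch: elastic net grouping, then leave-one-group-out PLSR."""
    # Step 1: elastic net selects predictors; keep the non-zero coefficients as key variables.
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000).fit(X, y)
    idx = np.flatnonzero(enet.coef_)          # ordered indices of the selected predictors
    if idx.size == 0:
        raise ValueError("elastic net selected no variables; decrease alpha")
    # Split the ordered indices into consecutive segments containing at most m predictors.
    groups, current = [], [idx[0]]
    for a_prev, a_i in zip(idx[:-1], idx[1:]):
        if a_i - a_prev == 1 and len(current) < m:
            current.append(a_i)
        else:
            groups.append(current)
            current = [a_i]
    groups.append(current)
    # Step 2: recursive leave-one-group-out elimination in the RMSECV sense.
    cols = lambda gs: np.concatenate(gs)
    rmsecv0 = rmsecv(X[:, cols(groups)], y, n_components)
    while len(groups) > 1:
        scores = [rmsecv(X[:, cols(groups[:j] + groups[j + 1:])], y, n_components)
                  for j in range(len(groups))]
        h = int(np.argmin(scores))
        if scores[h] >= rmsecv0:
            break                              # no single-group removal improves RMSECV
        rmsecv0 = scores[h]
        groups.pop(h)                          # drop the group whose removal helps most
    return groups, rmsecv0
```

In a complete implementation, the elastic net penalties (lambda_2 and the L1 fraction s) and the number of PLS components A would themselves be chosen by the ten-fold cross-validation grid described under Tuning Parameters; here they are fixed arguments for clarity.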