Elastic Net Grouping Variable Selection Combined with Partial Least Squares Regression (EN-PLSR) for the Analysis of Strongly Multi-collinear Spectroscopic Data

GUANG-HUI FU, QING-SONG XU,* HONG-DONG LI, DONG-SHENG CAO, and YI-ZENG LIANG
School of Mathematical Sciences and Computing Technology, Central South University, Changsha, P.R. China (G.-H.F., Q.-S.X.); and Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha, P.R. China (H.-D.L., D.-S.C., Y.-Z.L.)

(Received 19 July 2010; accepted 12 January. * Author to whom correspondence should be sent. E-mail: qsxu@mail.csu.edu.cn.)

In this paper a novel wavelength region selection algorithm, called elastic net grouping variable selection combined with partial least squares regression (EN-PLSR), is proposed for multi-component spectral data analysis. The EN-PLSR algorithm automatically selects successive, strongly correlated predictor groups related to the response variable in two steps. First, a portion of the correlated predictors are selected and divided into subgroups by means of the grouping effect of elastic net estimation. Then, a recursive leave-one-group-out strategy is employed to further shrink the variable groups in terms of the root mean square error of cross-validation (RMSECV) criterion. The performance of the algorithm on real near-infrared (NIR) spectroscopic data sets shows that EN-PLSR is competitive with full-spectrum PLS and moving window partial least squares (MWPLS) regression and is suitable for strongly correlated spectroscopic data.

Index Headings: Elastic net; Grouping variable selection; Partial least squares regression; PLSR; Multi-collinear analysis; Near-infrared spectroscopy; NIR spectroscopic data; EN-PLSR; Wavelength region selection algorithm.

INTRODUCTION

In chemistry, multivariate calibration methods are widely applied for investigating multi-component spectroscopic data. So far, a number of popular methods have been utilized to achieve this goal: (1) principal component regression (PCR),1,2 (2) partial least squares (PLS),3-10 (3) ridge regression,11 (4) artificial neural networks (ANN),12,13 and (5) the least absolute shrinkage and selection operator (LASSO).14 In addition, a large number of generalized methods or algorithms based on the above models have been proposed recently, such as the elastic net estimator,15 moving window partial least squares regression (MWPLS),16 Bayesian linear regression with variable selection (BLR-VS),17 and the competitive adaptive reweighted sampling (CARS) method for multivariate calibration.18

The LASSO method finds a least-squares solution subject to a constraint on the L1-norm of the regression coefficient vector. The key variables for LASSO can be selected step by step according to the LARS algorithm.19 Elastic net is a generalization of LASSO and can be used more widely than LASSO in multi-component analysis because of its advantages over LASSO, for example the grouping effect, which is introduced in the following section.

One of the challenges in the analysis of near-infrared spectral data in chemometrics is the "large p, small n" problem, where n is the number of observations and p is the number of variables (i.e., factors or predictors). Generally speaking, grouped variables should be of concern in large p and small n problems.20 Chemists often take many factors into consideration in order to measure properties with precision, whereas only a few observations can be obtained owing to experimental difficulties.
Statisticians advise employing methods such as cross-validation (CV) and group variable methods when the sample size is small. Techniques for grouping variables have been studied extensively for large-scale sets of predictors.

Another challenge is multi-collinearity, or correlation among the predictors, because traditional multiple linear regression (MLR) fails in such situations. Thus, novel data analysis methods should be explored for this special case. In this paper we focus on near-infrared (NIR) spectroscopic data, which are generally multi-component as well as multi-collinear. By combining elastic net with partial least squares regression, we propose a new algorithm, called EN-PLSR, to cope with this case. Following the data-driven principle in statistics, the EN-PLSR algorithm automatically obtains a number of strongly correlated variable groups through elastic net estimation, and recursive PLS calibration models are subsequently built on these variable groups. The performance on real data sets shows that the EN-PLSR algorithm has good prediction accuracy and is particularly suitable for strongly correlated data.

The remainder of the paper is organized as follows: the Theory section offers the basic theory, including variable selection and the grouping effect of the elastic net. The Algorithm section gives the details of the EN-PLSR algorithm. The Experimental and Results sections describe the data sets used and give the experimental results and discussion, respectively. Finally, we give a summary and further discussion.

THEORY

Variable Selection. Let y = (y_1, y_2, ..., y_n)^T be the response vector and X = (x_1, x_2, ..., x_p) be the predictor matrix, where x_j = (x_{1j}, x_{2j}, ..., x_{nj})^T (j = 1, 2, ..., p) are the p predictors. For simplicity, we also suppose that the data are standardized to have zero mean and unit length for every variable, namely,

\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} x_{ij}^{2} = 1, \qquad j = 1, 2, \ldots, p    (1)
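As a concrete illustration of the standardization in Eq. 1, the short Python sketch below (our own, not part of the original paper; the function name standardize and the simulated data are ours) centers the response and each predictor column and scales every predictor column to unit Euclidean length:

```python
import numpy as np

def standardize(X, y):
    """Center y and the columns of X, and scale each column of X to unit L2 norm (Eq. 1)."""
    y_c = y - y.mean()                      # sum_i y_i = 0
    X_c = X - X.mean(axis=0)                # sum_i x_ij = 0 for every j
    norms = np.linalg.norm(X_c, axis=0)     # column lengths ||x_j||_2
    X_s = X_c / norms                       # sum_i x_ij^2 = 1 for every j
    return X_s, y_c, norms

# Example with random data standing in for an NIR predictor matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 700))   # n = 80 samples, p = 700 wavelength channels
y = rng.normal(size=80)
X_s, y_c, _ = standardize(X, y)
print(np.allclose(X_s.sum(axis=0), 0), np.allclose((X_s**2).sum(axis=0), 1))
```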

In multivariate calibration, one usually considers the following linear regression model:

y = X\beta + e = x_1 \beta_1 + x_2 \beta_2 + \cdots + x_p \beta_p + e    (2)

where \beta = (\beta_1, \beta_2, ..., \beta_p)^T is the regression coefficient vector and e is usually Gaussian noise, namely e ~ N(0, \sigma^2 I), where I is the identity matrix. Generally, the number of predictors p is very large compared with the number of observations n. Thus, chemists pay particular attention to seeking a sparse solution, for the sake of scientific insight into the predictor-response relationship. Sparse means that a large proportion of the elements of \beta are zero. The traditional way of solving the model in Eq. 2 is to minimize the residual sum of squares by ordinary least squares (OLS). However, the OLS solution is generally not sparse. Tibshirani proposed the LASSO14 by imposing an L1-norm penalty on ordinary least squares. The LASSO estimator solves the following optimization problem:

\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2    (3a)

subject to

\sum_{j=1}^{p} |\beta_j| \le t    (3b)

where t is a non-negative tuning parameter. The LASSO can automatically perform regression and variable selection simultaneously. It shrinks the regression coefficients of the unimportant predictors to exactly zero by controlling the parameter t. This is because the L1-norm constraint \sum_{j=1}^{p} |\beta_j| \le t is convex and singular at the origin25 (see Fig. 1), and the optimal point of LASSO is often located exactly on a coordinate axis of the variable space (for example, point A in Fig. 1).

FIG. 1. Two-dimensional LASSO penalty (dashed) and elastic net penalty (solid). The LASSO penalty region is convex and the elastic net penalty region is strictly convex.

The Grouping Effect of Elastic Net. Elastic net,15 an improved version of LASSO for analyzing high-dimensional data, is defined as follows:

\hat{\beta} = (1 + \lambda_2) \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \right\}    (4)

where \lambda_1 and \lambda_2 are two non-negative parameters. If \lambda_2 = 0, elastic net is exactly equivalent to LASSO. The scale factor (1 + \lambda_2) should be changed to [1 + (\lambda_2/n)] when the variables are not standardized to have mean zero and L2-norm one. As shown in Eq. 4, elastic net estimation can be viewed as a penalized least squares method. The penalty \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 is a convex combination of the LASSO and ridge penalties. Elastic net is a generalization of LASSO, so it can automatically perform variable selection. Moreover, elastic net estimation can handle data with large p and small n, which often arise in multi-component spectral analysis. In the extreme case it can even select all p variables, whereas LASSO can select at most n predictors.

Another important property of elastic net is the grouping effect. Generally speaking, if the regression coefficients of strongly correlated variables tend to be equal (in the absolute value sense), the regression model exhibits a grouping effect. In particular, when two variables are exactly identical, their coefficients in the regression model should be identical. Hui Zou et al. have pointed out that two completely identical predictors have identical coefficients in the model of Eq. 4 when \lambda_2 > 0, owing to the strictly convex elastic net penalty \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 (see Fig. 1). Suppose that the predictors V_i and V_j are strongly correlated. According to the grouping effect of elastic net, the regression coefficients \beta_i and \beta_j tend to be similar, that is, they are zero or non-zero together. In other words, the predictors V_i and V_j are selected or eliminated together. Assuming further that V_j is the variable most correlated with V_i among all predictors, and that in the current step the predictor added to the model of Eq. 2 is V_i (i.e., its regression coefficient \beta_i changes from zero to non-zero), then theoretically the variable added to the model in the next step tends to be V_j. This is the basis of the grouping strategy of the EN-PLSR algorithm for spectroscopic data. Generally speaking, over the full wavelength range of spectroscopic data, two successive channels (predictors) are strongly correlated. Thus, they should enter or leave the model (Eq. 2) together. Furthermore, they remain successive in the solution sequence of the elastic net.

FIG. 2. The variable shrinkage by two steps with multi-component and strongly correlated spectroscopic data.
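To make the grouping effect described above concrete, the following sketch (a minimal illustration of ours, not code from the paper) fits scikit-learn's ElasticNet and Lasso estimators, whose penalties are in the same spirit as Eq. 4 and Eqs. 3a-3b, to a pair of nearly identical predictors. With a non-zero ridge part the two correlated predictors tend to receive similar coefficients, whereas the pure L1 penalty typically keeps only one of them:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
n = 50
z = rng.normal(size=n)
# Two almost identical (strongly correlated) predictors plus one noise predictor
x1 = z + 0.01 * rng.normal(size=n)
x2 = z + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3.0 * z + 0.1 * rng.normal(size=n)

# Elastic net (L1 + L2 penalty): the correlated pair tends to get similar coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
# Pure LASSO (L1 only): typically keeps just one of the correlated pair
lasso = Lasso(alpha=0.1).fit(X, y)

print("elastic net coefficients:", np.round(enet.coef_, 3))
print("lasso coefficients:      ", np.round(lasso.coef_, 3))
```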

FIG. 3. The curves of RMSECV for the full-spectrum PLS, MWPLS, and EN-PLSR methods for m5spec combined with the four responses of the corn data.

EN-PLSR ALGORITHM

Motivation and Aim. Generally speaking, prediction accuracy is used to judge whether a regression model is a good one. But in the large p, small n case, parsimony should also be considered: simpler models are preferred for the sake of scientific insight into the data structure and to alleviate the computational burden. In addition, shrinking the model can often improve the prediction accuracy.26 A method for reducing model complexity is therefore extremely important. Single-variable forward stepwise selection or backward stepwise elimination is not a good strategy when the number of predictors is ultra-high.27 Group variable selection arises naturally in this situation. In this work, we propose a group variable selection method that combines elastic net and PLSR. First, the predictors are preliminarily shrunk, and the highly collinear predictors that are kept are automatically grouped according to the grouping effect of elastic net. Then, a leave-one-group-out strategy is followed to further shrink the variables. More details are given below.

TABLE I. Results for m5spec with the four responses (moisture, oil, protein, and starch) of the corn data set, comparing full-spectrum PLS, MWPLS, and EN-PLSR. No. LVs and No. Var are the number of PLS components and the number of selected variables, respectively. For the EN-PLSR models, No. Var was 40, 69, 90, and 80 for moisture, oil, protein, and starch, obtained with lambda_2 = 0.5, 3000, 200, and 100, respectively.

TABLE II. The selected key predictors and the corresponding wavelength regions (nm) for the PLS, MWPLS, and EN-PLSR methods and the four corn responses.

Algorithm Description. The EN-PLSR algorithm contains two main steps. First, elastic net estimation is employed to shrink the predictors. The predictors are ranked automatically by elastic net estimation, giving a variable sequence, and a portion of the variables (e.g., 40%, which can be decided by cross-validation) are selected as key variables. We have noted that, for collinear spectroscopic data, the key variables selected by elastic net are in most cases piecewise successive. Under the constraint that the number of elements in any variable group cannot exceed some constant m (e.g., m = 30), the EN-PLSR algorithm treats every continuous predictor segment as a subgroup whose predictors are highly correlated, thanks to the grouping effect of elastic net. EN-PLSR then builds a PLSR model using the selected variable groups and computes its root mean square error of cross-validation (RMSECV_0). A recursive leave-one-group-out strategy is employed to further shrink the predictors in this step. In other words, a series of PLSR models are built using all the variable groups but one. The corresponding RMSECV_h is calculated, and if the minimum of RMSECV_h is less than RMSECV_0, the corresponding variable group is left out and the procedure is repeated. It is worth noting that the three tuning parameters lambda_1, lambda_2, and A should be validated beforehand. Pseudo-code for the EN-PLSR algorithm is supplied in the Appendix.

The EN-PLSR algorithm can thus be seen as a two-step variable shrinkage (see Fig. 2). First, some uninformative variables are automatically eliminated by elastic net. Second, the recursive leave-one-group-out strategy further shrinks the variables by PLSR in the RMSECV sense.

Tuning Parameters. Three tuning parameters, the number of latent variables A, the L1-norm penalty lambda_1, and the L2-norm penalty lambda_2 (see Eq. 4), must be tuned for prediction accuracy. In practice, lambda_1 can be replaced by the fraction of the L1-norm (s), which always lies within [0, 1]. In this paper we choose lambda_2, s, and A as tuning parameters and estimate them by the traditional ten-fold cross-validation method. In practice, we set a grid for the three tuning parameters; for example, lambda_2 takes the ten values {0.01, 0.1, 0.5, 1, 2, 5, 10, 100, 1000, 10000}, the range [0, 1] of s is divided equally into 100 intervals, and the number of latent variables A runs from 1 to 15. Thus, we search a grid of 10 x 100 x 15 combinations for the optimal model parameters in terms of RMSECV, which is defined as

RMSECV = \left\{ \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \text{fold } k} \left[ y_i - \hat{y}_i^{(-k)} \right]^2 \right\}^{1/2}    (5)

where y_i is the measured value of the i-th (i = 1, 2, ..., n) sample and \hat{y}_i^{(-k)} is the predicted value obtained by leaving the k-th fold of samples out.
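The RMSECV of Eq. 5 can be computed with a standard K-fold loop. The sketch below (our own minimal illustration using scikit-learn's PLSRegression and KFold; the helper name rmsecv_pls and the simulated data are ours) evaluates a PLSR model with a given number of latent variables, which is the inner building block of the grid search over lambda_2, s, and A described above:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def rmsecv_pls(X, y, n_lv, n_folds=10, seed=0):
    """Root mean square error of cross-validation (Eq. 5) for a PLSR model with n_lv latent variables."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    sq_err = 0.0
    for train_idx, test_idx in kf.split(X):
        pls = PLSRegression(n_components=n_lv)
        pls.fit(X[train_idx], y[train_idx])
        y_hat = pls.predict(X[test_idx]).ravel()   # predictions with the k-th fold left out
        sq_err += np.sum((y[test_idx] - y_hat) ** 2)
    return np.sqrt(sq_err / len(y))

# Example: RMSECV curve over 1..15 latent variables on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 120))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=80)
curve = [rmsecv_pls(X, y, a) for a in range(1, 16)]
print(np.round(curve, 3))
```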
EXPERIMENTAL

Data Set A: Corn. Data set A is taken from Ref. 28 and consists of 80 corn samples measured on three different NIR spectrometers. The spectra are recorded at 2 nm intervals and contain 700 predictors (variables) per instrument; the three instruments, called m5, mp5, and mp6, give three predictor matrices called m5spec, mp5spec, and mp6spec, respectively. The predictors are generally strongly correlated: taking m5spec as an example, 93.4% of the variables have correlation coefficients above 0.92, and 49.4% exceed an even higher threshold. The moisture, oil, protein, and starch values of each sample are included as response variables and stored in the response matrix propvals. In this study, we employ m5spec as the predictor matrix together with the four responses to investigate the performance of the EN-PLSR algorithm.

Data Set B: Gasoline. The second data set, from Ref. 29, is another NIR spectral data set consisting of NIR spectra and octane numbers of 60 gasoline samples. The NIR spectra were measured in diffuse reflection as log(1/R) from 900 nm to 1700 nm at 2 nm intervals, giving 401 wavelengths (variables).
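Both data sets are strongly collinear in this sense. A quick way to check such correlation structure on any n x p predictor matrix is the sketch below (ours, not from the paper; the function name and the simulated spectra-like matrix are our own assumptions):

```python
import numpy as np

def fraction_correlated(X, threshold):
    """Fraction of distinct predictor pairs with |correlation| above the threshold."""
    corr = np.corrcoef(X, rowvar=False)          # p x p correlation matrix
    iu = np.triu_indices_from(corr, k=1)         # upper triangle, excluding the diagonal
    return np.mean(np.abs(corr[iu]) > threshold)

# Example with a smooth, spectra-like simulated matrix (neighbouring channels correlate strongly)
rng = np.random.default_rng(0)
base = np.cumsum(rng.normal(size=(60, 200)), axis=1)   # random-walk "spectra"
X = base + 0.05 * rng.normal(size=base.shape)
print(fraction_correlated(X, 0.92))
```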

RESULTS AND DISCUSSION

Data Set A: Corn. In this experiment, we used m5spec as the predictor matrix together with the four responses to build four models for investigating the EN-PLSR algorithm. Each predictor and response was first standardized to zero mean and unit length. The maximal size of each variable subgroup was set to 30. The results of full-spectrum PLS and MWPLS16 are also shown for comparison.

Figure 3 and Table I describe the RMSECV for the four responses. We set the upper limit of the number of latent variables (LVs) to 15. The RMSECV decreases as the number of LVs increases, and a clear minimum for choosing the number of LVs is not apparent (or does not exist) for the four responses. Since a large number of LVs is not a wise choice, we use the criterion introduced in Ref. 8: a component is judged significant if the ratio RMSECV_a / RMSECV_{a-1} is smaller than 0.95 in this experiment, where RMSECV_a is the RMSECV obtained with a (a = 2, ..., 15) LVs. According to this criterion, the numbers of LVs for the four response models are 14, 9, 12, and 10, respectively. The EN-PLSR algorithm clearly has the lowest RMSECV of the three methods (see Table I). For the responses moisture and protein, the RMSECV also decreases faster with increasing LVs than for the other two methods (see Fig. 3).

The selected wavelength regions and the selected predictors for the four responses are shown in Table II and Fig. 4. Compared with MWPLS, the number of variables selected by EN-PLSR is much smaller for each response. This is very important in practice because the final regression model is more parsimonious. It is also interesting to note that some of the wavelength bands selected by the EN-PLSR algorithm are chemically meaningful; for example, one selected band can be ascribed to water absorption30 and the combination band of the O-H bond.16 As is well known, band assignment is difficult because, to the best of our knowledge, each wavelength of an NIR spectrum is influenced by a number of compounds and/or different vibration modes. Nevertheless, some of the wavelength bands selected by EN-PLSR are chemically meaningful according to Table 1 in the appendix of Ref. 31.

FIG. 4. EN-PLSR algorithm on m5spec of corn data set A with the four responses, showing the finally selected variable groups (selected wavelength regions) for each response.
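The LV-selection rule from Ref. 8 applied above is easy to express in code. The sketch below is one straightforward reading of that criterion (our own illustration; the function name and the hypothetical rmsecv_curve input are ours): components are added while each new one still reduces RMSECV by at least 5%.

```python
def select_n_latent_variables(rmsecv_curve, ratio=0.95):
    """Choose the number of LVs: component a is significant if RMSECV_a / RMSECV_(a-1) < ratio.

    rmsecv_curve[a-1] holds the RMSECV obtained with a latent variables (a = 1, 2, ...).
    """
    n_lv = 1
    for a in range(2, len(rmsecv_curve) + 1):
        if rmsecv_curve[a - 1] / rmsecv_curve[a - 2] < ratio:
            n_lv = a          # component a brings a >5% improvement, keep it
        else:
            break             # first non-significant component stops the search
    return n_lv

# Example: a curve that flattens out after a few components
curve = [1.00, 0.70, 0.52, 0.45, 0.44, 0.43]
print(select_n_latent_variables(curve))   # -> 4
```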

Data Set B: Gasoline. We employed the same data pretreatment and criterion for data set B as for data set A, except that we chose the optimal number of LVs corresponding to the minimum RMSECV. Seven variable groups, with 69 predictors in total, were finally selected as key variables (see Fig. 5). The RMSECV values of the full-spectrum PLS, MWPLS, and EN-PLSR methods are shown in Fig. 6 and Table III. The RMSECV values of MWPLS and EN-PLSR are close; however, the number of key variables selected by EN-PLSR is much smaller than the number selected by MWPLS. We note that the variables in each group are strongly correlated, which indirectly supports our algorithm.

TABLE III. RMSECV, selected key predictors, and corresponding wavelength regions of the full-spectrum PLS, MWPLSR, and EN-PLSR methods for the gasoline data set.

FIG. 5. The EN-PLSR algorithm on gasoline data set B, showing the seven finally selected variable groups.

FIG. 6. The curves of RMSECV for the full-spectrum PLS, MWPLS, and EN-PLSR methods on gasoline data set B.

CONCLUSIONS

Variable selection for multi-collinear, strongly correlated data with few observations and many predictors is an important issue in chemometrics, especially in spectroscopic analysis. We propose in this paper a new method, called EN-PLSR, to deal with multi-collinear NIR spectroscopic data. The EN-PLSR algorithm is a variable grouping (or interval) selection method that automatically subgroups the candidate variables and yields a parsimonious model while retaining as much of the relevant information as possible. The predictors in every group selected by the EN-PLSR algorithm are successive and strongly correlated. In the Experimental section we used NIR spectroscopic data to evaluate the algorithm, but the EN-PLSR method can be applied in a wide variety of fields, for example ultraviolet-visible (UV-Vis) and mid-infrared (MIR) spectroscopic data analysis. In order to obtain the optimal solution, three tuning parameters, the L1- and L2-norm penalty coefficients of the elastic net and the number of components in PLSR, should be validated carefully.

ACKNOWLEDGMENTS

I would like to express my gratitude to Prof. Rolf Manne, University of Bergen, Norway, for his advice on this paper. This work was financially supported by the National Natural Science Foundation of P.R. China and the international cooperation project on traditional Chinese medicines of the Ministry of Science and Technology of China (Grant No. 2007DFA40680). The studies meet with the approval of the university's review board.

1. T. Næs and H. Martens, J. Chemom. 2, 155 (1988).
2. M. K. Hartnett, G. Lightbody, and G. W. Irwin, Chemom. Intell. Lab. Syst. 40, 215 (1998).
3. P. Geladi and B. R. Kowalski, Anal. Chim. Acta 185, 1 (1986).
4. A. Höskuldsson, J. Chemom. 2, 211 (1988).
5. F. Lindgren, P. Geladi, and S. Wold, J. Chemom. 7, 45 (1993).
6. S. de Jong and C. J. F. ter Braak, J. Chemom. 8, 169 (1994).
7. B. S. Dayal and J. F. MacGregor, J. Chemom. 11, 73 (1997).
8. S. Wold, M. Sjöström, and L. Eriksson, Chemom. Intell. Lab. Syst. 58, 109 (2001).
9. Q.-S. Xu, Y.-Z. Liang, and H.-L. Shen, J. Chemom. 15, 135 (2001).
10. H. Abdi, Wiley Interdiscip. Rev.: Comput. Stat. 2, 97 (2010).
11. E. Vigneau, M. F. Devaux, E. M. Qannari, and P. Robert, J. Chemom. 11, 239 (1997).

12. F. Despagne, D.-L. Massart, and P. Chabot, Anal. Chem. 72, 1657 (2000).
13. H. H. Thodberg, IEEE Trans. Neural Networks 7, 56 (1996).
14. R. Tibshirani, J. R. Statist. Soc. B 58, 267 (1996).
15. H. Zou and T. Hastie, J. R. Statist. Soc. B 67, 301 (2005).
16. J.-H. Jiang, R. J. Berry, H. W. Siesler, and Y. Ozaki, Anal. Chem. 74, 3555 (2002).
17. T. Chen and E. Martin, Anal. Chim. Acta 631, 13 (2009).
18. H.-D. Li, Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, Anal. Chim. Acta 648, 77 (2009).
19. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Ann. Statist. 32, 407 (2004).
20. M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, Jr., J. R. Marks, and J. R. Nevins, Proc. Natl. Acad. Sci. U.S.A. 98 (2001).
21. T. Hastie, R. Tibshirani, M. Eisen, A. Alizadeh, D. Ross, U. Scherf, J. Weinstein, R. Levy, L. Staudt, W. C. Chan, D. Botstein, and P. Brown, Genome Biol. 1, 1 (2000).
22. R. Díaz-Uriarte, Technical Report, Spanish National Cancer Center (2003).
23. M. Dettling and P. Bühlmann, J. Multivariate Anal. 90, 106 (2004).
24. M. R. Segal, K. D. Dahlquist, and B. R. Conklin, J. Comput. Biol. 10, 961 (2003).
25. J. Fan and R. Li, J. Am. Statist. Assoc. 96, 1348 (2001).
26. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning (Springer-Verlag, New York, 2001).
27. J. Fan and J. Lv, J. R. Statist. Soc. B (Statistical Methodology) 70, 849 (2008).
29. J. H. Kalivas, Chemom. Intell. Lab. Syst. 37, 255 (1997).
30. D. Jouan-Rimbaud, D.-L. Massart, R. Leardi, and O. E. de Noord, Anal. Chem. 67, 4295 (1995).
31. H. W. Siesler, Y. Ozaki, S. Kawata, and H. M. Heise, Eds., Near-Infrared Spectroscopy: Principles, Instruments, Applications (Wiley-VCH, Weinheim, Germany, 2002).

APPENDIX

The Pseudo-code of the EN-PLSR Algorithm.

Input: predictor matrix X = (x_1, x_2, ..., x_p), response vector y in R^{n x 1}, d = 1, m = 30.
Use elastic net to select a variable sequence; denote the indices of the selected variables, in order, by a_1, a_2, ..., a_{p1} (p1 < p);
G_d = {x_{a_1}}
for i = 2 to p1 do
    if (a_i - a_{i-1} = 1 and card(G_d) < m) then
        G_d = {G_d, x_{a_i}}
    else
        d = d + 1; G_d = {x_{a_i}}
    end if
end for
Compute RMSECV_0 of the PLSR model built on all d groups;
for j = 1 to d do
    Compute RMSECV_j of the PLSR model built on G_1, ..., G_{j-1}, G_{j+1}, ..., G_d;
end for
RMSECV_h = min{RMSECV_1, ..., RMSECV_d}, where h is the index attaining the minimum;
while RMSECV_h < RMSECV_0 do
    Leave G_h out of the groups; d = d - 1; RMSECV_0 = RMSECV_h;
    for j = 1 to d do
        Compute RMSECV_j of the PLSR model built on G_1, ..., G_{j-1}, G_{j+1}, ..., G_d;
    end for
    RMSECV_h = min{RMSECV_1, ..., RMSECV_d}, where h is the index attaining the minimum;
end while
Output: RMSECV_0, {G_1, G_2, ..., G_d}.

Note: card(G_d) is the number of elements of the set G_d.
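For readers who want to experiment with the overall procedure, the sketch below is our own compact Python approximation of the two EN-PLSR steps (elastic-net screening, contiguous grouping with a maximum group size m, and recursive leave-one-group-out PLSR). All names (en_plsr, contiguous_groups, rmsecv) are ours, the number of latent variables is held fixed rather than tuned, the elastic net is assumed to keep at least one channel, and scikit-learn estimators stand in for the authors' implementation, so this illustrates the structure of the algorithm under those assumptions rather than reproducing the original code:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def rmsecv(X, y, n_lv, n_folds=10):
    """RMSECV of a PLSR model with n_lv latent variables (cf. Eq. 5)."""
    pls = PLSRegression(n_components=min(n_lv, X.shape[1]))
    y_hat = cross_val_predict(pls, X, y, cv=KFold(n_folds, shuffle=True, random_state=0)).ravel()
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def contiguous_groups(selected_idx, m=30):
    """Split the sorted channel indices kept by elastic net into runs of consecutive channels of size <= m."""
    groups, current = [], [int(selected_idx[0])]
    for idx in map(int, selected_idx[1:]):
        if idx - current[-1] == 1 and len(current) < m:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)
    return groups

def en_plsr(X, y, alpha=0.1, l1_ratio=0.5, m=30, n_lv=5):
    """Two-step EN-PLSR-style selection: elastic-net grouping, then leave-one-group-out PLSR."""
    # Step 1: elastic-net screening; surviving consecutive channels form the candidate groups
    kept = np.flatnonzero(ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=5000).fit(X, y).coef_)
    groups = contiguous_groups(kept, m)

    # Step 2: recursively drop the group whose removal most reduces RMSECV, while it keeps improving
    cols = lambda gs: np.concatenate(gs)
    best = rmsecv(X[:, cols(groups)], y, n_lv)
    while len(groups) > 1:
        scores = [rmsecv(X[:, cols(groups[:h] + groups[h + 1:])], y, n_lv) for h in range(len(groups))]
        h = int(np.argmin(scores))
        if scores[h] >= best:          # no left-out group improves RMSECV_0, so stop
            break
        best, _ = scores[h], groups.pop(h)
    return groups, best
```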
