Multivariate Calibration with Robust Signal Regression
Bin Li and Brian Marx (Louisiana State University), Somsubhra Chakraborty (Indian Institute of Technology Kharagpur), and David C. Weindorf (Texas Tech University)
July 31, 2018
Outline
- Motivating example
- Recap: penalized signal regression (PSR)
- Generalized Huber loss and robust PSR
- Simulation and empirical results
- Related issues
A Soil Data Example
- Data: 675 soil samples collected from CA, NE, and TX in 2014, with 225 samples per location.
- All soil samples were scanned with a portable VisNIR spectroradiometer over a spectral range of 350 to 2500 nm.
- Ten physicochemical properties were measured: cation exchange capacity (CEC), total nitrogen level, electrical conductivity (EC), total carbon level, loss on ignition (LOI), soil organic matter (SOM), clay, sand, silt, and soil pH level. LOI and SOM are highly correlated, so LOI was removed.
- Objective: use the VisNIR spectra to predict the soil properties.
Sample Spectra
[Figure: thirty sample spectra (first derivative) for the soil data; x-axis: wavelength (nm), 500-2500.]
Penalized Signal Regression
PSR: P. Eilers and B. D. Marx (Statistical Science, 1996). PSR minimizes the objective

    S(α) = ||y − XBα||² + λ||D_d α||²,

where the difference matrix D_d penalizes d-th order differences of α. Writing U = XB, the closed-form solution for α is

    α̂ = (UᵀU + λ D_dᵀ D_d)⁻¹ Uᵀ y.

- Response y: soil property indicator (m × 1 column vector, m = 675).
- Input X: VisNIR spectra, an m × p matrix, p = 214.
- B-spline basis matrix B: p × n, n = 100.
- Difference matrix D_d: (n − d) × n, where d is the order of the difference penalty (d = 0, 1, 2, 3).
- Coefficient vector α: n × 1.
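The closed form above is a ridge-type solve. A minimal numerical sketch, with a simple truncated-linear basis standing in for the authors' B-spline setup and random data standing in for the soil spectra (both are illustrative assumptions, not the actual analysis):

```python
import numpy as np

def psr_fit(X, y, B, lam, d=2):
    """Penalized signal regression closed form.
    X: (m, p) signal matrix; B: (p, n) basis matrix; lam: penalty weight;
    d: order of the difference penalty."""
    U = X @ B                                # (m, n) reduced design matrix
    n = B.shape[1]
    D = np.diff(np.eye(n), n=d, axis=0)      # (n - d, n) difference matrix D_d
    alpha = np.linalg.solve(U.T @ U + lam * D.T @ D, U.T @ y)
    return alpha, B @ alpha                  # basis coefficients, coefficient curve

# toy dimensions matching the slide: m = 675 samples, p = 214, n = 100
rng = np.random.default_rng(0)
m, p, n = 675, 214, 100
X = rng.normal(size=(m, p))
grid, knots = np.linspace(0, 1, p), np.linspace(0, 1, n)
B = np.maximum(grid[:, None] - knots[None, :], 0.0)  # stand-in for B-splines
y = rng.normal(size=m)
alpha_hat, beta_hat = psr_fit(X, y, B, lam=10.0)
```

Note that only the n × n system (U ᵀU + λD ᵀD) is inverted, so the fit stays cheap even when p is large.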
Q-Q Plot of PSR Residuals
[Figure: normal quantile-quantile plots of the residuals from the PSR models for the nine soil property indicators; the heavy-tailed panels motivate a robust fit.]
With vs. Without Outliers on PSR and Robust PSR
[Figure: coefficient curves over wavelength (nm) and predicted-vs-measured plots for PSR (top) and rpsr (bottom), each fitted with and without outliers.]
Generalized Huber Loss
The generalized Huber loss is

    ρ_η(e) = e²,                    |e| < K,
             K² + 2ηK(|e| − K),     |e| ≥ K,

with 0 ≤ η ≤ 1.
[Figure: ρ_η(e) for η = 1, 0.5, and 0, compared with the squared loss e²; the loss grows linearly beyond K for η = 1 and is constant at K² beyond K for η = 0.]
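The piecewise loss can be written directly; a small sketch with illustrative values of K and η:

```python
import numpy as np

def gen_huber(e, K, eta):
    """Generalized Huber loss: quadratic inside (-K, K); beyond K it grows
    linearly with slope 2*eta*K (eta = 1: Huber-like; eta = 0: flat at K^2)."""
    e = np.asarray(e, dtype=float)
    return np.where(np.abs(e) < K, e**2, K**2 + 2 * eta * K * (np.abs(e) - K))

print(gen_huber([1.0, 3.0], K=2.0, eta=0.5))  # → [1. 6.]
```

At e = 3 with K = 2 and η = 0.5 the value is 2² + 2(0.5)(2)(3 − 2) = 6, i.e. the excess residual is charged at the reduced slope 2ηK rather than quadratically.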
Robust Penalized Signal Regression (rpsr)
The rpsr estimator minimizes

    Q(α) = Σ_{i=1}^m ρ_η(y_i − uᵢᵀα) + λ αᵀ D_dᵀ D_d α,

which can be represented as a difference of two convex functions,

    Q(α) = h₁(α) − h₂(α),

where, with e_i = y_i − uᵢᵀα,

    h₁(α) = Σ_{i=1}^m e_i² + λ αᵀ D_dᵀ D_d α,
    h₂(α) = Σ_{i=1}^m I(|e_i| > K) [e_i² + 2ηK(K − |e_i|) − K²].
Difference Convex Programming
Difference convex (D.C.) programming: An and Tao (1997). Consider minimizing a nonconvex objective g(w) = g₁(w) − g₂(w), where both g₁(w) and g₂(w) are convex in w. D.C. programming constructs a sequence of convex subproblems and solves them iteratively: given the solution w^(m−1) of the (m−1)-th subproblem, the m-th subproblem solves

    w^(m) = argmin_w [ g₁(w) − g₂(w^(m−1)) − ⟨w − w^(m−1), ∂g₂(w^(m−1))⟩ ]
          = argmin_w [ g₁(w) − ⟨w, ∂g₂(w^(m−1))⟩ ],

where ∂g₂(w^(m−1)) is a subgradient of g₂(w) at w^(m−1).
Robust PSR Algorithm
Minimizing the rpsr objective reduces to minimizing a sequence of PSR problems with the adjusted responses

    Y_A = ( y₁ − I(|e₁| > K)[e₁ − ηK sign(e₁)], ..., y_m − I(|e_m| > K)[e_m − ηK sign(e_m)] )ᵀ,   an m × 1 vector.

- Only observations whose residuals exceed K in absolute value are adjusted.
- If K is greater than all the residuals {|e_i|}, the rpsr and PSR solutions coincide.
Robust PSR Algorithm (cont.)
- The initial α̂ is the PSR estimate (with the same value of λ).
- The algorithm stops when max_{1 ≤ j ≤ n} |(α̂_j^cur − α̂_j^pre) / α̂_j^pre| < 10⁻⁶.
- The cutoff value K is chosen by the 1.5 × IQR rule on the residuals at each iteration.
- The optimal values of λ and η are chosen by grid search based on cross-validation performance.
- The rpsr algorithm usually converges within a few iterations.
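The algorithm on the two preceding slides can be sketched as follows. The slides do not spell out the exact form of the 1.5 × IQR cutoff, so K = Q3 + 1.5 × IQR of the current residuals is an assumed reading, and the toy data are illustrative:

```python
import numpy as np

def rpsr_fit(U, y, D, lam, eta, tol=1e-6, max_iter=100):
    """Robust PSR: repeatedly solve PSR with adjusted responses.
    U = X B (m x n); D = difference matrix ((n - d) x n)."""
    P = np.linalg.solve(U.T @ U + lam * D.T @ D, U.T)  # fixed PSR solver matrix
    alpha = P @ y                                      # initial PSR estimate
    for _ in range(max_iter):
        e = y - U @ alpha
        q1, q3 = np.percentile(e, [25, 75])
        K = q3 + 1.5 * (q3 - q1)                       # assumed 1.5*IQR cutoff
        flag = np.abs(e) > K                           # observations to adjust
        y_adj = y - flag * (e - eta * K * np.sign(e))  # adjusted responses Y_A
        alpha_new = P @ y_adj                          # PSR on Y_A
        rel = np.abs(alpha_new - alpha) / np.maximum(np.abs(alpha), 1e-12)
        alpha = alpha_new
        if rel.max() < tol:                            # relative-change stop rule
            break
    return alpha

rng = np.random.default_rng(1)
m, n = 60, 20
U = rng.normal(size=(m, n))
D = np.diff(np.eye(n), n=2, axis=0)
y = U @ rng.normal(size=n) + rng.standard_t(2, size=m)  # heavy-tailed noise
alpha_hat = rpsr_fit(U, y, D, lam=1.0, eta=0.5)
```

Because the penalty and design are fixed, the solver matrix P is computed once, and each iteration is only a matrix-vector product, which is why convergence in a few iterations makes the whole procedure cheap.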
Simulation Studies
- Underlying model: Y_i = f(x_i) + ε_i.
- f(x_i): the PSR fitted value on CEC with λ = 10⁵.
- Three error distributions for ε_i:
  - Normal: e_i ~ N(0, 2.39²).
  - Mixed normal: e_i ~ 0.95 N(0, 2.39²) + 0.05 N(0, 23.9²).
  - Slash: e_i ~ N(0, 1)/U(0, 1).
- Three levels of η are considered: 0, 0.5, and 1.
- 10-fold CV to find the optimal value of λ.
- 50 random splits of the data: 75% training and 25% test sets.
- Comparative RMSE and MAE on the test samples.
Simulation Results
[Figure: boxplots of comparative RMSE (top) and comparative MAE (bottom) for PSR, rpsr (η = 1), rpsr (η = 0.5), and rpsr (η = 0) under normal, mixed normal, and slash errors.]
Simulation Results (cont.)
Average test RMSE and MAE over 50 replications.

RMSE     PSR      rpsr (η = 1)   rpsr (η = 0.5)   rpsr (η = 0)
Normal   0.696    0.694          0.699            0.709
Mixed    1.422    0.897          0.820            0.800
Slash    13.569   1.728          1.452            1.310

MAE      PSR      rpsr (η = 1)   rpsr (η = 0.5)   rpsr (η = 0)
Normal   0.440    0.437          0.442            0.446
Mixed    0.885    0.545          0.508            0.502
Slash    7.646    1.022          0.850            0.792
Model Stability
- Three error distributions as above: normal, mixed normal, and slash.
- Three levels of η are considered: 0, 0.5, and 1.
- PSR and rpsr are fitted on 95% random subsamples with λ = 10⁵.
- 20 random splits of the data.
- Stability of the coefficient estimates: the L₂-distance standard deviation (L₂DSD) criterion

      L₂DSD = SD({ ||β̂^(i) − β̄||₂ }_{i=1}^{20}),

  where β̄ is the average of the β̂'s over the 20 replications.
- Stability of prediction: SD of the predicted values over all 675 samples.
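The L₂DSD criterion is straightforward to compute; a minimal sketch (the function name is mine):

```python
import numpy as np

def l2_dsd(beta_hats):
    """SD of the L2 distances from each replicate estimate to their average.
    beta_hats: (replications, p) array of coefficient estimates."""
    B = np.asarray(beta_hats, dtype=float)
    dists = np.linalg.norm(B - B.mean(axis=0), axis=1)  # one distance per replicate
    return dists.std(ddof=1)

# identical estimates across replications imply zero instability
print(l2_dsd(np.ones((20, 5))))  # → 0.0
```

A smaller L₂DSD means the estimated coefficient curve varies less across the resampled fits, which is how the tables on the next slide compare PSR with rpsr.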
Simulation Results (cont.)
Summary of L₂DSD of β̂ and SD of ŷ based on 20 replications.

L₂DSD of β̂   PSR    rpsr (η = 1)   rpsr (η = 0.5)   rpsr (η = 0)
Normal        202    200            191              186
Mixed         1247   402            359              377
Slash         2495   442            416              512

SD of ŷ       PSR     rpsr (η = 1)   rpsr (η = 0.5)   rpsr (η = 0)
Normal        0.126   0.126          0.135            0.148
Mixed         0.361   0.150          0.148            0.152
Slash         0.728   0.275          0.224            0.237
Soil Data Study
- 50 random splits: 75% training and 25% test sets.
- Three levels of η are considered: 0, 0.5, and 1.
- RMSE on the test samples.

Property   PSR       rpsr (η = 1)   rpsr (η = 0.5)   rpsr (η = 0)
CEC        2.622     2.590          2.595            2.624
EC         290.2     286.7          287.9            290.1
Nitrogen   0.01848   0.01815        0.01806          0.01818
Carbon     0.1817    0.1794         0.1782           0.1795
SOM        0.3383    0.3284         0.3269           0.3288
Clay       3.240     3.117          3.138            3.204
Sand       5.425     5.362          5.397            5.479
Silt       4.473     4.412          4.413            4.438
pH         0.3740    0.3729         0.3731           0.3782
Soil Data Study (cont.)
MAE on the test samples.

Property   PSR       rpsr (η = 1)   rpsr (η = 0.5)   rpsr (η = 0)
CEC        1.795     1.749          1.737            1.747
EC         211.0     206.5          205.5            205.8
Nitrogen   0.01265   0.01240        0.01227          0.01232
Carbon     0.1313    0.1300         0.1287           0.1290
SOM        0.2065    0.1946         0.1924           0.1933
Clay       2.233     2.105          2.101            2.124
Sand       4.011     3.940          3.946            3.977
Silt       3.325     3.291          3.289            3.302
pH         0.2842    0.2828         0.2821           0.2855
Soil Data Study (cont.)
- Compare the PSR and rpsr coefficients and identify the outliers.
- Data: all 675 samples used as the training set; Y: carbon.
- PSR vs. the rpsr model with η = 0.5.
- The leading two PCs explain 79.8% of the total variance.
- rpsr identifies 17 outliers (about 2.5% of the total samples).
[Figure: score plot of the first two PCs (NE, CA, TX), predicted-vs-measured carbon, and the PSR and rpsr coefficient curves over wavelength (nm).]
Connection With Lee and Oh's Procedure (2007)
- Lee and Oh (2007) explored robust penalized regression splines using the Huber loss.
- They proposed an iterative fitting procedure based on the pseudo-response

      ỹ_i = ŷ_i + ψ(e_i)/2,

  where ψ(·) is the first derivative of the Huber loss ρ_H(·).
- We can show that the pseudo-response ỹ_i is equivalent to our adjusted response Y_A with η = 1:

      ỹ_i = y_i − I(|e_i| > K)[e_i − ηK sign(e_i)].

- Lee and Oh's approach is theoretically supported by Cox's (1983) result, which requires ψ(·) to be second-order differentiable.
- The proposed rpsr procedure is a generalization of Lee and Oh's procedure, motivated from a different perspective.
References
An, L. and Tao, P. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. Journal of Global Optimization, 11, 253-285.
Cox, D. (1983). Asymptotics for M-type smoothing splines. The Annals of Statistics, 11(2), 530-551.
Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89-121.
Lee, T. and Oh, H. (2007). Robust penalized regression spline fitting with application to additive mixed modeling. Computational Statistics, 22(1), 159-171.
Li, B. and Marx, B. (2018). Multivariate calibration with robust signal regression. Statistical Modelling: An International Journal, accepted.