Multivariate Calibration with Robust Signal Regression

1 Multivariate Calibration with Robust Signal Regression

Bin Li and Brian Marx (Louisiana State University), Somsubhra Chakraborty (Indian Institute of Technology Kharagpur), and David C. Weindorf (Texas Tech University). July 31.

2 Outline

- Motivating example.
- Recap: Penalized Signal Regression (PSR).
- Generalized Huber loss and robust PSR.
- Simulation and empirical results.
- Related issues.

3 A Soil Data Example

- Data: 675 soil samples collected from CA, NE, and TX in 2014, 225 samples per location.
- All soil samples were scanned using a portable VisNIR spectroradiometer with a spectral range of 350 to 2500 nm.
- Ten physicochemical properties were measured: soil cation exchange capacity (CEC), total nitrogen level, electrical conductivity (EC), total carbon level, loss on ignition (LOI), soil organic matter (SOM), clay, sand, silt, and soil pH level. LOI and SOM are highly correlated, so LOI was removed.
- Objective: use the VisNIR spectra to predict the soil properties.

4 Sample Spectra

[Figure: thirty sample spectra (first derivative) for the soil data; horizontal axis is wavelength (nm).]

5 Penalized Signal Regression

PSR: P. Eilers and B.D. Marx (Statistical Science, 1996). PSR minimizes the objective
$$S(\alpha) = \|y - XB\alpha\|^2 + \lambda \|D\alpha\|^2,$$
where the difference matrix $D$ penalizes differences of $\alpha$. With $U = XB$, the closed-form solution for $\alpha$ is
$$\hat{\alpha} = (U'U + \lambda D'D)^{-1} U'y.$$

- Response $y$: soil property indicator ($m \times 1$ column vector, $m = 675$).
- Input $X$: VisNIR spectra, $m \times p$ matrix, $p = 214$.
- B-spline basis matrix $B$: $p \times n$ matrix, $n = 100$.
- Difference matrix $D$: $(n - d) \times n$, where $d$ is the order of the difference penalty ($d = 0, 1, 2, 3$).
- Coefficient vector $\alpha$: $n \times 1$.
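As a concrete illustration, here is a minimal Python sketch of the PSR closed form above. The helper name `psr_fit` and the knot construction are my own assumptions, not details from the slides; `BSpline.design_matrix` requires scipy 1.8 or later.

```python
import numpy as np
from scipy.interpolate import BSpline  # design_matrix needs scipy >= 1.8

def psr_fit(X, y, wavelengths, n_basis=100, d=2, lam=1.0):
    """PSR closed form: alpha_hat = (U'U + lam * D'D)^(-1) U'y, with U = X @ B."""
    k = 3  # cubic B-splines
    # Knot vector with repeated boundary knots so the basis has n_basis columns.
    lo, hi = wavelengths.min(), wavelengths.max()
    interior = np.linspace(lo, hi, n_basis - k + 1)[1:-1]
    t = np.r_[[lo] * (k + 1), interior, [hi] * (k + 1)]
    B = BSpline.design_matrix(wavelengths, t, k).toarray()   # p x n_basis
    U = X @ B                                                # m x n_basis
    D = np.diff(np.eye(n_basis), n=d, axis=0)                # (n_basis - d) x n_basis
    alpha = np.linalg.solve(U.T @ U + lam * (D.T @ D), U.T @ y)
    return B, alpha  # smooth coefficient curve over wavelength: B @ alpha
```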

6 Q-Q Plot of PSR Residuals

[Figure: normal quantile-quantile plots of the residuals (from PSR models) for nine soil property indicators; panel titles include CEC, Carbon, EC, LOI, Nitrogen, SOM, Clay, Sand, and Silt.]

7 With vs. Without Outliers on PSR and Robust PSR

[Figure: for PSR and rPSR, coefficient curves over wavelength (nm) fitted with and without outliers, alongside predicted-vs-measured plots.]

8 Generalized Huber Loss

The generalized Huber loss is
$$\rho_\eta(e) = \begin{cases} e^2, & |e| < K \\ K^2 + 2\eta K(|e| - K), & |e| \ge K, \end{cases} \qquad 0 \le \eta \le 1.$$

[Figure: $\rho_\eta(e)$ plotted against $e$ for $\eta = 1$, $\eta = 0.5$, and $\eta = 0$, with the squared loss $e^2$ shown for reference.]
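A minimal sketch of this loss in Python (the function name is illustrative). Note that $\eta = 1$ recovers the Huber loss up to a factor of 2, while $\eta = 0$ caps the loss at $K^2$, giving a truncated squared loss.

```python
import numpy as np

def gen_huber(e, K, eta):
    """Generalized Huber loss: e^2 for |e| < K, else K^2 + 2*eta*K*(|e| - K).
    eta = 1 gives the Huber loss (up to a factor of 2); eta = 0 caps it at K^2."""
    e = np.asarray(e, dtype=float)
    return np.where(np.abs(e) < K, e ** 2, K ** 2 + 2 * eta * K * (np.abs(e) - K))
```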

9 Robust Penalized Signal Regression (rPSR)

The rPSR estimator minimizes
$$Q(\alpha) = \sum_{i=1}^m \rho_\eta(y_i - U_i'\alpha) + \lambda \alpha' D_d' D_d \alpha,$$
which can be represented as a difference of two convex functions, $Q(\alpha) = h_1(\alpha) - h_2(\alpha)$, where
$$h_1(\alpha) = \sum_{i=1}^m e_i^2 + \lambda \alpha' D'D\alpha, \qquad h_2(\alpha) = \sum_{i=1}^m I(|e_i| > K)\left[e_i^2 + 2\eta K(K - |e_i|) - K^2\right].$$

10 Difference Convex Programming

Difference convex (D.C.) programming: An and Tao (1997). Consider minimizing a nonconvex objective function $g(w) = g_1(w) - g_2(w)$, where both $g_1(w)$ and $g_2(w)$ are convex in $w$. D.C. programming constructs a sequence of convex subproblems and solves them iteratively. Given the solution $w^{(m-1)}$ of the $(m-1)$th subproblem, the $m$th subproblem solves
$$w^{(m)} = \arg\min_w \left[ g_1(w) - g_2(w^{(m-1)}) - \langle w - w^{(m-1)}, \nabla g_2(w^{(m-1)}) \rangle \right] = \arg\min_w \; g_1(w) - \langle w, \nabla g_2(w^{(m-1)}) \rangle,$$
where $\nabla g_2(w^{(m-1)})$ is a subgradient of $g_2(w)$ at $w^{(m-1)}$.
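To make the iteration concrete, here is a generic D.C. sketch; the function names and the assumption that the caller supplies a subgradient oracle and a subproblem solver are illustrative, not from the slides.

```python
import numpy as np

def dc_minimize(solve_linearized, grad_g2, w0, max_iter=100, tol=1e-8):
    """Generic D.C. iteration: linearize g2 at the current iterate, then solve
    the convex subproblem. solve_linearized(v) is assumed to return
    argmin_w g1(w) - <w, v>, which is all the update above requires."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_new = solve_linearized(grad_g2(w))   # one convex subproblem per step
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```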

11 Robust PSR Algorithm

Minimizing the rPSR objective reduces to minimizing a sequence of PSR problems with the adjusted response $Y_A$:
$$Y_A = \begin{pmatrix} y_1 - I(|e_1| > K)\,[e_1 - \eta K\,\mathrm{sign}(e_1)] \\ \vdots \\ y_m - I(|e_m| > K)\,[e_m - \eta K\,\mathrm{sign}(e_m)] \end{pmatrix}_{m \times 1}.$$

- Only observations with residuals greater than $K$ in absolute value are adjusted.
- If $K$ is greater than all the residuals $|e_i|$, then the rPSR and PSR solutions coincide.

12 Robust PSR Algorithm (cont.)

- The initial $\hat\alpha$ is the PSR estimate (with the same value of $\lambda$).
- The algorithm stops when $\max_{j=1,\dots,n} |(\hat\alpha_j^{cur} - \hat\alpha_j^{pre})/\hat\alpha_j^{pre}|$ falls below a small tolerance.
- The cutoff value $K$ is chosen by the 1.5 IQR rule on the residuals at each iteration.
- The optimal values of $\lambda$ and $\eta$ are found by grid search on cross-validation performance.
- The rPSR algorithm usually converges within just a few iterations. A sketch of the full loop follows below.
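Putting the pieces together, here is a hedged Python sketch of the iteration. The symmetric-fence reading of the 1.5 IQR rule, the tolerance default, and the reuse of the `psr_fit` helper sketched earlier are my assumptions, not details from the slides.

```python
import numpy as np

def rpsr_fit(X, y, wavelengths, lam=1.0, eta=0.5, n_basis=100, d=2,
             tol=1e-4, max_iter=50):
    """Sketch of rPSR: iterate PSR fits on the adjusted response Y_A.
    Relies on the psr_fit helper sketched earlier; tol is an assumed default."""
    B, alpha = psr_fit(X, y, wavelengths, n_basis, d, lam)  # PSR initialization
    U = X @ B
    for _ in range(max_iter):
        e = y - U @ alpha
        # Cutoff K from one symmetric reading of the 1.5*IQR rule on residuals.
        q1, q3 = np.percentile(e, [25, 75])
        iqr = q3 - q1
        K = max(abs(q1 - 1.5 * iqr), abs(q3 + 1.5 * iqr))
        # Adjusted response: only observations with |e_i| > K are modified.
        y_adj = y - np.where(np.abs(e) > K, e - eta * K * np.sign(e), 0.0)
        alpha_new = psr_fit(X, y_adj, wavelengths, n_basis, d, lam)[1]
        rel_change = np.max(np.abs((alpha_new - alpha) / (np.abs(alpha) + 1e-12)))
        alpha = alpha_new
        if rel_change < tol:
            break
    return B, alpha
```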

13 Simulation Studies

- Underlying model: $Y_i = f(x_i) + \epsilon_i$, where $f(x_i)$ is the PSR fitted value on CEC.
- Three error distributions for $\epsilon_i$: normal, mixed normal (a $0.95/0.05$ contamination mixture of two normals), and the slash distribution, $\epsilon_i \sim N(0,1)/U(0,1)$.
- Three levels of $\eta$ are considered: 0, 0.5, and 1.
- Cross-validation to find the optimal value of $\lambda$.
- 50 random splits of the dataset: 75% training and 25% test sets.
- Comparative RMSE and MAE on test samples.
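For reference, a sketch of the three error generators. The variance values on the original slide did not survive transcription, so `sigma` and `contam_scale` below are placeholders, not the slide's settings.

```python
import numpy as np

def sim_errors(n, kind, sigma=1.0, contam_scale=10.0, rng=None):
    """Draw n errors. sigma and contam_scale are placeholder values: the
    variances on the original slide were lost in transcription."""
    rng = rng or np.random.default_rng(0)
    if kind == "normal":
        return rng.normal(0.0, sigma, n)
    if kind == "mixed":  # 0.95 N(0, sigma^2) + 0.05 N(0, (contam_scale*sigma)^2)
        contaminated = rng.random(n) < 0.05
        return np.where(contaminated,
                        rng.normal(0.0, contam_scale * sigma, n),
                        rng.normal(0.0, sigma, n))
    if kind == "slash":  # N(0, 1) / U(0, 1): very heavy tails
        return rng.standard_normal(n) / rng.random(n)
    raise ValueError(f"unknown error distribution: {kind}")
```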

14 Simulation Results

[Figure: boxplots of comparative RMSE and MAE for PSR, rPSR ($\eta = 1$), rPSR ($\eta = 0.5$), and rPSR ($\eta = 0$) under the normal, mixed normal, and slash error distributions.]

15 Simulation Results (cont.)

Average test RMSEs and MAEs based on 50 replications.

[Table: rows Normal, Mixed, Slash; columns PSR, rPSR ($\eta = 1$), rPSR ($\eta = 0.5$), rPSR ($\eta = 0$); one block for RMSE and one for MAE. The numeric entries did not survive transcription.]

16 Model Stability

- Three error distributions as above: normal, mixed normal, and slash.
- Three levels of $\eta$ are considered: 0, 0.5, and 1.
- PSR and rPSR are fitted on 95% of a random sample of the data.
- 20 random splits of the dataset.
- Stability of the coefficient estimates is evaluated by the $L_2$ distance standard deviation ($L_2$DSD) criterion,
$$L_2\mathrm{DSD} = \mathrm{SD}\left(\{\|\hat\beta^{(i)} - \bar\beta\|_2\}_{i=1}^{20}\right),$$
where $\bar\beta$ is the average $\hat\beta$ over the 20 replications.
- Stability of the predictions is evaluated by the SD of the predicted values on all 675 samples.
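The criterion is just the standard deviation of the $L_2$ distances to the mean coefficient vector; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def l2_dsd(beta_hats):
    """L2DSD: the SD of ||beta_hat^(i) - beta_bar||_2 over replications.
    beta_hats has shape (n_reps, p), one estimated coefficient vector per row."""
    beta_bar = beta_hats.mean(axis=0)  # average beta_hat over replications
    dists = np.linalg.norm(beta_hats - beta_bar, axis=1)
    return dists.std(ddof=1)
```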

17 Simulation Results (cont.)

Summary of the $L_2$DSD of $\hat\beta$ and the SD of $\hat y$ based on 20 replications.

[Table: rows Normal, Mixed, Slash; columns PSR, rPSR ($\eta = 1$), rPSR ($\eta = 0.5$), rPSR ($\eta = 0$); one block for $L_2$DSD and one for the SD of $\hat y$. The numeric entries did not survive transcription.]

18 Soil Data Study

- 50 random splits: 75% training and 25% test sets.
- Three levels of $\eta$ are considered: 0, 0.5, and 1.
- RMSE on test samples.

[Table: rows CEC, EC, Nitrogen, Carbon, SOM, Clay, Sand, Silt, pH; columns PSR, rPSR ($\eta = 1$), rPSR ($\eta = 0.5$), rPSR ($\eta = 0$). The numeric entries did not survive transcription.]

19 Soil Data Study (cont.)

MAE on test samples.

[Table: rows CEC, EC, Nitrogen, Carbon, SOM, Clay, Sand, Silt, pH; columns PSR, rPSR ($\eta = 1$), rPSR ($\eta = 0.5$), rPSR ($\eta = 0$). The numeric entries did not survive transcription.]

20 Soil Data Study (cont.)

- Compare the PSR and rPSR coefficients and identify the outliers.
- Data: all 675 samples used as the training set; response Y: Carbon.
- PSR vs. the rPSR model with $\eta = 0.5$.
- The leading two PCs explain 79.8% of the total variance.
- rPSR identifies 17 outliers (about 2.5% of the total samples).

[Figure: PC1 vs. PC2 score plot by location (NE, CA, TX); predicted vs. measured carbon; PSR and rPSR coefficient curves over wavelength (nm).]

21 Connection With Lee and Oh's Procedure (2007)

- Lee and Oh (2007) explored robust penalized regression splines using the Huber loss. They proposed an iterative fitting procedure based on the pseudo-response $\tilde y$:
$$\tilde y_i = \hat y_i + \frac{\psi(e_i)}{2},$$
where $\psi(\cdot)$ is the first derivative of the Huber loss $\rho_H(\cdot)$.
- We can show that the pseudo-response $\tilde y_i$ is equivalent to our adjusted response $Y_A$ with $\eta = 1$:
$$\tilde y_i = y_i - I(|e_i| > K)\,[e_i - \eta K\,\mathrm{sign}(e_i)].$$
- Lee and Oh's approach is theoretically supported by Cox's result (1983), which requires $\psi(\cdot)$ to be twice differentiable.
- The proposed rPSR procedure is a generalization of Lee and Oh's procedure, motivated from a different perspective.
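The claimed equivalence is easy to spot-check numerically; the helper names below are illustrative, and $\psi$ is taken as the derivative of the $e^2$-scaled ($\eta = 1$) Huber loss used on slide 8.

```python
import numpy as np

def pseudo_response(y, yhat, K):
    """Lee-Oh pseudo-response y_tilde = yhat + psi(e)/2, with psi the derivative
    of the eta = 1 generalized Huber loss: 2e if |e| < K, else 2K sign(e)."""
    e = y - yhat
    psi = np.where(np.abs(e) < K, 2.0 * e, 2.0 * K * np.sign(e))
    return yhat + psi / 2.0

def adjusted_response(y, yhat, K, eta):
    """rPSR adjusted response Y_A from slide 11."""
    e = y - yhat
    return y - np.where(np.abs(e) > K, e - eta * K * np.sign(e), 0.0)

# Spot check: with eta = 1 the two constructions coincide.
rng = np.random.default_rng(0)
y, yhat = rng.normal(size=8), rng.normal(size=8)
assert np.allclose(pseudo_response(y, yhat, 0.5),
                   adjusted_response(y, yhat, 0.5, eta=1.0))
```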

22 References

An, L. and Tao, P. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. Journal of Global Optimization, 11.
Cox, D. (1983). Asymptotics for M-type smoothing splines. The Annals of Statistics, 11(2).
Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2).
Lee, T. and Oh, H. (2007). Robust penalized regression spline fitting with application to additive mixed modeling. Computational Statistics, 22(1).
Li, B. and Marx, B. (2018). Multivariate calibration with robust signal regression. Accepted in Statistical Modelling: An International Journal.
