PLS: theoretical results for the chemometrics use of PLS
Liliana Forzani, Facultad de Ingeniería Química, UNL, Argentina
Joint work with R. Dennis Cook
Example in chemometrics
A concrete situation: y is a chemical variable (protein content or fat content) and x = (x₁, ..., x_p) are absorptions or reflectances measured at p different wavelengths using some kind of spectroscopic instrument. We have simultaneous measurements of x and y on n chemical samples (the calibration set), and we want to use these measurements to predict y from x measured on new specimens.
Setting in chemometrics
Goal: predict a random variable y from p random variables x = (x₁, ..., x_p).
Statistical model: linear regression

    y = μ_y + βᵀ(x − μ_x) + ε,   (1)

with ε ~ N(0, σ²_ε), σ²_ε = σ²_y − σᵀΣ⁻¹σ, where σ²_y = var(y), Σ = cov(x), σ = cov(x, y).
Least squares solution: β = Σ⁻¹σ.
Algorithm
PLS started as an algorithm to avoid (when n < p) the problem of inverting the covariance matrix Σ to estimate β = Σ⁻¹σ. Let us call that estimator β̂_PLS.
It was set in motion by Herman Wold in the late 1960s to address problems in path modeling, and was adapted in 1977 by Svante Wold for prediction in chemometrics. It was an easy-to-compute algorithm that worked well in chemometrics even when p > n.
PLS algorithm. Martens and Naes (1989)
The algorithm in this version is as follows:
- Choose a d (there are ways to choose d).
- Compute Ŝ = {σ̂, Σ̂σ̂, ..., Σ̂^{d−1}σ̂}, with σ̂ and Σ̂ the sample versions of σ and Σ.
- Choose β̂ ∈ span(Ŝ) to minimize the squared error of Y − Xβ̂.
Questions: How to choose d? What if d = p? Does the algorithm converge, and to what?
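The steps above can be sketched directly from the Krylov form of the estimator (a minimal numpy sketch, not the numerically preferred NIPALS/SIMPLS implementations; the function name and toy data are mine):

```python
import numpy as np

def pls_krylov(X, y, d):
    """PLS coefficient estimate via the Krylov form
    beta_hat = S (S' Sigma_hat S)^{-1} S' sigma_hat, with
    S = [sigma_hat, Sigma_hat sigma_hat, ..., Sigma_hat^{d-1} sigma_hat]."""
    Xc = X - X.mean(axis=0)                 # center predictors
    yc = y - y.mean()                       # center response
    Sigma = Xc.T @ Xc / len(y)              # sample covariance of x
    sigma = Xc.T @ yc / len(y)              # sample cov(x, y)
    S = np.column_stack([np.linalg.matrix_power(Sigma, k) @ sigma
                         for k in range(d)])
    # beta restricted to span(S) with minimum squared error
    return S @ np.linalg.solve(S.T @ Sigma @ S, S.T @ sigma)
```

With d = p (and Σ̂ nonsingular) the Krylov subspace generically fills all of R^p, so the estimator reduces to the least squares solution Σ̂⁻¹σ̂; smaller d gives the usual PLS fits.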
PLS works
The chemometrics community has been using PLS for calibration ever since.
Chemometricians tend not to address population PLS models or regression coefficients, but instead deal directly with predictions resulting from PLS algorithms.
The method works even for n < p, but there was no consistent theory to support the claim of why it works and where it is going.
The statistics community did not pay attention to PLS (maybe this is why there were no asymptotics).
The PLS tradition is perhaps more akin to conventions in machine learning or data science than to statistical customs. There is now a vast literature on PLS within chemometrics, some of it refining and extending the methodology and some of it affirming it, like the paper "PLS works" by Bro and Eldén (2009).
Constraints on the parameters. Helland
A statistician appears. Helland (1990) realized that the algorithm was a plug-in estimator for β of the form

    β_PLS = S(SᵀΣS)⁻¹Sᵀσ,   (2)   with S = {σ, Σσ, ..., Σ^{d−1}σ},

i.e.

    β̂_PLS = Ŝ(ŜᵀΣ̂Ŝ)⁻¹Ŝᵀσ̂.   (3)

As a consequence, for p fixed, β̂_PLS is a consistent estimator of β_PLS. There was still a mystery about the shape of β_PLS in (2).
But the statisticians did show up
The mystery disappears: a model in the population (Cook, Helland and Su, 2013).
Idea: when d = 1, PLS in the population gives β = σ(σᵀΣσ)⁻¹σᵀσ.
1. Hence β = cσ for some scalar c.
2. Recall that β = Σ⁻¹σ.
3. (1) and (2) together: Σ⁻¹σ = cσ, i.e. Σσ = (1/c)σ.
4. Then σ is one of the eigenvectors of Σ.
Moreover, if d > 1, PLS in the population means that β involves only d eigenvectors of Σ, and therefore β can be enveloped by the span of d eigenvectors of Σ.
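The d = 1 identity can be checked numerically (a small sketch; the eigenvalues below and the function name are made-up illustration values):

```python
import numpy as np

# Sigma with known orthonormal eigenvectors; eigenvalues are illustrative numbers.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # columns: orthonormal eigenvectors
lam = np.array([4.0, 3.0, 2.0, 1.0, 0.5])
Sigma = Q @ np.diag(lam) @ Q.T

def pop_pls_d1(sigma, Sigma):
    """Population PLS with d = 1: beta_PLS = sigma (sigma' Sigma sigma)^{-1} sigma' sigma."""
    return sigma * (sigma @ sigma) / (sigma @ Sigma @ sigma)

sigma = Q[:, 2]                                # an eigenvector of Sigma
beta = np.linalg.solve(Sigma, sigma)           # true beta = Sigma^{-1} sigma
exact = np.allclose(pop_pls_d1(sigma, Sigma), beta)        # True when sigma is an eigenvector

sigma_mix = (Q[:, 0] + Q[:, 4]) / np.sqrt(2)   # mixes two eigenvectors (a d = 2 situation)
beta_mix = np.linalg.solve(Sigma, sigma_mix)
off = np.allclose(pop_pls_d1(sigma_mix, Sigma), beta_mix)  # False: d = 1 is no longer enough
```

When σ mixes two eigenvectors, the one-component population PLS no longer recovers β; this is exactly why d > 1 components (an envelope of d eigenvectors) are needed in general.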
More about constraints. MLE
Cook, Helland and Su (2013): informally, β involves only a few eigenvectors of Σ. Formally, there exists Γ ∈ R^{p×u}, with u ≤ p, such that the columns of Γ are u eigenvectors of Σ (not necessarily the first ones) and β = ΓU for some U ∈ R^{u×1}. Since β = Σ⁻¹σ, we have

    Σ = Γ Λ_Γ Γᵀ + Γ₀ Λ_{Γ₀} Γ₀ᵀ   (since the columns of Γ are eigenvectors of Σ)
    β = Γ(ΓᵀΣΓ)⁻¹Γᵀσ

Γ = S? Remember that β_PLS = S(SᵀΣS)⁻¹Sᵀσ.
They found the MLE for β_PLS and proved its consistency and efficiency for p fixed as n → ∞.
But the chemometrics community used it for p increasing!!!
Setting: p > n. The MLE does not exist. No hope?
The β̂_PLS algorithm works if d < min{p, n}, and works (pretty well) when n < p. Recall β̂_PLS = Ŝ(ŜᵀΣ̂Ŝ)⁻¹Ŝᵀσ̂.
In view of the apparent success that PLS has had in chemometrics and elsewhere, we might anticipate that it has reasonable statistical properties in high-dimensional regression.
But the statisticians did show up again... but with bad news
Chun and Keleş (2010) provided a piece of the puzzle by showing that, within a certain modeling framework, the PLS estimator of the coefficient vector in linear regression is inconsistent unless p/n → 0. They then used this as motivation for their development of a sparse version of PLS.
A dilemma
The Chun–Keleş result poses a dilemma. On the one hand, decades of experience support PLS as a useful method; on the other, its inconsistency when p/n → c > 0 casts doubt on its usefulness in high-dimensional regression, which is one of the contexts in which PLS undeniably stands out by virtue of its widespread application. There are several possible explanations for this conflict, including:
- consistency does not always signal the value of a method in practice,
- the literature is largely wrong about the value of PLS, and
- the modeling construct used by Chun and Keleş does not adequately reflect the range of applications in which PLS is employed.
The model in Chun and Keleş's paper
The model for x (the predictor) given y is

    x | y = μ_x + Θν_y + ω,   (4)

where ν_y ∈ R^d, ν_y ~ N(0, I_d), Θ ∈ R^{p×d}, ω ∈ R^p, ω ~ N(0, π²I_p).
As a consequence, x ⫫ y | Θᵀx, and thus the d linear combinations Θᵀx carry all of the information that x has about y.
The variance of x can be expressed as

    Σ = ΘΘᵀ + π²I_p = H(ΘᵀΘ + π²I_d)Hᵀ + π²Q_H,

where H = Θ(ΘᵀΘ)^{−1/2} is a semi-orthogonal basis matrix for span(Θ) and Q_H = I_p − HHᵀ projects onto its orthogonal complement.
Assumptions in Chun and Keleş's paper
They require the columns of Θ to be orthogonal, with bounded norms that converge as sequences. As a consequence Σ is bounded:

    Σ = ΘΘᵀ + π²I_p = H(ΘᵀΘ + π²I_d)Hᵀ + π²Q_H.

But in spectroscopy data it seems entirely plausible that notable signal comes from many wavelengths, not just a few. When this happens, many rows of Θ are non-zero in such a way that Σᵢ₌₁ᵖ ‖θᵢ‖² diverges, and we are outside Chun and Keleş's assumptions for inconsistency.
In conclusion, Chun and Keleş's paper effectively imposes sparsity to obtain inconsistency.
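A quick Monte Carlo check of the covariance identity above (a sketch with arbitrary small dimensions and a made-up Θ; only the identities for Σ are from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, pi, n = 4, 2, 0.7, 200_000
Theta = rng.normal(size=(p, d))                   # made-up loading matrix

# Draw x | y = mu_x + Theta nu_y + omega, with nu_y ~ N(0, I_d), omega ~ N(0, pi^2 I_p)
# (mu_x = 0 for simplicity; y enters only through nu_y, which does not affect var(x)).
nu = rng.normal(size=(n, d))
omega = pi * rng.normal(size=(n, p))
x = nu @ Theta.T + omega

Sigma_model = Theta @ Theta.T + pi ** 2 * np.eye(p)   # Sigma = Theta Theta' + pi^2 I_p
Sigma_hat = np.cov(x, rowvar=False)                   # Monte Carlo estimate, close for large n

# Second form: Sigma = H (Theta'Theta + pi^2 I_d) H' + pi^2 Q_H, H = Theta (Theta'Theta)^{-1/2}
evals, evecs = np.linalg.eigh(Theta.T @ Theta)
H = Theta @ evecs @ np.diag(evals ** -0.5) @ evecs.T
Q_H = np.eye(p) - H @ H.T
Sigma_alt = H @ (Theta.T @ Theta + pi ** 2 * np.eye(d)) @ H.T + pi ** 2 * Q_H
```

The algebraic rewriting `Sigma_alt` matches `Sigma_model` exactly, and the empirical covariance `Sigma_hat` approaches both as n grows.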
Wold, remember who he was? Sparsity vs. abundance
A quote from "Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection" by Wold, Kettaneh and Tjessem. Who is Wold?

"In situations with many variables, more than say 50 or 100, there is a strong temptation to drastically reduce the number of variables in the model. This temptation is further strengthened by the regression tradition to reduce the variables as far as possible to get the X matrix well conditioned. As discussed below, however, this reduction of variables often removes information, makes the interpretation misleading and increases the risk of spurious models. An often better alternative than variable reduction is to divide the variables into conceptually meaningful blocks and then apply hierarchical multi-block PLS (or PC) models. These ideas were presented by Wold, Martens and co-workers around 1986, but in rather obscure papers.

With multivariate projection models such as PLS and PCA, however, the situation is different. These methods work well also with many variables even when the number of observations, N, is small. In fact, the larger the number of relevant variables, the more precise are the scores t (and u in PLS), because they have the characteristics of weighted averages of all the X- or Y-variables, and an average is more precise the larger the number of elements forming the basis of the average. There is therefore no real need for keeping the number of variables small; only really unimportant variables should be deleted to stabilize the model and its predictions."
And the statisticians did show up again, now with good news
Let us assume we are under the same model

    x | y = μ_x + Θν_y + ω,   (5)

where ν_y ∈ R^d, ν_y ~ N(0, I_d), Θ ∈ R^{p×d}, ω ∈ R^p, ω ~ N(0, π²I_p). Again

    Σ = ΘΘᵀ + π²I_p = H(ΘᵀΘ + π²I_d)Hᵀ + π²Q_H.

Then the rate of convergence for the squared prediction error, using PLS, is of order

    p / ((Σᵢ₌₁ᵖ ‖θᵢ‖²) n).
Consequences
Order of convergence of the squared prediction error: p / ((Σᵢ₌₁ᵖ ‖θᵢ‖²) n).
- Chun and Keleş's case: Σᵢ₌₁ᵖ ‖θᵢ‖² bounded, so we have consistency only if p/n → 0.
- If Σᵢ₌₁ᵖ ‖θᵢ‖² ≍ p^α, the squared prediction error converges at rate p^{1−α}/n.
- When the maximum amount of information is accumulated (Σᵢ₌₁ᵖ ‖θᵢ‖² ≍ p), we recover the traditional √n-consistency.
More? Yes, we have a general consistency result (not only for the model presented in Chun and Keleş's paper). The consistency of the prediction and the rate of convergence depend, roughly, on the ratio between the information that new predictors contribute about y and the amount of noise they contribute.
Simulation, n = p/2

    x | y = μ_x + Θν_y + ω.   (6)

The columns of Θ were constructed to be orthogonal, with diagonal elements diag(ΘᵀΘ) = (4pᵃ, pᵃ) for a = 1/2, 3/4, 1, and diag(ΘᵀΘ) = (4c, c) = c(4p⁰, p⁰) with c constant.
The theoretical result for this case indicates D_N = O_p(√φ) with φ = p / (n Σᵢ₌₁ᵖ ‖θᵢ‖²). Here Σᵢ₌₁ᵖ ‖θᵢ‖² ≍ pᵃ with a = 1, 3/4, 1/2 and 0.
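The rate factor φ for this design can be tabulated directly (a small sketch; under diag(ΘᵀΘ) = (4pᵃ, pᵃ) we get Σᵢ‖θᵢ‖² = 5pᵃ, and n = p/2 as on the slide):

```python
def phi(p, a):
    """phi = p / (n * sum_i ||theta_i||^2), with n = p/2 and sum = 5 p^a,
    so phi = 2 / (5 p^a): shrinking for a > 0, constant for a = 0."""
    n = p / 2
    return p / (n * 5 * p ** a)

for a in (1.0, 0.75, 0.5, 0.0):
    print(a, [phi(p, a) for p in (100, 1000, 10000)])
```

So D_N = O_p(√φ) shrinks as p grows whenever a > 0, matching the abundant-signal story, and stays flat at 2/5 in the bounded (Chun–Keleş) case a = 0.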
Theoretical result: D_N = O_p(√(p / (n Σᵢ₌₁ᵖ ‖θᵢ‖²))). The only case without convergence is diag(ΘᵀΘ) ≍ c.
Tetracycline data
Goicoechea and Olivieri (1999) used PLS to develop a predictor of tetracycline concentration in human blood. The 50 training samples were constructed by spiking blank sera with various amounts of tetracycline in the range 0–4 µg ml⁻¹. A validation set of 57 samples was constructed in the same way. For each sample, the values of the predictors were determined by measuring fluorescence intensity at p = 101 equally spaced wavelengths. The authors determined, using leave-one-out cross-validation, that the best predictions of the training data were obtained with d = 4 linear combinations of the original 101 predictors.
Tetracycline data
We use these data to illustrate the behavior of PLS predictions in chemometrics as the number of predictors increases. We used PLS with d = 4 to predict the validation data based on p equally spaced spectral channels, with p ranging between 10 and 101. For each of those five values of p we computed the root mean squared error (RMSE).
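The experiment can be sketched on synthetic data (the real tetracycline spectra are not distributed with the slides, so the smooth "spectral" profile, the noise level, and the Krylov-form PLS fit below are all stand-ins, meant only to show error decreasing as informative channels accumulate):

```python
import numpy as np

def pls_fit(X, y, d):
    """Krylov-form PLS regression; returns a prediction function. Works for n < p."""
    mx, my = X.mean(axis=0), y.mean()
    Xc, yc = X - mx, y - my
    Sigma = Xc.T @ Xc / len(y)
    sigma = Xc.T @ yc / len(y)
    S = np.column_stack([np.linalg.matrix_power(Sigma, k) @ sigma
                         for k in range(d)])
    beta = S @ np.linalg.solve(S.T @ Sigma @ S, S.T @ sigma)
    return lambda Xnew: my + (Xnew - mx) @ beta

rng = np.random.default_rng(0)
p_full, n_train, n_val = 101, 50, 57
grid = np.linspace(0.0, 1.0, p_full)
profile = np.exp(-((grid - 0.5) ** 2) / 0.02)     # smooth synthetic spectrum
y_tr = rng.uniform(0.0, 4.0, n_train)             # "concentrations" in 0-4
y_va = rng.uniform(0.0, 4.0, n_val)
X_tr = np.outer(y_tr, profile) + 0.3 * rng.normal(size=(n_train, p_full))
X_va = np.outer(y_va, profile) + 0.3 * rng.normal(size=(n_val, p_full))

rmses = {}
for p in (10, 25, 50, 101):
    idx = np.linspace(0, p_full - 1, p).astype(int)   # p equally spaced channels
    predict = pls_fit(X_tr[:, idx], y_tr, d=4)
    rmses[p] = float(np.sqrt(np.mean((predict(X_va[:, idx]) - y_va) ** 2)))
    print(p, rmses[p])
```

Note that for p = 101 > n_train = 50 the sample covariance is singular, yet the d = 4 Krylov fit still works, which is the point of the example.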
RMSE for tetracycline data for different values of p
Relatively steep drop in RMSE for small p, say less than 30, and a slow but steady decrease in RMSE thereafter.
57 Thanks!
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57
More informationDimension Reduction Methods
Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationData Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.
TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin
More informationInternational Journal of Pure and Applied Mathematics Volume 19 No , A NOTE ON BETWEEN-GROUP PCA
International Journal of Pure and Applied Mathematics Volume 19 No. 3 2005, 359-366 A NOTE ON BETWEEN-GROUP PCA Anne-Laure Boulesteix Department of Statistics University of Munich Akademiestrasse 1, Munich,
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationLinear Methods for Prediction
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationSufficient Dimension Reduction for Longitudinally Measured Predictors
Sufficient Dimension Reduction for Longitudinally Measured Predictors Ruth Pfeiffer National Cancer Institute, NIH, HHS joint work with Efstathia Bura and Wei Wang TU Wien and GWU University JSM Vancouver
More informationFinal Exam. Economics 835: Econometrics. Fall 2010
Final Exam Economics 835: Econometrics Fall 2010 Please answer the question I ask - no more and no less - and remember that the correct answer is often short and simple. 1 Some short questions a) For each
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationRegression diagnostics
Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model
More informationChapter 3. Linear Models for Regression
Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear
More informationChapter 4: Factor Analysis
Chapter 4: Factor Analysis In many studies, we may not be able to measure directly the variables of interest. We can merely collect data on other variables which may be related to the variables of interest.
More informationA Significance Test for the Lasso
A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen June 6, 2013 1 Motivation Problem: Many clinical covariates which are important to a certain medical
More informationLECTURE NOTE #NEW 6 PROF. ALAN YUILLE
LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the
More informationLinear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation
More informationStatistics 910, #15 1. Kalman Filter
Statistics 910, #15 1 Overview 1. Summary of Kalman filter 2. Derivations 3. ARMA likelihoods 4. Recursions for the variance Kalman Filter Summary of Kalman filter Simplifications To make the derivations
More informationCPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017
CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code
More informationFactor Analysis (10/2/13)
STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.
More informationLECTURE NOTE #10 PROF. ALAN YUILLE
LECTURE NOTE #10 PROF. ALAN YUILLE 1. Principle Component Analysis (PCA) One way to deal with the curse of dimensionality is to project data down onto a space of low dimensions, see figure (1). Figure
More informationChemometrics: Classification of spectra
Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture
More information1 Regression with High Dimensional Data
6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:
More informationLinear regression. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda
Linear regression DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall15 Carlos Fernandez-Granda Linear models Least-squares estimation Overfitting Example:
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationMachine Learning (Spring 2012) Principal Component Analysis
1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationECE 275A Homework 6 Solutions
ECE 275A Homework 6 Solutions. The notation used in the solutions for the concentration (hyper) ellipsoid problems is defined in the lecture supplement on concentration ellipsoids. Note that θ T Σ θ =
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationRegression Models - Introduction
Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent
More information9.1 Orthogonal factor model.
36 Chapter 9 Factor Analysis Factor analysis may be viewed as a refinement of the principal component analysis The objective is, like the PC analysis, to describe the relevant variables in study in terms
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationIntroduction to Simple Linear Regression
Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department
More informationIntroduction to Maximum Likelihood Estimation
Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:
More informationMidterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.
CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic
More informationLectures on Simple Linear Regression Stat 431, Summer 2012
Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population
More informationKernel Methods. Machine Learning A W VO
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationReducedPCR/PLSRmodelsbysubspaceprojections
ReducedPCR/PLSRmodelsbysubspaceprojections Rolf Ergon Telemark University College P.O.Box 2, N-9 Porsgrunn, Norway e-mail: rolf.ergon@hit.no Published in Chemometrics and Intelligent Laboratory Systems
More informationLecture 6 Multiple Linear Regression, cont.
Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationTechniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods
Techniques for Dimensionality Reduction PCA and Other Matrix Factorization Methods Outline Principle Compoments Analysis (PCA) Example (Bishop, ch 12) PCA as a mixture model variant With a continuous latent
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More informationMultiple Linear Regression
Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from
More informationSparse Approximation and Variable Selection
Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation
More informationPartial factor modeling: predictor-dependent shrinkage for linear regression
modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationPRINCIPAL COMPONENTS ANALYSIS
121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves
More information