Inverse regression approach to (robust) non-linear high-to-low dimensional mapping

Transcription:

Inverse regression approach to (robust) non-linear high-to-low dimensional mapping
Emeline Perthame, joint work with Florence Forbes
INRIA, team MISTIS, Grenoble; LMNO, Caen
October 27, 2016

Outline
1. Non-linear mapping problem
2. GLLiM/SLLiM: inverse regression approach
3. Estimation of parameters
4. Results and conclusion

A non-linear mapping problem
Prediction of $X$ from $Y$ through a non-linear regression function $g$, mapping $y = (y_1, \dots, y_D)$ to $x = (x_1, \dots, x_L)$, with $Y \in \mathbb{R}^D$, $X \in \mathbb{R}^L$ and $D \gg L$:
$$E(X \mid Y = y) = g(y)$$

A non-linear mapping problem
Application: the OMEGA mission on Mars, a spectrometer launched around Mars (Mars Express - OMEGA, 2004) [http://geops.geol.u-psud.fr/]
Problem: retrieving physical properties from hyperspectral images
Y: spectrum (D = 184)
X: composition of the ground (L = 3): proportions of dust, CO2 ice and water ice
[Figure: reflectance as a function of wavelength for a Mars spectrum]

Some approaches
Difficulty: D is large, hence the curse of dimensionality.
Solutions via dimensionality reduction:
- Reduce the dimension of y before regression, e.g. PCA on y. Risk: poor prediction of x.
- Take x into account: PLS, SIR, kernel SIR, PC-based methods. These are two-step approaches, not expressed as a single optimization problem.
Our approach: inverse regression to reduce the dimension.

Proposed method: an inverse regression strategy
$x \in \mathbb{R}^L$ low-dimensional space, $y \in \mathbb{R}^D$ high-dimensional space; $(y, x)$ are realizations of $(Y, X) \sim p(Y, X; \theta)$, with $\theta$ the parameters.
Inverse conditional density $p(Y \mid X; \theta)$: Y is a noisy function of X, modeled via mixtures, with tractable estimation of $\theta$.
Forward conditional density $p(X \mid Y; \theta^*)$, with $\theta^* = f(\theta)$: high-to-low prediction, e.g. $\hat{X} = E[X \mid Y = y; \theta^*]$.

Student Locally-linear Mapping (SLLiM)
A piecewise affine model: introduce a missing variable Z; when $Z = k$, Y is the image of X by an affine transformation:
$$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k X + b_k + E_k)$$
Definition of SLLiM:
$$p(Y \mid X, Z = k; \theta) = \mathcal{S}(Y; A_k X + b_k, \Sigma_k, \alpha_k^y, \gamma_k^y)$$
Affine transformations are local: mixture of K Student distributions,
$$p(X \mid Z = k; \theta) = \mathcal{S}(X; c_k, \Gamma_k, \alpha_k, 1), \qquad p(Z = k; \theta) = \pi_k$$
The set of all model parameters is $\theta = \{\pi_k, c_k, \Gamma_k, A_k, b_k, \Sigma_k, \alpha_k,\ k = 1, \dots, K\}$.

Why a Student mixture? Dealing with outliers.
Generalized Student distribution for the joint density of (X, Y):
$$\mathcal{S}_M(y; \mu, \Sigma, \alpha, \gamma) = \frac{\Gamma(\alpha + M/2)}{|\Sigma|^{1/2}\,\Gamma(\alpha)\,(2\pi\gamma)^{M/2}}\,\bigl[1 + \delta(y, \mu, \Sigma)/(2\gamma)\bigr]^{-(\alpha + M/2)}$$
Gaussian scale mixture representation (using a weight variable U distributed according to a Gamma distribution):
$$\mathcal{S}_M(y; \mu, \Sigma, \alpha, \gamma) = \int_0^{\infty} \mathcal{N}_M(y; \mu, \Sigma/u)\,\mathcal{G}(u; \alpha, \gamma)\,du$$
Parameter estimation is tractable by an EM algorithm.
[Figure: univariate densities of a Gaussian and of a Student with α = 0.1, showing the Student's heavier tails]
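
As a quick illustration of the scale-mixture representation above, here is a minimal sketch (not from the talk) that draws from the generalized Student by first drawing the Gamma weight U and then a Gaussian with covariance Σ/u; it assumes G(u; α, γ) is parameterized by shape α and rate γ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_generalized_student(mu, Sigma, alpha, gamma, size):
    # U ~ Gamma(shape=alpha, rate=gamma), then Y | U = u ~ N(mu, Sigma / u)
    M = len(mu)
    u = rng.gamma(shape=alpha, scale=1.0 / gamma, size=size)
    z = rng.multivariate_normal(np.zeros(M), Sigma, size=size)
    return mu + z / np.sqrt(u)[:, None]

# Small alpha gives much heavier tails than a Gaussian, as in the slide's figure
samples = sample_generalized_student(np.zeros(2), np.eye(2), alpha=0.1, gamma=1.0, size=5000)
```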

Low-to-high (inverse) regression
If X and Y are both observed, the parameter vector θ can be estimated by an EM inference procedure. This yields the inverse conditional density, which is a Student mixture parameterized by θ:
$$p(Y \mid X; \theta) = \sum_{k=1}^{K} \frac{\pi_k\,\mathcal{S}(X; c_k, \Gamma_k, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\,\mathcal{S}(X; c_j, \Gamma_j, \alpha_j, 1)}\,\mathcal{S}(Y; A_k X + b_k, \Sigma_k, \alpha_k^y, \gamma_k^y)$$
and a low-to-high inverse regression function:
$$E[Y \mid X = x; \theta] = \sum_{k=1}^{K} \frac{\pi_k\,\mathcal{S}(x; c_k, \Gamma_k, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\,\mathcal{S}(x; c_j, \Gamma_j, \alpha_j, 1)}\,(A_k x + b_k).$$

High-to-low (forward) regression
The forward conditional density is a Student mixture as well, parameterized by the forward parameter vector θ*, which has an analytic expression as a function of θ:
$$p(X \mid Y; \theta^*) = \sum_{k=1}^{K} \frac{\pi_k\,\mathcal{S}(Y; c_k^*, \Gamma_k^*, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\,\mathcal{S}(Y; c_j^*, \Gamma_j^*, \alpha_j, 1)}\,\mathcal{S}(X; A_k^* Y + b_k^*, \Sigma_k^*, \alpha_k^x, \gamma_k^x)$$
and the high-to-low forward regression function is
$$E[X \mid Y = y; \theta^*] = \sum_{k=1}^{K} \frac{\pi_k\,\mathcal{S}(y; c_k^*, \Gamma_k^*, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\,\mathcal{S}(y; c_j^*, \Gamma_j^*, \alpha_j, 1)}\,(A_k^* y + b_k^*).$$
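
To make the forward prediction concrete, here is a minimal sketch (not from the talk) that evaluates the generalized Student density from the earlier slide and combines the local affine maps with the Student-mixture weights; the function names are illustrative and the starred parameters are assumed to be given.

```python
import numpy as np
from scipy.special import gammaln

def log_student(y, mu, Sigma, alpha, gamma=1.0):
    # Log of S_M(y; mu, Sigma, alpha, gamma) as defined on the Student-mixture slide
    M = len(mu)
    diff = y - mu
    delta = diff @ np.linalg.solve(Sigma, diff)        # Mahalanobis-type distance
    _, logdet = np.linalg.slogdet(Sigma)
    return (gammaln(alpha + M / 2) - gammaln(alpha) - 0.5 * logdet
            - (M / 2) * np.log(2 * np.pi * gamma)
            - (alpha + M / 2) * np.log1p(delta / (2 * gamma)))

def predict_x(y, pi, c_star, Gamma_star, alpha, A_star, b_star):
    # E[X | Y = y]: Student-mixture weights times the local affine maps A*_k y + b*_k
    K = len(pi)
    logw = np.array([np.log(pi[k]) + log_student(y, c_star[k], Gamma_star[k], alpha[k])
                     for k in range(K)])
    w = np.exp(logw - logw.max())
    w /= w.sum()                                       # normalized mixture weights
    return sum(w[k] * (A_star[k] @ y + b_star[k]) for k in range(K))
```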

The forward parameter vector θ* from θ:
$$c_k^* = A_k c_k + b_k, \qquad \Gamma_k^* = \Sigma_k + A_k \Gamma_k A_k^T, \qquad A_k^* = \Sigma_k^* A_k^T \Sigma_k^{-1},$$
$$b_k^* = \Sigma_k^* (\Gamma_k^{-1} c_k - A_k^T \Sigma_k^{-1} b_k), \qquad \Sigma_k^* = (\Gamma_k^{-1} + A_k^T \Sigma_k^{-1} A_k)^{-1}.$$
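
These closed-form expressions translate directly into code; the sketch below (illustrative only, plain numpy) maps the inverse-regression parameters of one component to their forward counterparts.

```python
import numpy as np

def forward_parameters(A, b, Sigma, c, Gamma):
    # Map (A_k, b_k, Sigma_k, c_k, Gamma_k) to (A*_k, b*_k, Sigma*_k, c*_k, Gamma*_k)
    Sigma_inv = np.linalg.inv(Sigma)
    Gamma_inv = np.linalg.inv(Gamma)
    Sigma_star = np.linalg.inv(Gamma_inv + A.T @ Sigma_inv @ A)   # Sigma*_k
    A_star = Sigma_star @ A.T @ Sigma_inv                         # A*_k
    b_star = Sigma_star @ (Gamma_inv @ c - A.T @ Sigma_inv @ b)   # b*_k
    c_star = A @ c + b                                            # c*_k
    Gamma_star = Sigma + A @ Gamma @ A.T                          # Gamma*_k
    return A_star, b_star, Sigma_star, c_star, Gamma_star
```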

A joint model approach to reduce the number of parameters
Joint model:
$$p(X = x, Y = y \mid Z = k) = \mathcal{S}_{L+D}\!\left(\begin{bmatrix} x \\ y \end{bmatrix}; m_k, V_k, \alpha_k, 1\right)$$
with
$$m_k = \begin{bmatrix} c_k \\ A_k c_k + b_k \end{bmatrix} \quad \text{and} \quad V_k = \begin{bmatrix} \Gamma_k & \Gamma_k A_k^T \\ A_k \Gamma_k & \Sigma_k + A_k \Gamma_k A_k^T \end{bmatrix}$$
Reducing the number of parameters to estimate:
- Forward strategy + diagonal $\Gamma_k$: nb. par. $= \frac{1}{2} D(D-1) + DL + 2L + D$; for D = 500, L = 2, 126,254 parameters
- Inverse strategy + diagonal $\Sigma_k$: nb. par. $= \frac{1}{2} L(L-1) + DL + 2D + L$; for D = 500, L = 2, 2,003 parameters
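
A quick sanity check of the two counts quoted above (per mixture component), using the formulas as stated on the slide:

```python
D, L = 500, 2
forward = D * (D - 1) // 2 + D * L + 2 * L + D   # forward strategy with diagonal constraint
inverse = L * (L - 1) // 2 + D * L + 2 * D + L   # inverse strategy with diagonal constraint
print(forward, inverse)                          # 126254 2003
```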

Extension to partially observed responses
Incorporate a latent component into the low-dimensional variable:
$$X = \begin{bmatrix} T \\ W \end{bmatrix}$$
where $T \in \mathbb{R}^{L_t}$ is observed and $W \in \mathbb{R}^{L_w}$ is latent ($L = L_t + L_w$).
Example on the Mars data: lighting? temperature? grain size?
Observed pairs $\{(y_n, T_n),\ n = 1, \dots, N\}$ with $T \in \mathbb{R}^{L_t}$; additional latent variable $W \in \mathbb{R}^{L_w}$.
Assuming the independence of T and W given Z:
$$p(X = (T, W) \mid Z = k) = \mathcal{S}_L\bigl((T, W); c_k, \Gamma_k, \alpha_k, 1\bigr) \quad \text{with} \quad c_k = \begin{bmatrix} c_k^t \\ 0 \end{bmatrix}, \quad \Gamma_k = \begin{bmatrix} \Gamma_k^t & 0 \\ 0 & I_{L_w} \end{bmatrix}$$

Extension to partially observed responses
Extension of SLLiM to a more general covariance structure. With $A_k = [A_k^t\ A_k^w]$,
$$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k^t T + A_k^w W + b_k + E_k)$$
rewrites as
$$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k^t T + b_k + E_k') \quad \text{with} \quad \mathrm{Var}(E_k') = \Sigma_k + A_k^w (A_k^w)^T$$
Diagonal $\Sigma_k$: factor analysis with $L_w$ factors (at most), a compromise between full $O(D^2)$ and diagonal $O(D)$ covariances.
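
A small illustration (not from the talk; dimensions are arbitrary) of the resulting low-rank-plus-diagonal covariance and of why it sits between the diagonal O(D) and full O(D^2) cases:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L_w = 184, 2                       # e.g. spectrum dimension and number of latent factors
A_w = rng.normal(size=(D, L_w))       # loadings A_k^w of the latent part W
sigma2 = rng.uniform(0.1, 1.0, D)     # diagonal entries of Sigma_k

cov = np.diag(sigma2) + A_w @ A_w.T   # Var(E_k') = Sigma_k + A_k^w (A_k^w)^T

# Free parameters: D + D * L_w, i.e. O(D), versus D * (D + 1) / 2 for a full covariance
print(D + D * L_w, "vs", D * (D + 1) // 2)
```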

Estimation of $\theta = (c_k, \Gamma_k, A_k, b_k, \Sigma_k, \pi_k, \alpha_k)_{1 \le k \le K}$ by an EM algorithm
E-step: update the posterior probabilities
- (E-Z) $p(Z = k \mid t, y, \theta^{(i)})$: Student mixture model (SMM)-like
- (E-W) $p(W \mid Z = k, t, y, \theta^{(i)})$: probabilistic PCA or factor analysis-like
- (E-U) $E(U \mid Z = k, t, y, \theta^{(i)})$: down-weights extreme/atypical values in the estimators, hence more robust
M-step:
- (M-X) $(\pi_k, c_k, \Gamma_k)$: SMM-like
- (M-Y|X) $(A_k, b_k, \Sigma_k)$: hybrid between linear regression and PPCA/FA, e.g.
$$\tilde{A}_k = \tilde{Y}_k X_k^T \left( \begin{bmatrix} 0 & 0 \\ 0 & S_k^w \end{bmatrix} + X_k X_k^T \right)^{-1}$$
- (M-α) $\alpha_k$: not in closed form but standard (specific to the Student case)

Application L = D = 1: RATP subway in Paris
Measurements of air quality at the Châtelet station, line 4, March 2015; N = 341 measurements.
Prediction of NO (L = 1) from NO2 (D = 1); robustness of SLLiM.
[Figure: scatter plot of NO against NO2]

Application L = D = 1 / SLLiM compared to GLLiM
[Figure: two panels showing the GLLiM and SLLiM fitted curves on the NO vs. NO2 data]
Illustration of the robustness of the proposed model.

Application L = D = 1 / SLLiM compared to GLLiM
[Figure: NRMSE as a function of K (1 to 10) for GLLiM, SLLiM, GLLiM-WO and SLLiM-WO]
SLLiM achieves better prediction accuracy than GLLiM on the complete data.
SLLiM becomes equivalent to GLLiM when outliers are removed.

Other applications and an augmented version of SLLiM
Application when D ≫ L: hyperspectral data on Mars (D = 184, L = 2, N = 6983).
Comparison with other non-linear regression methods.
Table: Mars data, average NRMSE (standard deviations in parentheses) for the proportions of CO2 ice and dust over 100 runs.

Method        | Prop. of CO2 ice | Prop. of dust
SLLiM (K=10)  | 0.168 (0.019)    | 0.145 (0.020)
GLLiM (K=10)  | 0.180 (0.023)    | 0.155 (0.023)
MARS          | 0.173 (0.016)    | 0.160 (0.021)
SIR           | 0.243 (0.025)    | 0.157 (0.016)
RVM           | 0.299 (0.021)    | 0.275 (0.034)
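
For reference, a minimal NRMSE helper (illustrative; the transcript does not spell out the exact normalization, so this assumes the common convention of dividing the RMSE by the standard deviation of the true values):

```python
import numpy as np

def nrmse(y_true, y_pred):
    # Root mean squared error normalized by the spread of the true values
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_pred - y_true) ** 2)) / np.std(y_true)
```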

Results - Application to hyperspectral image analysis
[Figure: maps of the predicted proportions of CO2 ice and dust obtained with GLLiM, SLLiM and splines]

Conclusion and future work
- Mixture model used for prediction
- Addition of latent variables for partially observed responses
- Selection of K and L_w: K fixed or selected by BIC? L_w selected by BIC?
Thank you for your attention! Any questions?