Inverse regression approach to (robust) non-linear high-to-low dimensional mapping

Transcription:

Inverse regression approach to (robust) non-linear high-to-low dimensional mapping
Emeline Perthame, joint work with Florence Forbes
INRIA, team MISTIS, Grenoble; LMNO, Caen
October 27, 2016

Outline
1. Non-linear mapping problem
2. GLLiM/SLLiM: inverse regression approach
3. Estimation of parameters
4. Results and conclusion

A non-linear mapping problem
Prediction of $X$ from $Y$ through a non-linear regression function $g$, mapping $y = (y_1, \dots, y_D)$ to $x = (x_1, \dots, x_L)$, with $Y \in \mathbb{R}^D$, $X \in \mathbb{R}^L$ and $D \gg L$:
$$E(X \mid Y = y) = g(y)$$

A non-linear mapping problem
Application: the OMEGA mission on Mars, a spectrometer launched around Mars (Mars Express - OMEGA, 2004) [http://geops.geol.u-psud.fr/]
Problem: retrieving physical properties from hyperspectral images
Y: spectrum (D = 184)
X: composition of the ground (L = 3): proportions of dust, CO2 ice and water ice
[Figure: reflectance as a function of wavelength for a Mars spectrum]

Some approaches
Difficulty: D is large, hence the curse of dimensionality.
Solutions via dimensionality reduction:
- Reduce the dimension of y before regression, e.g. PCA on y. Risk: poor prediction of x.
- Take x into account: PLS, SIR, kernel SIR, PC-based methods. These are two-step approaches, not expressed as a single optimization problem.
Our approach: inverse regression to reduce the dimension.

Proposed method: an inverse regression strategy
$x \in \mathbb{R}^L$ low-dimensional space, $y \in \mathbb{R}^D$ high-dimensional space; $(y, x)$ are realizations of $(Y, X) \sim p(Y, X; \theta)$, with $\theta$ the parameters.
Inverse conditional density $p(Y \mid X; \theta)$: Y is a noisy function of X, modeled via mixtures, with tractable estimation of $\theta$.
Forward conditional density $p(X \mid Y; \theta^*)$, with $\theta^* = f(\theta)$: high-to-low prediction, e.g. $\hat{X} = E[X \mid Y = y; \theta^*]$.

Student Locally-linear Mapping (SLLiM)
A piecewise affine model: introduce a missing variable Z; when $Z = k$, Y is the image of X by an affine transformation:
$$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k X + b_k + E_k)$$
Definition of SLLiM:
$$p(Y \mid X, Z = k; \theta) = \mathcal{S}(Y; A_k X + b_k, \Sigma_k, \alpha_k^y, \gamma_k^y)$$
Affine transformations are local: mixture of K Student distributions,
$$p(X \mid Z = k; \theta) = \mathcal{S}(X; c_k, \Gamma_k, \alpha_k, 1), \qquad p(Z = k; \theta) = \pi_k$$
The set of all model parameters is $\theta = \{\pi_k, c_k, \Gamma_k, A_k, b_k, \Sigma_k, \alpha_k,\ k = 1, \dots, K\}$.

Why a Student mixture? Dealing with outliers.
Generalized Student distribution for the joint density of (X, Y):
$$\mathcal{S}_M(y; \mu, \Sigma, \alpha, \gamma) = \frac{\Gamma(\alpha + M/2)}{|\Sigma|^{1/2}\,\Gamma(\alpha)\,(2\pi\gamma)^{M/2}}\,\bigl[1 + \delta(y, \mu, \Sigma)/(2\gamma)\bigr]^{-(\alpha + M/2)}$$
Gaussian scale mixture representation (using a weight variable U distributed according to a Gamma distribution):
$$\mathcal{S}_M(y; \mu, \Sigma, \alpha, \gamma) = \int_0^{\infty} \mathcal{N}_M(y; \mu, \Sigma/u)\,\mathcal{G}(u; \alpha, \gamma)\,du$$
Parameter estimation is tractable by an EM algorithm.
[Figure: univariate densities of a Gaussian and of a Student with α = 0.1, showing the Student's heavier tails]
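
As a quick illustration of the scale-mixture representation above, here is a minimal sketch (not from the talk) that draws from the generalized Student by first drawing the Gamma weight U and then a Gaussian with covariance Σ/u; it assumes G(u; α, γ) is parameterized by shape α and rate γ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_generalized_student(mu, Sigma, alpha, gamma, size):
    # U ~ Gamma(shape=alpha, rate=gamma), then Y | U = u ~ N(mu, Sigma / u)
    M = len(mu)
    u = rng.gamma(shape=alpha, scale=1.0 / gamma, size=size)
    z = rng.multivariate_normal(np.zeros(M), Sigma, size=size)
    return mu + z / np.sqrt(u)[:, None]

# Small alpha gives much heavier tails than a Gaussian, as in the slide's figure
samples = sample_generalized_student(np.zeros(2), np.eye(2), alpha=0.1, gamma=1.0, size=5000)
```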

Low-to-high (inverse) regression
If X and Y are both observed, the parameter vector θ can be estimated by an EM inference procedure. This yields the inverse conditional density, which is a Student mixture parameterized by θ:
$$p(Y \mid X; \theta) = \sum_{k=1}^{K} \frac{\pi_k\,\mathcal{S}(X; c_k, \Gamma_k, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\,\mathcal{S}(X; c_j, \Gamma_j, \alpha_j, 1)}\,\mathcal{S}(Y; A_k X + b_k, \Sigma_k, \alpha_k^y, \gamma_k^y)$$
and a low-to-high inverse regression function:
$$E[Y \mid X = x; \theta] = \sum_{k=1}^{K} \frac{\pi_k\,\mathcal{S}(x; c_k, \Gamma_k, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\,\mathcal{S}(x; c_j, \Gamma_j, \alpha_j, 1)}\,(A_k x + b_k).$$

High-to-low (forward) regression
The forward conditional density is a Student mixture as well, parameterized by the forward parameter vector θ*, which has an analytic expression as a function of θ:
$$p(X \mid Y; \theta^*) = \sum_{k=1}^{K} \frac{\pi_k\,\mathcal{S}(Y; c_k^*, \Gamma_k^*, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\,\mathcal{S}(Y; c_j^*, \Gamma_j^*, \alpha_j, 1)}\,\mathcal{S}(X; A_k^* Y + b_k^*, \Sigma_k^*, \alpha_k^x, \gamma_k^x)$$
and the high-to-low forward regression function is
$$E[X \mid Y = y; \theta^*] = \sum_{k=1}^{K} \frac{\pi_k\,\mathcal{S}(y; c_k^*, \Gamma_k^*, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\,\mathcal{S}(y; c_j^*, \Gamma_j^*, \alpha_j, 1)}\,(A_k^* y + b_k^*).$$
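
To make the forward prediction concrete, here is a minimal sketch (not from the talk) that evaluates the generalized Student density from the earlier slide and combines the local affine maps with the Student-mixture weights; the function names are illustrative and the starred parameters are assumed to be given.

```python
import numpy as np
from scipy.special import gammaln

def log_student(y, mu, Sigma, alpha, gamma=1.0):
    # Log of S_M(y; mu, Sigma, alpha, gamma) as defined on the Student-mixture slide
    M = len(mu)
    diff = y - mu
    delta = diff @ np.linalg.solve(Sigma, diff)        # Mahalanobis-type distance
    _, logdet = np.linalg.slogdet(Sigma)
    return (gammaln(alpha + M / 2) - gammaln(alpha) - 0.5 * logdet
            - (M / 2) * np.log(2 * np.pi * gamma)
            - (alpha + M / 2) * np.log1p(delta / (2 * gamma)))

def predict_x(y, pi, c_star, Gamma_star, alpha, A_star, b_star):
    # E[X | Y = y]: Student-mixture weights times the local affine maps A*_k y + b*_k
    K = len(pi)
    logw = np.array([np.log(pi[k]) + log_student(y, c_star[k], Gamma_star[k], alpha[k])
                     for k in range(K)])
    w = np.exp(logw - logw.max())
    w /= w.sum()                                       # normalized mixture weights
    return sum(w[k] * (A_star[k] @ y + b_star[k]) for k in range(K))
```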

The forward parameter vector θ* from θ:
$$c_k^* = A_k c_k + b_k, \qquad \Gamma_k^* = \Sigma_k + A_k \Gamma_k A_k^T, \qquad A_k^* = \Sigma_k^* A_k^T \Sigma_k^{-1},$$
$$b_k^* = \Sigma_k^* (\Gamma_k^{-1} c_k - A_k^T \Sigma_k^{-1} b_k), \qquad \Sigma_k^* = (\Gamma_k^{-1} + A_k^T \Sigma_k^{-1} A_k)^{-1}.$$
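
These closed-form expressions translate directly into code; the sketch below (illustrative only, plain numpy) maps the inverse-regression parameters of one component to their forward counterparts.

```python
import numpy as np

def forward_parameters(A, b, Sigma, c, Gamma):
    # Map (A_k, b_k, Sigma_k, c_k, Gamma_k) to (A*_k, b*_k, Sigma*_k, c*_k, Gamma*_k)
    Sigma_inv = np.linalg.inv(Sigma)
    Gamma_inv = np.linalg.inv(Gamma)
    Sigma_star = np.linalg.inv(Gamma_inv + A.T @ Sigma_inv @ A)   # Sigma*_k
    A_star = Sigma_star @ A.T @ Sigma_inv                         # A*_k
    b_star = Sigma_star @ (Gamma_inv @ c - A.T @ Sigma_inv @ b)   # b*_k
    c_star = A @ c + b                                            # c*_k
    Gamma_star = Sigma + A @ Gamma @ A.T                          # Gamma*_k
    return A_star, b_star, Sigma_star, c_star, Gamma_star
```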

A joint model approach to reduce the number of parameters
Joint model:
$$p(X = x, Y = y \mid Z = k) = \mathcal{S}_{L+D}\!\left(\begin{bmatrix} x \\ y \end{bmatrix}; m_k, V_k, \alpha_k, 1\right)$$
with
$$m_k = \begin{bmatrix} c_k \\ A_k c_k + b_k \end{bmatrix} \quad \text{and} \quad V_k = \begin{bmatrix} \Gamma_k & \Gamma_k A_k^T \\ A_k \Gamma_k & \Sigma_k + A_k \Gamma_k A_k^T \end{bmatrix}$$
Reducing the number of parameters to estimate:
- Forward strategy + diagonal $\Gamma_k$: nb. par. $= \frac{1}{2} D(D-1) + DL + 2L + D$; for D = 500, L = 2, 126,254 parameters
- Inverse strategy + diagonal $\Sigma_k$: nb. par. $= \frac{1}{2} L(L-1) + DL + 2D + L$; for D = 500, L = 2, 2,003 parameters
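
A quick sanity check of the two counts quoted above (per mixture component), using the formulas as stated on the slide:

```python
D, L = 500, 2
forward = D * (D - 1) // 2 + D * L + 2 * L + D   # forward strategy with diagonal constraint
inverse = L * (L - 1) // 2 + D * L + 2 * D + L   # inverse strategy with diagonal constraint
print(forward, inverse)                          # 126254 2003
```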

Extension to partially observed responses
Incorporate a latent component into the low-dimensional variable:
$$X = \begin{bmatrix} T \\ W \end{bmatrix}$$
where $T \in \mathbb{R}^{L_t}$ is observed and $W \in \mathbb{R}^{L_w}$ is latent ($L = L_t + L_w$).
Example on the Mars data: lighting? temperature? grain size?
Observed pairs $\{(y_n, T_n),\ n = 1, \dots, N\}$ with $T \in \mathbb{R}^{L_t}$; additional latent variable $W \in \mathbb{R}^{L_w}$.
Assuming the independence of T and W given Z:
$$p(X = (T, W) \mid Z = k) = \mathcal{S}_L\bigl((T, W); c_k, \Gamma_k, \alpha_k, 1\bigr) \quad \text{with} \quad c_k = \begin{bmatrix} c_k^t \\ 0 \end{bmatrix}, \quad \Gamma_k = \begin{bmatrix} \Gamma_k^t & 0 \\ 0 & I_{L_w} \end{bmatrix}$$

Extension to partially observed responses
Extension of SLLiM to a more general covariance structure. With $A_k = [A_k^t\ A_k^w]$,
$$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k^t T + A_k^w W + b_k + E_k)$$
rewrites as
$$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k^t T + b_k + E_k') \quad \text{with} \quad \mathrm{Var}(E_k') = \Sigma_k + A_k^w (A_k^w)^T$$
Diagonal $\Sigma_k$: factor analysis with $L_w$ factors (at most), a compromise between full $O(D^2)$ and diagonal $O(D)$ covariances.
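
A small illustration (not from the talk; dimensions are arbitrary) of the resulting low-rank-plus-diagonal covariance and of why it sits between the diagonal O(D) and full O(D^2) cases:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L_w = 184, 2                       # e.g. spectrum dimension and number of latent factors
A_w = rng.normal(size=(D, L_w))       # loadings A_k^w of the latent part W
sigma2 = rng.uniform(0.1, 1.0, D)     # diagonal entries of Sigma_k

cov = np.diag(sigma2) + A_w @ A_w.T   # Var(E_k') = Sigma_k + A_k^w (A_k^w)^T

# Free parameters: D + D * L_w, i.e. O(D), versus D * (D + 1) / 2 for a full covariance
print(D + D * L_w, "vs", D * (D + 1) // 2)
```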

Estimation of $\theta = (c_k, \Gamma_k, A_k, b_k, \Sigma_k, \pi_k, \alpha_k)_{1 \le k \le K}$ by an EM algorithm
E-step: update the posterior probabilities
- (E-Z) $p(Z = k \mid t, y, \theta^{(i)})$: Student mixture model (SMM)-like
- (E-W) $p(W \mid Z = k, t, y, \theta^{(i)})$: probabilistic PCA or factor analysis-like
- (E-U) $E(U \mid Z = k, t, y, \theta^{(i)})$: down-weights extreme/atypical values in the estimators, hence more robust
M-step:
- (M-X) $(\pi_k, c_k, \Gamma_k)$: SMM-like
- (M-Y|X) $(A_k, b_k, \Sigma_k)$: hybrid between linear regression and PPCA/FA, e.g.
$$\tilde{A}_k = \tilde{Y}_k X_k^T \left( \begin{bmatrix} 0 & 0 \\ 0 & S_k^w \end{bmatrix} + X_k X_k^T \right)^{-1}$$
- (M-α) $\alpha_k$: not in closed form but standard (specific to the Student case)

Application L = D = 1: RATP subway in Paris
Measurements of air quality at the Châtelet station, line 4, March 2015; N = 341 measurements.
Prediction of NO (L = 1) from NO2 (D = 1); robustness of SLLiM.
[Figure: scatter plot of NO against NO2]

Application L = D = 1 / SLLiM compared to GLLiM
[Figure: two panels showing the GLLiM and SLLiM fitted curves on the NO vs. NO2 data]
Illustration of the robustness of the proposed model.

Application L = D = 1 / SLLiM compared to GLLiM
[Figure: NRMSE as a function of K (1 to 10) for GLLiM, SLLiM, GLLiM-WO and SLLiM-WO]
SLLiM achieves better prediction accuracy than GLLiM on the complete data.
SLLiM becomes equivalent to GLLiM when outliers are removed.

Other applications and an augmented version of SLLiM
Application when D ≫ L: hyperspectral data on Mars (D = 184, L = 2, N = 6983).
Comparison with other non-linear regression methods.
Table: Mars data, average NRMSE (standard deviations in parentheses) for the proportions of CO2 ice and dust over 100 runs.

Method        | Prop. of CO2 ice | Prop. of dust
SLLiM (K=10)  | 0.168 (0.019)    | 0.145 (0.020)
GLLiM (K=10)  | 0.180 (0.023)    | 0.155 (0.023)
MARS          | 0.173 (0.016)    | 0.160 (0.021)
SIR           | 0.243 (0.025)    | 0.157 (0.016)
RVM           | 0.299 (0.021)    | 0.275 (0.034)
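
For reference, a minimal NRMSE helper (illustrative; the transcript does not spell out the exact normalization, so this assumes the common convention of dividing the RMSE by the standard deviation of the true values):

```python
import numpy as np

def nrmse(y_true, y_pred):
    # Root mean squared error normalized by the spread of the true values
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_pred - y_true) ** 2)) / np.std(y_true)
```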

Results - Application to hyperspectral image analysis
[Figure: maps of the predicted proportions of CO2 ice and dust obtained with GLLiM, SLLiM and splines]

Conclusion and future work
- Mixture model used for prediction
- Addition of latent variables for partially observed responses
- Selection of K and L_w: K fixed or selected by BIC? L_w selected by BIC?
Thank you for your attention! Any questions?