Robust cartogram visualization of outliers in manifold learning


1 Robust cartogram visualization of outliers in manifold learning. Alessandra Tosi and Alfredo Vellido, LSI Department, UPC, Barcelona

2–3 Table of Contents: 1. Introduction: Goals; 2. NLDR methods: Generative Topographic Mapping, Distortion measures in NLDR: Magnification Factor, Cartogram-based representation; 3. Cartograms representations for GTM and its variants: Results

4–8 Goals
PROBLEM: an increasing amount of high-dimensional data sets is available, with different levels of complexity and a growing diversity of characteristics.
CHALLENGE: translating raw data into useful information that can be acted upon in practical terms.
Nonlinear Dimensionality Reduction: nonlinear techniques are applied to reduce the dimensionality of the data in order to explore multivariate data (MVD). It is almost impossible to completely avoid geometrical distortions while reducing dimensionality.
Distortion Measures: quantify and visualize this distortion itself in order to interpret the data more faithfully.
Visualization: explicitly reintroduce the local distortion created by NLDR models into the low-dimensional representation of the MVD that they produce.

9 Table of Contents: 2. NLDR methods: Generative Topographic Mapping; Distortion measures in NLDR: Magnification Factor; Cartogram-based representation

10–13 NLDR methods for MVD visualization
To successfully analyse real data, more complex models are often required: Nonlinear Dimensionality Reduction (NLDR) models.
Manifold learning attempts to describe MVD through nonlinear low-dimensional manifolds embedded in the observed data space. The aim is to discover the underlying geometry of the data, preserving its topology rather than pairwise distances, while generating a low-dimensional model.
Latent Variable Models provide an additional set of variables (latent or hidden variables) alongside the observed ones.
Vector quantization reduces the number of observations by replacing the original data with a smaller set of vectors of the same dimension, called prototypes (units, neurons, centroids, weight vectors); a minimal sketch follows below.
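As an illustration of the vector quantization idea only (a k-means-style sketch with toy data, not the GTM training procedure; all names and values are illustrative assumptions):

```python
import numpy as np

def vector_quantize(X, K, n_iter=50, seed=0):
    """Replace the N observations in X (N x D) with K prototypes (K x D)
    by alternating nearest-prototype assignment and mean updates."""
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # assign each observation to its nearest prototype
        d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each prototype to the mean of its assigned observations
        for k in range(K):
            if np.any(labels == k):
                prototypes[k] = X[labels == k].mean(axis=0)
    return prototypes

X = np.random.default_rng(1).normal(size=(500, 3))  # toy 3-D multivariate data
prototypes = vector_quantize(X, K=16)               # 500 observations -> 16 prototypes
```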

14–16 Generative Topographic Mapping (GTM)
The Generative Topographic Mapping (GTM) is a nonlinear Latent Variable Model developed by Bishop, Svensén and Williams in the late nineties.
Basic GTM defines a Gaussian probability distribution in the latent space in order to induce the corresponding probability distribution in the observed data space, using concepts of Bayesian inference. The images of sampled latent points, or prototypes, are defined according to the rule $y_k = W \Phi(u_k)$.
The basic GTM model has some limitations when dealing with atypical data or outliers, as they are likely to bias the estimation of its parameters. More robust formulations of GTM have been proposed using a mixture of Student's t-distributions (t-GTM).
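The mapping rule $y_k = W \Phi(u_k)$ can be made concrete with a short numpy sketch; the grid sizes, Gaussian basis functions, and the untrained, randomly initialised weight matrix W below are illustrative assumptions, not a fitted GTM:

```python
import numpy as np

def rbf_basis(U, centres, sigma):
    """Phi: K x M matrix of Gaussian basis functions phi_m(u_k)."""
    d2 = ((U[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

grid = np.linspace(-1.0, 1.0, 10)
U = np.array([(a, b) for a in grid for b in grid])            # K = 100 latent points u_k
centres = np.array([(a, b) for a in np.linspace(-1, 1, 4)
                            for b in np.linspace(-1, 1, 4)])  # M = 16 basis centres
Phi = rbf_basis(U, centres, sigma=0.5)                        # K x M design matrix
W = np.random.default_rng(0).normal(size=(3, 16))             # D x M, untrained weights
Y = Phi @ W.T                                                 # row k is y_k = W Phi(u_k)
```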

17–18 Magnification Factor
$$\frac{dA'}{dA} = \sqrt{\det\!\left(J J^{\top}\right)}$$
where $J$ is the Jacobian (of dimension $2 \times D$) of the mapping transformation.
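When no closed form for $J$ is available, the MF can be approximated numerically; a minimal sketch, assuming a smooth map f from the 2-D latent space to R^D (the toy f below is an illustrative assumption):

```python
import numpy as np

def magnification_factor(f, u, eps=1e-5):
    """dA'/dA = sqrt(det(J J^T)) at latent point u, with the 2 x D Jacobian J
    of f estimated by central finite differences."""
    J = np.empty((2, f(u).size))
    for i in range(2):
        du = np.zeros(2)
        du[i] = eps
        J[i] = (f(u + du) - f(u - du)) / (2.0 * eps)
    return np.sqrt(np.linalg.det(J @ J.T))

# toy nonlinear embedding of the plane into R^3
f = lambda u: np.array([u[0], u[1], np.sin(u[0]) * np.cos(u[1])])
mf = magnification_factor(f, np.array([0.3, -0.2]))
```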

19–20 Cartograms
[Diagram: a density-equalizing cartogram transformation $T$ maps each map point $(x_1, x_2)$ to $(T_{x_1}, T_{x_2})$.]

21–25 Cartograms representations for NLDR methods
We propose a cartogram-based method in which:
- the political borders of geographic maps are replaced by the square grid of latent points $u_k$ in the visualization space;
- map-underlying quantities, such as population density, are replaced by the Magnification Factor;
- the level of distortion within each of the squares associated with $u_k$ is assumed to be uniform;
- the level of distortion in the space beyond this square grid is assumed to be uniform and equal to the mean distortion over the complete map, that is $\frac{1}{K}\sum_{k=1}^{K} J(u_k)$, where $J$ is the Jacobian of the transformation of the considered method (see the density-grid sketch below).
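A minimal sketch of the density grid this construction would feed to a density-equalizing (Gastner-Newman style) cartogram algorithm; the grid size, padding width, and MF values are illustrative assumptions, and the diffusion step itself is left to an external cartogram implementation:

```python
import numpy as np

def cartogram_density(mf_values, pad=4):
    """Build the 'population density' grid for a diffusion cartogram:
    one (uniform) MF value per latent-grid square, with the surrounding
    sea fixed at the mean distortion over the complete map."""
    side = int(np.sqrt(mf_values.size))     # K latent points on a side x side grid
    grid = mf_values.reshape(side, side)    # distortion assumed uniform per square
    sea = mf_values.mean()                  # (1/K) sum over k of the distortion values
    return np.pad(grid, pad, constant_values=sea)

density = cartogram_density(np.random.rand(100))   # toy MF values, K = 100
# 'density' would then be equalized by a diffusion cartogram routine.
```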

26–27 Cartograms representations for NLDR methods
GOAL: to better visualize the embedded model manifold, expecting inter-point distances in the observed data space to be more faithfully reflected in the low-dimensional representation space.
An advantage of this cartogram-based method is its portability: it should be easy to implement for different representation architectures and with alternative NLDR visualization techniques for which distortion can be quantified.

28 Table of Contents: 3. Cartograms representations for GTM and its variants; Results

29 Cartograms representations for t-GTM
In the following experiments we investigate the impact of outliers on the visualization, using both basic GTM and t-GTM.

30–31 Cartograms representations for GTM
Calculate, over the continuum, the Jacobian $J$ of the mapping transformation in the basic GTM algorithm, in terms of the derivatives of the basis functions $\Phi$, and apply the Magnification Factor (MF) formula $\frac{dA'}{dA} = \sqrt{\det(J J^{\top})}$.
Basic GTM:
$$\frac{dA'}{dA} = \sqrt{\det\!\left(\Psi^{\top} W^{\top} W \Psi\right)}$$
where $\Psi$ is an $M \times 2$ matrix with elements $\psi_{mi} = \partial \phi_m / \partial u_i$, $m = 1,\dots,M$, $i = 1,2$.
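The closed form is easy to evaluate for Gaussian bases $\phi_m(u) = \exp(-\lVert u - \mu_m \rVert^2 / 2\sigma^2)$, whose derivatives are $\psi_{mi} = -(u_i - \mu_{m,i})\,\phi_m(u)/\sigma^2$; a sketch in which the centres, width, and untrained W are illustrative assumptions:

```python
import numpy as np

def gtm_mf(u, centres, sigma, W):
    """Analytic MF for basic GTM at latent point u:
    sqrt(det(Psi^T W^T W Psi)), Psi being M x 2 with psi_mi = d phi_m / d u_i."""
    phi = np.exp(-((u - centres) ** 2).sum(axis=1) / (2.0 * sigma ** 2))  # M values
    Psi = -(u - centres) / sigma ** 2 * phi[:, None]                      # M x 2
    return np.sqrt(np.linalg.det(Psi.T @ W.T @ W @ Psi))                  # 2 x 2 det

centres = np.array([(a, b) for a in np.linspace(-1, 1, 4)
                            for b in np.linspace(-1, 1, 4)])  # M = 16 basis centres
W = np.random.default_rng(0).normal(size=(3, 16))             # untrained D x M weights
mf = gtm_mf(np.array([0.1, 0.2]), centres, sigma=0.5, W=W)
```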

32–33 Cartograms representations for t-GTM
The conditional distribution of the observed data variables given the latent variables, $p(\mathbf{x} \mid \mathbf{u})$, takes the following form for t-GTM:
$$p(\mathbf{x} \mid \mathbf{u}, W, \beta, \nu) = \frac{\Gamma\!\left(\frac{\nu+D}{2}\right)\beta^{D/2}}{\Gamma\!\left(\frac{\nu}{2}\right)(\nu\pi)^{D/2}} \left(1 + \frac{\beta}{\nu}\,\lVert \mathbf{x} - y(\mathbf{u}) \rVert^{2}\right)^{-\frac{\nu+D}{2}} \qquad (1)$$
To implement the Magnification Factor, we explicitly calculate the Jacobian $J = \Psi^{\top} W^{\top}$, where $\Psi$ is an $M \times 2$ matrix with elements $\psi_{mi} = \partial \phi_m / \partial u_i$, defined for t-GTM as:
$$\frac{\partial \phi_m}{\partial u_i} = \frac{\Gamma\!\left(\frac{\nu+D}{2}\right)(-\nu-D)\,\beta^{\frac{D+2}{2}}}{\Gamma\!\left(\frac{\nu}{2}\right)\pi^{D/2}\,\nu^{\frac{D+2}{2}}}\,(u_i - \mu_i^{m})\left(1 + \frac{\beta}{\nu}\,\lVert \mathbf{u} - \boldsymbol{\mu}_m \rVert^{2}\right)^{-\frac{\nu+D+2}{2}} \qquad (2)$$
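A literal transcription of Eq. (2) into code, under the assumption that the $\phi_m$ are Student-t bases centred at $\mu_m$ on the latent space (so D = 2 below); beta, nu, the centres, and W are illustrative values, not fitted parameters:

```python
import numpy as np
from math import gamma, pi

def t_basis_derivatives(u, centres, beta, nu):
    """Psi (M x 2) with psi_mi = d phi_m / d u_i as in Eq. (2)."""
    D = u.size  # dimension of the space the bases live on (2-D latent space here)
    coef = (gamma((nu + D) / 2) * (-nu - D) * beta ** ((D + 2) / 2)
            / (gamma(nu / 2) * pi ** (D / 2) * nu ** ((D + 2) / 2)))
    r2 = ((u - centres) ** 2).sum(axis=1)                  # ||u - mu_m||^2, M values
    radial = (1.0 + beta / nu * r2) ** (-(nu + D + 2) / 2)
    return coef * (u - centres) * radial[:, None]          # rows: (u_i - mu_i^m) terms

centres = np.array([(a, b) for a in np.linspace(-1, 1, 4)
                            for b in np.linspace(-1, 1, 4)])
Psi = t_basis_derivatives(np.array([0.1, 0.2]), centres, beta=1.0, nu=3.0)
W = np.random.default_rng(0).normal(size=(3, 16))
mf = np.sqrt(np.linalg.det(Psi.T @ W.T @ W @ Psi))         # MF as for basic GTM
```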

34 [Figure] Representation of the data together with the manifold grid (GTM on the left, t-GTM on the right).

35–36 [Figures] Representation of the MF maps and corresponding cartograms (GTM on the left, t-GTM on the right).

37 [Figure] Representation of the data together with the manifold grid (GTM on the left, t-GTM on the right).

38–39 [Figures] Representation of the MF maps and corresponding cartograms (GTM on the left, t-GTM on the right).

40 Useful Links
Cartograms software
SOM Toolbox for MATLAB
Netlab for MATLAB

41 A short bibliography
- M. Aupetit, Visualizing distortions and recovering topology in continuous projection techniques, Neurocomputing 70(7–9), 2007.
- C.M. Bishop, M. Svensén and C.K.I. Williams, Magnification factors for the SOM and GTM algorithms, Proceedings of the Workshop on Self-Organizing Maps (WSOM'97), June 4–6, Helsinki, Finland, 1997.
- M.T. Gastner and M.E.J. Newman, Diffusion-based method for producing density-equalizing maps, Proceedings of the National Academy of Sciences of the United States of America 101(20), 2004.
- A. Tosi and A. Vellido, Cartogram representation of the batch-SOM magnification factor, Proceedings of the European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, 2012.
- A. Vellido, Missing data imputation through GTM as a mixture of t-distributions, Neural Networks 19(10), 2006.
- A. Vellido, Assessment of an unsupervised feature selection method for Generative Topographic Mapping, 16th International Conference on Artificial Neural Networks (ICANN), Athens, Greece, LNCS Vol. 4132, 2006.
- A. Vellido, P.J.G. Lisboa and D. Vicente, Robust analysis of MRS brain tumour data using t-GTM, Neurocomputing 69(7–9), 2006.
- A. Vellido, J.D. Martín, F. Rossi and P.J.G. Lisboa, Seeing is believing: the importance of visualization in real-world machine learning applications, in M. Verleysen, editor, Proceedings of the European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, 2011.
- A. Vellido, J.D. Martín-Guerrero and P.J.G. Lisboa, Making machine learning models interpretable, Proceedings of the European Symposium on Artificial Neural Networks (ESANN), 2012.

42 THANK YOU - QUESTIONS?
Alessandra Tosi - atosi@lsi.upc.edu
Alfredo Vellido - avellido@lsi.upc.edu
