Linear and Non-Linear Dimensionality Reduction


Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.2015 and 7.5.2015

Overview
Dimensionality Reduction: Motivation; Linear Projections; Linear Mappings in Feature Space; Neighbor Embedding; Manifold Learning
Advances in Dimensionality Reduction: Speedup for Neighbor Embeddings; Quality Assessment of DR; Feature Relevance for DR; Visualization of Classifiers; Supervised Dimensionality Reduction

Curse of Dimensionality [Lee and Verleysen, 2007]: high-dimensional spaces are almost empty; the hypervolume concentrates in a thin shell close to the surface.

Why use Dimensionality Reduction? [slide: an image shown as its raw grid of pixel intensity values, e.g. 81 34 3 14 247 153 ...]

Principal Component Analysis (PCA): variance maximization
maximize $\operatorname{var}(w^\top x_i)$ subject to $\|w\| = 1$
$\operatorname{var}(w^\top x_i) = \frac{1}{N}\sum_i (w^\top x_i)^2 = \frac{1}{N}\sum_i w^\top x_i x_i^\top w = w^\top C w$
Eigenvectors of the covariance matrix $C$ are optimal.
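A minimal numpy sketch of PCA as stated above (center the data, eigendecompose the covariance matrix $C$, project onto the leading eigenvectors); the function name and the two-component default are illustrative choices, not part of the slides.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (N x D) onto the top principal components.

    Variance maximization: the optimal directions w are the
    eigenvectors of the covariance matrix C = (1/N) X_c^T X_c.
    """
    X_c = X - X.mean(axis=0)                 # center the data
    C = (X_c.T @ X_c) / X_c.shape[0]         # covariance matrix
    eigval, eigvec = np.linalg.eigh(C)       # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :n_components]    # leading eigenvectors
    return X_c @ W                           # low-dimensional projection
```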

PCA in Action [figure; Gisbrecht and Hammer, 2015]

Distance Preservation (CMDS)
distances $\delta_{ij} = \|x_i - x_j\|^2$, $d_{ij} = \|y_i - y_j\|^2$; objective: $\delta_{ij} \approx d_{ij}$
distances $d_{ij}$ and similarities $s_{ij}$ can be transformed into each other; $S = U \Lambda U^\top$ is the matrix of pairwise similarities
the best low-rank approximation of $S$ in the Frobenius norm is $\hat S = \hat U \hat\Lambda \hat U^\top$, built from the largest eigenvalues
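A short numpy sketch of classical MDS along these lines, assuming a matrix of squared pairwise distances as input; double centering yields the similarity matrix whose best low-rank approximation gives the embedding.

```python
import numpy as np

def classical_mds(D2, n_components=2):
    """Classical MDS from squared pairwise distances D2 (N x N)."""
    N = D2.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    S = -0.5 * J @ D2 @ J                    # similarity (Gram) matrix
    eigval, eigvec = np.linalg.eigh(S)
    idx = np.argsort(eigval)[::-1][:n_components]
    # embedding from the largest eigenvalues / eigenvectors of S
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))
```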

Manifold Learning: learn a linear manifold; represent the data as projections onto an unknown direction $w$: $C = \frac{1}{2N}\sum_i \|x_i - y_i w\|^2$. What are the best parameters $y_i$ and $w$?

PCA: three ways to obtain PCA: maximize the variance of a linear projection; preserve pairwise distances; find a linear manifold such that the reconstruction errors are minimal in an L2 sense.

Kernel PCA [Schölkopf et al., 1998]: apply a fixed nonlinear preprocessing $\varphi(x)$ and perform standard PCA in feature space. How to apply the kernel trick here?
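A minimal sketch of kernel PCA, assuming an RBF kernel with a hand-picked gamma (both illustrative assumptions): work with the kernel matrix instead of explicit feature vectors, center it, and take its leading eigenvectors.

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA with an RBF kernel (illustrative kernel choice)."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                            # centering in feature space
    eigval, eigvec = np.linalg.eigh(Kc)
    idx = np.argsort(eigval)[::-1][:n_components]
    # projections of the training points onto the kernel principal components
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))
```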

Kernel PCA in Action [figure; Gisbrecht and Hammer, 2015]

Stochastic Neighbor Embedding (SNE) [Hinton and Roweis, 2002]
introduce a probabilistic neighborhood in the input space
$p_{j|i} = \frac{\exp(-0.5\,\|x_i - x_j\|^2/\sigma_i^2)}{\sum_{k \neq i}\exp(-0.5\,\|x_i - x_k\|^2/\sigma_i^2)}$
and in the output space
$q_{j|i} = \frac{\exp(-0.5\,\|y_i - y_j\|^2)}{\sum_{k \neq i}\exp(-0.5\,\|y_i - y_k\|^2)}$
optimize the sum of Kullback-Leibler divergences
$C = \sum_i \mathrm{KL}(P_i \| Q_i) = \sum_i \sum_{j \neq i} p_{j|i} \log\frac{p_{j|i}}{q_{j|i}}$
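A minimal numpy sketch of the SNE quantities above: the conditional probabilities and the KL cost. The per-point bandwidths sigma are assumed to be given as an array (in practice they are chosen via a perplexity search), and the gradient-based optimization of the embedding is omitted.

```python
import numpy as np

def conditional_probs(Z, sigma=None):
    """Gaussian conditional neighborhood probabilities p_{j|i} (or q_{j|i})."""
    sq = np.sum(Z**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T        # pairwise squared distances
    s2 = np.ones(Z.shape[0]) if sigma is None else np.asarray(sigma) ** 2
    logits = -0.5 * D2 / s2[:, None]
    np.fill_diagonal(logits, -np.inf)                   # exclude j == i
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def sne_cost(X, Y, sigma):
    """Sum of KL divergences between input and embedding neighborhoods."""
    P = conditional_probs(X, sigma)
    Q = conditional_probs(Y)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))
```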

Neighbor Retrieval Visualizer (NeRV) [Venna et al., 2010]
$\mathrm{precision}(i) = \frac{N_{\mathrm{TP},i}}{k_i} = 1 - \frac{N_{\mathrm{FP},i}}{k_i}$, $\mathrm{recall}(i) = \frac{N_{\mathrm{TP},i}}{r_i} = 1 - \frac{N_{\mathrm{MISS},i}}{r_i}$
$\mathrm{KL}(P_i \| Q_i)$ generalizes recall, $\mathrm{KL}(Q_i \| P_i)$ generalizes precision
NeRV optimizes $C = \lambda \sum_i \mathrm{KL}(P_i \| Q_i) + (1 - \lambda) \sum_i \mathrm{KL}(Q_i \| P_i)$
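A small sketch of the NeRV cost, assuming P and Q are the row-wise neighborhood matrices produced by the conditional_probs helper in the SNE sketch above; the lam and eps values are illustrative.

```python
import numpy as np

def nerv_cost(P, Q, lam=0.5, eps=1e-12):
    """NeRV cost: lambda * KL(P||Q) + (1 - lambda) * KL(Q||P),
    summed over the row-wise neighborhood distributions."""
    kl_pq = np.sum(P * np.log((P + eps) / (Q + eps)))   # recall-like term
    kl_qp = np.sum(Q * np.log((Q + eps) / (P + eps)))   # precision-like term
    return lam * kl_pq + (1.0 - lam) * kl_qp
```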

t-distributed SNE (t-SNE) [van der Maaten and Hinton, 2008]: symmetrized probabilities $p_{ij}$ and $q_{ij}$; uses a Student-t distribution in the output space.
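A short usage sketch with scikit-learn's TSNE, which implements the method described here; the data, perplexity, and random seed are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(500, 50)   # placeholder high-dimensional data

# Symmetrized p_ij and the Student-t output similarities are handled internally;
# perplexity controls the effective neighborhood size sigma_i.
Y = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
print(Y.shape)   # (500, 2)
```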

Neighbor Embedding in Action [figure; Gisbrecht and Hammer, 2015]

Maximum Variance Unfolding (MVU) [Weinberger and Saul, 2006]
goal: unfold a given manifold while keeping all the local distances and angles fixed
maximize $\sum_{ij}\|y_i - y_j\|^2$ s.t. $\sum_i y_i = 0$ and $\|y_i - y_j\|^2 = \|x_i - x_j\|^2$ for all neighbors
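A sketch of MVU written as a semidefinite program over the Gram matrix $K = YY^\top$, assuming cvxpy and a semidefinite solver are available; this direct formulation is only meant for small data sets, and the function name and neighbor-list format are illustrative.

```python
import numpy as np
import cvxpy as cp

def mvu_gram(X, neighbors):
    """MVU as an SDP: maximize trace(K) (total variance, since sum_i y_i = 0)
    subject to K being PSD, centered, and preserving neighbor distances.
    The embedding is obtained afterwards from the top eigenvectors of K."""
    N = X.shape[0]
    K = cp.Variable((N, N), PSD=True)
    constraints = [cp.sum(K) == 0]                       # centering constraint
    for i, j in neighbors:                               # list of index pairs
        d2 = float(np.sum((X[i] - X[j]) ** 2))
        constraints.append(K[i, i] - 2 * K[i, j] + K[j, j] == d2)
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()
    return K.value
```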

Manifold Learner in Action [figure; Gisbrecht and Hammer, 2015]

Overview
Dimensionality Reduction: Motivation; Linear Projections; Linear Mappings in Feature Space; Neighbor Embedding; Manifold Learning
Advances in Dimensionality Reduction: Speedup for Neighbor Embeddings; Quality Assessment of DR; Feature Relevance for DR; Visualization of Classifiers; Supervised Dimensionality Reduction

Barnes-Hut Approach [Yang et al., 2013, van der Maaten, 2013]
Complexity of NE: neighbor embeddings have complexity $O(N^2)$
the matrices $P$ and $Q$ have quadratic size
quadratic summation for the gradient $\frac{\partial C}{\partial y_i} = \sum_{j \neq i} g_{ij}(y_i - y_j)$

[figure; van der Maaten, 2013]

Barnes-Hut Approach: approximate the gradient $\frac{\partial C}{\partial y_i} = \sum_{j \neq i} g_{ij}(y_i - y_j)$ as $\sum_t G_i^t\, g_{i\hat t}\,(y_i - \hat y_t)$, where each cell $t$ of a space-partitioning tree is summarized by a representative point $\hat y_t$ and the number of points $G_i^t$ it contains; additionally, approximate $P$ by a sparse matrix. This results in an $O(N \log N)$ algorithm.
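A usage sketch of the Barnes-Hut approximation via scikit-learn's TSNE; method="barnes_hut" and angle (the Barnes-Hut theta trade-off) are real parameters of that implementation, while the data and seed are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(1).randn(5000, 20)   # placeholder data, N large enough to matter

# angle trades accuracy of the tree-based gradient approximation against speed
tsne = TSNE(n_components=2, method="barnes_hut", angle=0.5, random_state=1)
Y = tsne.fit_transform(X)
```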

Barnes-Hut SNE [figure; van der Maaten, 2013]

Kernel t-sne 14 O(N) algorithm: apply NE to a fixed subset, map remainder with out of sample projection 14 [Gisbrecht et al., 215]

Kernel t-sne 14 O(N) algorithm: apply NE to a fixed subset, map remainder with out of sample projection how to obtain an out of sample extension? 14 [Gisbrecht et al., 215]

Kernel t-sne 14 O(N) algorithm: apply NE to a fixed subset, map remainder with out of sample projection how to obtain an out of sample extension? use kernel mapping x y(x) = j α j k(x, x j ) l k(x, x l) = Ak 14 [Gisbrecht et al., 215]

Kernel t-sne 14 O(N) algorithm: apply NE to a fixed subset, map remainder with out of sample projection how to obtain an out of sample extension? use kernel mapping x y(x) = j α j k(x, x j ) l k(x, x l) = Ak minimization of y i y(x i ) 2 yields A = Y K 1 i 14 [Gisbrecht et al., 215]

Kernel t-sne 15 15 [Gisbrecht et al., 215]

Quality Assessment of DR [Lee et al., 2013]
most popular measure: $Q_k(X, Y) = \sum_i |N_k(x_i) \cap N_k(y_i)| \,/\, (Nk)$, the average overlap of the $k$-nearest-neighbor sets in the input space and in the embedding
a rescaled version was added recently and has been used to compare many DR techniques
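A small numpy sketch of this neighborhood-overlap measure (without the rescaling); brute-force distances are used for simplicity, and the function name and default k are illustrative.

```python
import numpy as np

def quality_qk(X, Y, k=10):
    """Average overlap of the k-nearest-neighbor sets in input and embedding."""
    def knn_sets(Z):
        sq = np.sum(Z**2, axis=1)
        D = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
        np.fill_diagonal(D, np.inf)               # exclude the point itself
        return np.argsort(D, axis=1)[:, :k]
    nx, ny = knn_sets(X), knn_sets(Y)
    N = X.shape[0]
    overlap = sum(len(set(nx[i]) & set(ny[i])) for i in range(N))
    return overlap / (N * k)
```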

Quality Assessment of DR [figure; Peluffo-Ordóñez et al., 2014]

Feature Relevance for DR Which features are important for a given projection?

data set: ESANN participants

Partic. | University | ESANN paper | #publications | ... | likes beer | ...
1       | A          | 1           | 15            | ... | 1          | ...
2       | A          | ...         | 8             | ... | -1         | ...
3       | B          | 1           | 22            | ... | -1         | ...
4       | C          | 1           | 9             | ... | ...        | ...
5       | C          | ...         | 15            | ... | -1         | ...
6       | D          | ...         | ...           | ... | ...        | ...

Visualization of ESANN participants [figure: 2D projection; axes Dim 1, Dim 2]

Relevance of features for the projection [figure: bar chart of feature relevances]

Visualization of ESANN participants [figure: 2D projection with clusters labeled by the "likes beer" feature (Beer 1, Beer -1); axes Dim 1, Dim 2]

Aim: estimate the relevance of single features for non-linear dimensionality reductions.
Idea: change the influence of a single feature and observe the change in the quality.

Evaluation of Feature Relevance for DR [Schulz et al., 2014a]
NeRV cost function $Q_k^{\mathrm{NeRV}}$ [Venna et al., 2010]: interpretation from an information retrieval perspective
the distance $d(x_i, x_j)^2 = \sum_l (x_i^l - x_j^l)^2$ becomes $\sum_l \lambda_l^2\,(x_i^l - x_j^l)^2$
$\lambda_{\mathrm{NeRV}}^k(l) := \lambda_l^2$, where $\lambda$ optimizes $Q_k^{\mathrm{NeRV}}(X_\lambda, Y) + \delta \sum_l \lambda_l^2$
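A crude illustration of the underlying idea only: rescale one feature and re-evaluate a quality measure (here the quality_qk sketch from above is reused). The cited method instead optimizes all weights $\lambda_l$ jointly under the NeRV quality with an L2 penalty; names and weight values below are illustrative.

```python
import numpy as np

def feature_sensitivity(X, Y, quality, l, weights=(0.0, 0.5, 1.0, 2.0)):
    """Probe the relevance of feature l: rescale that single feature,
    re-evaluate the embedding quality, and report the change."""
    base = quality(X, Y)
    results = {}
    for w in weights:
        Xw = X.copy()
        Xw[:, l] *= w                 # change the influence of feature l
        results[w] = quality(Xw, Y) - base
    return results

# usage: feature_sensitivity(X, Y, lambda A, B: quality_qk(A, B, k=10), l=0)
```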

Toy data set 1.2 C1 C2 C3.1 Dim 3.1.6.4.2.2 Dim 2.6.4.2 Dim 1.2.4

Toy data set 1 Dim 3.2.1.1.6.4 C1 C2 C3 DimII 4 3 2 1 1 2.2.2 Dim 2.6.4.2 Dim 1.2.4 3 4 5 C 1 C 2 C 3 6 5 4 3 2 1 1 2 3 4 5 DimI

Toy data set 1 Dim 3.2.1.1.6.4 C1 C2 C3 DimII 4 3 2 1 1 2.2.2 Dim 2.6.4.2 Dim 1.2.4 3 4 5 C 1 C 2 C 3 6 5 4 3 2 1 1 2 3 4 5 DimI 1.6 1.4 1.2 Dim 1 Dim 2 Dim 3 λ NeRV 1.8.6.4.2 1 2 3 4 5 6 Neighborhood size

Toy data set 2 C 1 C 2 C 3.1 Dim 3.1.4.2 Dim 2.2.4.4.2.2.4 Dim 1

Toy data set 2 C 1 C 2 C 3.1 Dim 3.1.4.2 Dim 2.2.4.4.2.2.4 Dim 1.6.4 C 1 C 2 C 3.2 DimII.2.4.6.8.8.6.4.2.2.4.6 DimI

Toy data set 2 C 1 C 2 C 3.1 Dim 3.1.4.2 Dim 2.2.4.4.2.2.4 Dim 1.6.4.2 C 1 C 2 C 3 4 3 2 DimII.2 DimII 1 1.4.6 2 3 C 1 C 2 C 3.8.8.6.4.2.2.4.6 DimI 4 5 4 3 2 1 1 2 3 4 5 DimI

Relevances for different projections.6.4.2 C 1 C 2 C 3 4 3 2 DimII.2 DimII 1 1.4.6 2 3 C 1 C 2 C 3.8.8.6.4.2.2.4.6 DimI 4 5 4 3 2 1 1 2 3 4 5 DimI 1.8 1.6 1.4 Dim 1 Dim 2 Dim 3 1.2 λ NeRV 1.8.6.4.2 1 2 3 4 5 6 Neighborhood size

Relevances for different projections.6.4.2 C 1 C 2 C 3 4 3 2 DimII.2 DimII 1 1.4.6 2 3 C 1 C 2 C 3.8.8.6.4.2.2.4.6 DimI 4 5 4 3 2 1 1 2 3 4 5 DimI 1.8 1.6 1.4 1.2 Dim 1 Dim 2 Dim 3 5 4.5 4 3.5 Dim 1 Dim 2 Dim 3 λ NeRV 1.8 λ NeRV 3 2.5 2.6 1.5.4 1.2.5 1 2 3 4 5 6 Neighborhood size 1 2 3 4 5 6 Neighborhood size

98.7%

Why visualize Classifiers?

Where is the Difficulty? Class borders are often non-linear, often not given in an explicit functional form (e.g. SVM), and high-dimensional, which makes it infeasible to sample them for a projection.

An illustration of the approach [figures: 3D toy data with classes 1 and 2 (axes dim1, dim2, dim3); its 2D projection with support vectors marked (axes dimI, dimII); the sampled 2D grid projected back into 3D]
project data to 2D
sample the 2D data space
project the samples up
classify them

An illustration: border visualization color intensity codes the certainty of the classifier

Classifier visualization [Schulz et al., 2014b]: project data to 2D; sample the 2D data space; project the samples up; classify them [figures as in the illustration above]. A code sketch of this pipeline follows below.
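A simplified sketch of the four-step pipeline. The cited approach uses a discriminative, nonlinear DR method with an explicit inverse mapping; here PCA and its inverse transform merely stand in for the projection and the "project the samples up" step, and the data, labels, and parameters are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X = np.random.RandomState(0).randn(300, 3)
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)     # toy labels

clf = SVC(probability=True).fit(X, y)                   # classifier in the original space
dr = PCA(n_components=2).fit(X)
Z = dr.transform(X)                                     # 1. project data to 2D

xx, yy = np.meshgrid(np.linspace(Z[:, 0].min(), Z[:, 0].max(), 200),
                     np.linspace(Z[:, 1].min(), Z[:, 1].max(), 200))
grid2d = np.c_[xx.ravel(), yy.ravel()]                  # 2. sample the 2D data space
grid_hd = dr.inverse_transform(grid2d)                  # 3. project the samples up
proba = clf.predict_proba(grid_hd)[:, 1]                # 4. classify them

# color intensity codes the certainty of the classifier
plt.contourf(xx, yy, proba.reshape(xx.shape), levels=20, cmap="RdBu")
plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="RdBu", edgecolor="k")
plt.show()
```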

Toy Data Set 1 [figures: 3D data with classes C1, C2 and the resulting classifier visualization]

Toy Data Set 2 [figures: 3D data with classes C1, C2 and the resulting classifier visualization]

Toy Data Set 2 with NE [figures: the same data and its classifier visualization using a neighbor-embedding projection]

DR: intrinsically 3D data [figures; classes C1, C2]

DR: NE projection

Supervised dimensionality reduction: use the Fisher metric [Peltonen et al., 2004, Gisbrecht et al., 2015]
$d(x, x + dx) = dx^\top J(x)\, dx$, with $J(x) = \mathrm{E}_{p(c|x)}\left\{ \left(\nabla_x \log p(c|x)\right)\left(\nabla_x \log p(c|x)\right)^\top \right\}$
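A minimal numerical sketch of the local Fisher information matrix $J(x)$, assuming a callable that returns $\log p(c|x)$ for all classes; the finite-difference gradients are an assumption of this illustration, not the estimator used in the cited papers.

```python
import numpy as np

def fisher_matrix(x, log_p_c_given_x, n_classes, eps=1e-4):
    """Estimate J(x) = E_{p(c|x)}[ grad log p(c|x) grad log p(c|x)^T ]
    with central finite differences for the gradients."""
    D = x.shape[0]
    grads = np.zeros((n_classes, D))
    for d in range(D):
        e = np.zeros(D)
        e[d] = eps
        grads[:, d] = (log_p_c_given_x(x + e) - log_p_c_given_x(x - e)) / (2 * eps)
    p = np.exp(log_p_c_given_x(x))            # class posteriors at x
    return sum(p[c] * np.outer(grads[c], grads[c]) for c in range(n_classes))
```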

DR: supervised NE projection

prototypes in original space C1 C2

Summary
different objectives of dimensionality reduction
a new approach to gain insight into trained classification models
discriminative information can yield major improvements

Thank You For Your Attention! Alexander Schulz aschulz (at) techfak.uni-bielefeld.de

Literature I
Gisbrecht, A. and Hammer, B. (2015). Data visualization by nonlinear dimensionality reduction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(2):51-73.
Gisbrecht, A., Schulz, A., and Hammer, B. (2015). Parametric nonlinear dimensionality reduction using kernel t-SNE. Neurocomputing, 147:71-82.
Hinton, G. and Roweis, S. (2002). Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, pages 833-840. MIT Press.
Lee, J. A., Renard, E., Bernard, G., Dupont, P., and Verleysen, M. (2013). Type 1 and 2 mixtures of Kullback-Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112:92-108.
Lee, J. A. and Verleysen, M. (2007). Nonlinear dimensionality reduction. Springer.
Peltonen, J., Klami, A., and Kaski, S. (2004). Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17:1087-1100.

Literature II
Peluffo-Ordóñez, D. H., Lee, J. A., and Verleysen, M. (2014). Recent methods for dimensionality reduction: A brief comparative analysis. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319.
Schulz, A., Gisbrecht, A., and Hammer, B. (2014a). Relevance learning for dimensionality reduction. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.
Schulz, A., Gisbrecht, A., and Hammer, B. (2014b). Using discriminative dimensionality reduction to visualize classifiers. Neural Processing Letters, pages 1-28.
van der Maaten, L. (2013). Barnes-Hut-SNE. CoRR, abs/1301.3342.
van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605.
Venna, J., Peltonen, J., Nybo, K., Aidos, H., and Kaski, S. (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451-490.

Literature III
Weinberger, K. Q. and Saul, L. K. (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Proceedings of the 21st National Conference on Artificial Intelligence. AAAI.
Yang, Z., Peltonen, J., and Kaski, S. (2013). Scalable optimization of neighbor embedding for visualization. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 127-135. JMLR Workshop and Conference Proceedings.