Perplexity-free t-SNE and twice Student tt-SNE


ESANN 2018 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 25-27 April 2018, i6doc.com publ., ISBN 978-287587047-6.

Cyril de Bodt, Dounia Mulders, Michel Verleysen and John A. Lee

1 - Université catholique de Louvain - ICTEAM/ELEN, Place du Levant 3 L5.03.02, 1348 Louvain-la-Neuve - Belgium
2 - Université catholique de Louvain - IREC/MIRO, Avenue Hippocrate 55 B1.54.07, 1200 Brussels - Belgium

DM and CdB are FNRS Research Fellows. JAL is an FNRS Senior Research Associate.

Abstract. In dimensionality reduction and data visualisation, t-SNE has become a popular method. In this paper, we propose two variants of the Gaussian similarities used to characterise the neighbourhoods around each high-dimensional datum in t-SNE. A first alternative is to use t distributions, as already used in the low-dimensional embedding space; a variable degree of freedom accounts for the intrinsic dimensionality of the data. The second variant relies on compounds of Gaussian neighbourhoods with growing widths, thereby suppressing the need for the user to adjust a single size or perplexity. In both cases, heavy-tailed distributions thus characterise the neighbourhood relationships in the data space. Experiments show that both variants are competitive with t-SNE, at no extra cost.

1 Introduction

Nonlinear dimensionality reduction (NLDR) aims at representing faithfully high-dimensional (HD) data in low-dimensional (LD) spaces, mainly for exploratory visualisation or to foil the curse of dimensionality [1]. Several paradigms have been studied over time, like the reproduction of distances [2] or neighbourhoods [3, 4] from the data space to the embedding space. As a typical method of multidimensional scaling based on distance preservation, Sammon's nonlinear mapping [5] has been popular for nearly 50 years, with thousands of citations on Google Scholar at the time of writing. As a more recent neighbourhood-preserving NLDR method, t-distributed stochastic neighbour embedding (t-SNE) [6] has earned thousands of citations in merely a single decade. Over these years, t-SNE has raised much interest, owing to its impressive results. Variants with alternative divergences as cost functions [7, 8, 9] and lower-complexity tree-based approximations [10, 11] have been developed. Properties of t-SNE have been investigated to explain why it blatantly outperforms many older NLDR methods [12].

However, some questions remain. For instance, it remains difficult to justify the use of apparently unrelated neighbourhood distributions in the HD and LD spaces, namely, Gaussian versus Student. For very high-dimensional data, the intuition of a crowding problem [6], related to the curse of dimensionality, motivates tails that are heavier in LD than in HD. Mid-range to distant neighbours are then repelled further away, which compensates for the relative lack of volume in LD spaces, compared to HD.

This approach tends, however, to over-emphasise clusters or to be too extreme for data with low intrinsic dimensionality. Adapting the degrees of freedom of the Student function has been a possible workaround [13]. We investigate here the use of twice Student SNE, coined tt-SNE, with t functions in both spaces. Another question relates to the relevance of the perplexity, the main meta-parameter in t-SNE. Following previous work [14], we show that compound Gaussian neighbourhoods, with exponentially growing perplexities, can be used in t-SNE to obtain multi-scale neighbourhoods with intermediate tails, between Gaussian and Student. Such compounded Gaussian neighbourhoods systematically span a broad range of perplexities, relieving the user of its adjustment. Experiments show that the proposed variants are as competitive and efficient as t-SNE and that they are applicable in a broader variety of cases.

The rest of this paper is organised as follows. Section 2 is a reminder of SNE and t-SNE. Section 3 describes our contributions. Section 4 reports the experimental results. Section 5 draws the conclusions and sketches perspectives.

2 SNE and t-SNE

Let \Xi = [\xi_i]_{i=1}^{N} denote a set of N points in an HD metric space with M features. Let X = [x_i]_{i=1}^{N} represent it in a P-dimensional LD metric space, P \leq M. The HD and LD distances between the ith and jth points are noted \delta_{ij} and d_{ij}. SNE defines HD and LD similarities, for i \in I = \{1, \ldots, N\} and j \in I \setminus \{i\} [4]:

\sigma_{ij} = \frac{\exp(-\pi_i \delta_{ij}^2 / 2)}{\sum_{k \in I \setminus \{i\}} \exp(-\pi_i \delta_{ik}^2 / 2)}, \qquad s_{ij} = \frac{\exp(-d_{ij}^2 / 2)}{\sum_{k \in I \setminus \{i\}} \exp(-d_{ik}^2 / 2)}, \qquad \sigma_{ii} = s_{ii} = 0.

Precisions \pi_i are set to reach a user-fixed perplexity K^* for the distribution [\sigma_{ij}; j \in I \setminus \{i\}]: \pi_i is chosen such that \log K^* = -\sum_{j \in I \setminus \{i\}} \sigma_{ij} \log \sigma_{ij}. The perplexity is interpreted as the size of the soft Gaussian neighbourhood. SNE then finds the LD positions by minimising the sum of divergences between the HD and LD similarity distributions: C_{SNE} = \sum_{i \in I} \sum_{j \in I \setminus \{i\}} \sigma_{ij} \log(\sigma_{ij} / s_{ij}).

Besides symmetrising the similarities, t-SNE uses a Student t function with one degree of freedom in the LD space, to cope with crowding problems [6]:

\sigma_{ij,s} = \frac{\sigma_{ij} + \sigma_{ji}}{2N}, \qquad s_{ij,t} = \frac{(1 + d_{ij}^2)^{-1}}{\sum_{k \in I, l \in I \setminus \{k\}} (1 + d_{kl}^2)^{-1}}, \qquad s_{ii,t} = 0.

The t-SNE cost function uses KL divergences as in SNE. It is minimised by gradient descent, each evaluation requiring O(N^2) operations. The discrepancy between Gaussian and Student distributions is arbitrarily fixed in t-SNE, without guarantee that the induced distance transformation is optimal. To address this issue, a variant of t-SNE has been described [13], where s_{ij,t} has an additional parameter \alpha, adjusting the degrees of freedom:

s_{ij,t} = \frac{(1 + d_{ij}^2 / \alpha)^{-(\alpha+1)/2}}{\sum_{k \in I, l \in I \setminus \{k\}} (1 + d_{kl}^2 / \alpha)^{-(\alpha+1)/2}},   (1)

where \alpha can be learned. A relation between \alpha and the intrinsic dimensionality of the data is hypothesised. If \alpha \to \infty, the Student function tends to a Gaussian.
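As an illustration of the perplexity calibration recalled above, the following minimal NumPy sketch (ours, not the authors' code; the function name, tolerance and default perplexity are arbitrary choices) tunes the precision \pi_i of a single row by binary search until the row entropy matches \log K^*:

    import numpy as np

    def gaussian_row(d2, perplexity=32.0, tol=1e-5, max_iter=50):
        """Return the SNE similarities sigma_ij (j != i) of one point i, given the
        squared HD distances d2 to all other points.  The precision pi_i is tuned
        by binary search until the entropy of the row matches log(perplexity)."""
        target = np.log(perplexity)
        lo, hi, prec = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            row = np.exp(-prec * d2 / 2.0)
            row /= row.sum()
            entropy = -np.sum(row * np.log(row + 1e-12))
            if abs(entropy - target) < tol:
                break
            if entropy > target:   # neighbourhood too broad: increase the precision
                lo, prec = prec, (prec * 2.0 if np.isinf(hi) else 0.5 * (prec + hi))
            else:                  # neighbourhood too narrow: decrease the precision
                hi, prec = prec, 0.5 * (lo + prec)
        return row

The LD similarities s_{ij} follow from the same expression with unit precision and no calibration, and applying this helper to every row of the HD distance matrix yields the \sigma_{ij} entering C_{SNE}.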

3 Heavy tails in the high-dimensional space

A Student with infinite degrees of freedom boils down to a Gaussian. Conversely, a Student turns out to be a compound distribution, involving zero-mean Gaussians with gamma-distributed precisions. Marginalising over the precision leads to a Student. The shape and scale parameters of the gamma determine the degrees of freedom and the scaling of the resulting Student. Hence, t-SNE appears as a variant of SNE where the HD and LD neighbourhoods follow Student functions with, respectively, \infty and 1 degrees of freedom.

Let us consider that (i) t-SNE usually works better for very high-dimensional data than for simple manifolds, and (ii) tweaking the degrees of freedom \alpha of the LD neighbourhoods s_{ij,t} has been tried [13], linking \alpha and the intrinsic data dimensionality M'. From there, we suggest using Student functions in both spaces, with degrees of freedom equal to M' and P, instead of \infty and 1 as in t-SNE. Formally, twice t-SNE, coined tt-SNE, thus relies on \sigma_{ij,ts} = (\sigma_{ij,t} + \sigma_{ji,t}) / (2N), with

\sigma_{ij,t} = \frac{(1 + \pi_i \delta_{ij}^2 / M')^{-(M'+1)/2}}{\sum_{k \in I \setminus \{i\}} (1 + \pi_i \delta_{ik}^2 / M')^{-(M'+1)/2}}, \qquad s_{ij,t} = \frac{(1 + d_{ij}^2 / P)^{-(P+1)/2}}{\sum_{k \in I, l \in I \setminus \{k\}} (1 + d_{kl}^2 / P)^{-(P+1)/2}}, \qquad s_{ii,t} = 0.

Unlike in [13], the degrees of freedom are fixed beforehand in both \sigma_{ij,t} and s_{ij,t}; an ad hoc estimator of the intrinsic data dimensionality provides M' [14]. The chosen degrees of freedom ensure consistency with genuine t-SNE in the LD space if P = 1, whereas the discrepancy between \sigma_{ij,t} and s_{ij,t} adapts to the dimensionality gap between M' and P. Precisions \pi_i can be determined to reach the user-specified perplexity K^*. Learning the degrees of freedom as in [13] would increase running times, since the precisions would have to be retuned at each iteration.

If both Gaussian and heavy-tailed distributions like Student's can be used for the HD neighbourhoods, then other, intermediate distributions could be considered as well. Relying again on the definition of the Student as a compound of Gaussians with gamma-distributed precisions, we suggest a similar but discretised scheme. We use the multi-scale Gaussian similarities from previous work [14], defined as \sigma_{ij,s} = (\sigma_{ij} + \sigma_{ji}) / (2N), with

\sigma_{ij} = \frac{1}{H} \sum_{h=1}^{H} \sigma_{hij}, \quad \text{and} \quad \sigma_{hij} = \frac{\exp(-\pi_{hi} \delta_{ij}^2 / 2)}{\sum_{k \in I \setminus \{i\}} \exp(-\pi_{hi} \delta_{ik}^2 / 2)},   (2)

where H = \lfloor \log_2(N/2) \rceil. Precisions \pi_{hi} are adjusted by imposing exponentially growing perplexities K_h = 2^h with 1 \leq h \leq H (or, equivalently, linearly growing entropies). Precisions are thus sampled in a data-driven way. The resulting compound distribution has a heavier tail than a single Gaussian, although it remains lighter than a Student's, since the widest Gaussian component \sigma_{Hij} in \sigma_{ij} keeps an exponentially decreasing tail. As a key advantage, multi-scale Gaussian neighbourhoods no longer require the user to set a perplexity. Integrating them in t-SNE leads to multi-scale t-SNE, or Ms.t-SNE in short.
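For concreteness, a companion sketch of the multi-scale HD similarities of Eq. (2) could look as follows; it is a hypothetical illustration, not the authors' implementation, which reuses the gaussian_row() helper from the previous sketch and assumes H = \lfloor \log_2(N/2) \rceil scales:

    import numpy as np

    def multiscale_hd_similarities(delta):
        """Compound multi-scale similarities in the spirit of Eq. (2): average the
        Gaussian rows calibrated at perplexities 2, 4, ..., 2**H, then symmetrise.
        Reuses the gaussian_row() helper sketched in Section 2; `delta` is the
        (N, N) matrix of HD distances."""
        N = delta.shape[0]
        H = int(round(np.log2(N / 2.0)))        # number of scales, H = round(log2(N/2))
        off_diag = ~np.eye(N, dtype=bool)       # mask excluding each point itself
        sigma = np.zeros((N, N))
        for i in range(N):
            d2 = delta[i, off_diag[i]] ** 2
            row = sum(gaussian_row(d2, perplexity=2.0 ** h) for h in range(1, H + 1)) / H
            sigma[i, off_diag[i]] = row
        return (sigma + sigma.T) / (2.0 * N)    # sigma_{ij,s} = (sigma_ij + sigma_ji) / (2N)

Since the H calibrations are performed once per point at initialisation, such a preprocessing step is consistent with the cost analysis given in Section 4.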

4 Experiments and discussion

The compared NLDR methods are (i) SNE [4], (ii) t-SNE in its genuine 2008 implementation [6], (iii) a modified version where precisions are determined with Newton's method [15] instead of binary searches, (iv) tt-SNE, (v) Ms.t-SNE, and (vi) Ms.SNE [14]. Perplexities in the first four methods are fixed to the same value. The data sets are a spherical manifold (M = 3, M' = 2.3) [14], COIL-20 (N = 1440, M = 16384, M' = 6.5) [16], a subset of MNIST (M = 784, M' = 8.7) [17], and B. Frey's faces (N = 1965, M = 560, M' = 6.43). Intrinsic dimensionality was computed as in [14]. Target dimensionality P is two for all data sets.

As to DR quality, some studies developed indicators of the preservation of HD neighbourhoods in the LD space [18, 14], which have become generally adopted in several publications [7, 14]. Let the sets \nu_i^K and n_i^K index the K nearest neighbours of \xi_i and x_i in the HD and LD space, respectively, with Q_{NX}(K) = \sum_{i \in I} |\nu_i^K \cap n_i^K| / (KN) \in [0, 1] measuring their average normalised agreement. As its expectation amounts to K/(N-1) for random LD points, R_{NX}(K) = ((N-1) Q_{NX}(K) - K)/(N-1-K) allows comparing different neighbourhood sizes [14]. The curve is displayed on a log-scale for K, as close neighbours prevail. Evaluated in O(N^2 \log N) time [18], its area AUC = \left(\sum_{K=1}^{N-2} R_{NX}(K)/K\right) / \left(\sum_{K=1}^{N-2} 1/K\right) \in [0, 1] grows with DR quality, quantified at all scales with an emphasis on small ones.
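As a concrete reference for these criteria, the following brute-force sketch (ours, intended for small N; the rank-based evaluation cited above runs in O(N^2 \log N)) computes Q_{NX}(K), R_{NX}(K) and the log-weighted AUC:

    import numpy as np

    def rnx_curve_and_auc(delta_hd, d_ld):
        """Brute-force R_NX(K) curve and its log-weighted AUC, following the
        definitions recalled in Section 4.  Both arguments are (N, N) distance
        matrices with zero diagonals, in the HD and LD spaces respectively."""
        N = delta_hd.shape[0]
        # K-nearest-neighbour indices in each space (column 0 is the point itself, dropped)
        nn_hd = np.argsort(delta_hd, axis=1)[:, 1:]
        nn_ld = np.argsort(d_ld, axis=1)[:, 1:]
        Ks = np.arange(1, N - 1)                       # K = 1, ..., N-2
        q_nx = np.empty(len(Ks))
        for idx, K in enumerate(Ks):
            overlap = sum(len(np.intersect1d(nn_hd[i, :K], nn_ld[i, :K])) for i in range(N))
            q_nx[idx] = overlap / (K * N)              # average normalised agreement Q_NX(K)
        r_nx = ((N - 1) * q_nx - Ks) / (N - 1 - Ks)    # rescaled so that random embeddings give 0
        auc = np.sum(r_nx / Ks) / np.sum(1.0 / Ks)     # area under the curve on a log K scale
        return r_nx, auc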

The results are shown in Fig. 1. In terms of AUC, SNE performs poorly compared to all t-SNE variants [12]. Precision computation with either binary search or Newton's method [15] has little impact on the results. Ms.SNE [14] yields the best AUCs. Using heavy tails in the HD space, the proposed methods tt-SNE and Ms.t-SNE produce embeddings where clusters are less over-emphasised, since the HD-LD tail gap is actually tighter for these than for t-SNE. A Student with \alpha = 2 or 3 is indeed always closer to another Student (with \alpha < \infty) than to a Gaussian (a Student with \alpha = \infty). Cluster over-emphasis, present in t-SNE and to a lesser extent in tt-SNE and Ms.t-SNE, is slightly detrimental to the AUC, with weaker preservation of large neighbourhoods. Precision computation with t HD neighbourhoods in tt-SNE fails to converge for some points, without impact on the embedding quality, tt-SNE being second only to Ms.SNE and Ms.t-SNE. Ms.t-SNE shows very good fidelity, without significant loss in larger neighbourhoods, with the advantage for the user of having no perplexity to adjust, as in Ms.SNE.

As to running times, all t-SNE variants have the same complexity of O(N^2), whereas Ms.SNE runs in O(N^2 \log N), explaining the overall best results of the latter. Computation of the multi-scale neighbourhoods takes O(N^2 \log N) in Ms.t-SNE, but it is carried out only once at initialisation and is thus negligible with respect to the iterated gradient evaluation in O(N^2).

5 Conclusions and perspectives

Two variants of t-SNE are proposed. The first replaces Gaussian neighbourhoods in the data space with Student t functions, to be more consistent with the t neighbourhoods in the embedding space. The dimensionality gap between both spaces is accounted for by different degrees of freedom, determined in the HD space by the intrinsic data dimensionality. The second variant reuses the compound, multi-scale HD neighbourhoods described in [14] and matches them with LD t neighbourhoods; the user no longer has any perplexity or neighbourhood size to adjust. Experiments show that these variants are competitive with t-SNE and broaden its applicability. Perspectives include (i) the possibility to use multivariate t distributions to better take the intrinsic dimensionality into account, and (ii) adapting the precision sampling in Ms.t-SNE to better match the t function in the LD space.

References

[1] J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
[2] I. Borg and P.J.F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer-Verlag, New York, 1997.
[3] T. Kohonen. Self-Organizing Maps. Springer, Heidelberg, 2nd edition, 1995.
[4] G. Hinton and S.T. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 833-840. MIT Press, 2003.
[5] J.W. Sammon. A nonlinear mapping algorithm for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409, 1969.
[6] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[7] J. Venna, J. Peltonen, K. Nybo, H. Aidos, and S. Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451-490, 2010.
[8] K. Bunte, S. Haase, M. Biehl, and T. Villmann. Stochastic neighbor embedding (SNE) for dimension reduction and visualization using arbitrary divergences. Neurocomputing, 90:23-45, 2012.
[9] J.A. Lee, E. Renard, G. Bernard, P. Dupont, and M. Verleysen. Type 1 and 2 mixtures of Kullback-Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112:92-108, 2013.
[10] Z. Yang, J. Peltonen, and S. Kaski. Scalable optimization of neighbor embedding for visualization. In Proc. ICML 2013, pages 127-135. JMLR W&CP, 2013.
[11] L. van der Maaten. Barnes-Hut-SNE. In Proc. International Conference on Learning Representations, 2013.
[12] J.A. Lee and M. Verleysen. Two key properties of dimensionality reduction methods. In Proc. IEEE SSCI CIDM, pages 163-170, 2014.
[13] L.J.P. van der Maaten. Learning a parametric embedding by preserving local structure. In Proc. 12th AISTATS, pages 384-391, Clearwater Beach, FL, 2009.
[14] J.A. Lee, D.H. Peluffo-Ordóñez, and M. Verleysen. Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169:246-261, 2015.
[15] M. Vladymyrov and M.Á. Carreira-Perpiñán. Entropic affinities: Properties and efficient numerical computation. In Proc. 30th ICML, JMLR: W&CP, Atlanta, GA, 2013.
[16] S.A. Nene, S.K. Nayar, and H. Murase. Columbia object image library (COIL-20). Technical Report CUCS-005-96, Columbia University, 1996.
[17] Y. LeCun, C. Cortes, and C.J.C. Burges. The MNIST database of handwritten digits, 1998.
[18] J.A. Lee and M. Verleysen. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7-9):1431-1443, 2009.

[Figure omitted.] Fig. 1: Embeddings and R_NX(K) quality curves (with their AUC) for Sphere, COIL-20, MNIST, and B. Frey's faces. g = Gaussian, t = Student, Ms. = multi-scale.