Mathematical Foundations of the Generalization of t-SNE and SNE for Arbitrary Divergences


MACHINE LEARNING REPORTS

Mathematical Foundations of the Generalization of t-SNE and SNE for Arbitrary Divergences

Report 02/2010. Submitted: 01.04.2010, Published: 26.04.2010

T. Villmann and S. Haase
University of Applied Sciences Mittweida, Department of MPI, Technikumplatz 17, 09648 Mittweida, Germany

Machine Learning Reports, Research group on Computational Intelligence, http://www.mni.hs-mittweida.de/

Abstract

In this paper we offer a systematic approach to the mathematical treatment of t-Distributed Stochastic Neighbor Embedding (t-SNE) as well as the Stochastic Neighbor Embedding (SNE) method. In this way the underlying theory becomes better visible and allows an easy adaptation or exchange of the several modules contained. In particular, this concerns the mathematical structure of the cost function used in the model (the divergence), which can now be treated independently of the other components such as the data similarity measures or the data distributions. Thereby we focus on the utilization of different divergences. This approach requires the consideration of the Fréchet derivatives of the divergences in use. In this way, the approach can easily be adapted to user-specific requests.

1 Introduction

Methods of dimensionality reduction are challenging tools in the fields of data analysis and machine learning, used for visualization as well as for data compression and fusion [?]. Generally, dimensionality reduction methods convert a high-dimensional data set $X = \{x\}$ into a low-dimensional data set $\Xi = \{\xi\}$. A probabilistic approach to visualize the structure of complex data sets while preserving neighbor similarities is Stochastic Neighbor Embedding (SNE), proposed by HINTON and ROWEIS [?]. In [9] VAN DER MAATEN and HINTON presented a technique called t-SNE, which is a variation of SNE based on a different statistical model assumption for the distribution in the embedding space. Both methods have in common that a probability distribution over all potential neighbors of a data point in the high-dimensional space is analyzed and described by the pairwise dissimilarities. Both t-SNE and SNE (in its symmetric variant [9]) originally minimize, by gradient descent, a Kullback-Leibler divergence between a joint probability distribution in the high-dimensional space and a joint probability distribution in the low-dimensional space as the underlying cost function. The pairwise similarities in the high-dimensional original data space are set to

$$p_{\xi\eta} = \frac{p_{\eta|\xi} + p_{\xi|\eta}}{2\int 1\, d\eta} \qquad (1.1)$$

with the conditional probabilities

$$p_{\eta|\xi} = \frac{\exp\left(-\|\xi-\eta\|^2 / 2\sigma_\xi^2\right)}{\int \exp\left(-\|\xi-\eta'\|^2 / 2\sigma_\xi^2\right) d\eta'}.$$

SNE and t-SNE differ in the model assumptions regarding the distribution in the low-dimensional mapping space, which will be defined more precisely below. In this article we provide the mathematical framework for the generalization of t-SNE and SNE in such a way that an arbitrary divergence can be used as the cost function for the gradient descent instead of the Kullback-Leibler divergence. The methodology is based on the Fréchet derivative of the divergence used as the cost function [16],[17].
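For concreteness, the following NumPy sketch computes a discrete analogue of Eq. (1.1) for a finite data set: symmetrized Gaussian affinities $p_{ij}$. It is only an illustration under a simplifying assumption, namely one global bandwidth `sigma` instead of the per-point, perplexity-calibrated bandwidths used in practical SNE/t-SNE implementations; all identifiers are ours.

```python
import numpy as np

def high_dim_affinities(X, sigma=1.0):
    """Discrete analogue of Eq. (1.1): symmetrized Gaussian affinities p_ij.

    Simplification: a single global bandwidth `sigma`; practical implementations
    calibrate a separate sigma_i per point via a perplexity target.
    """
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # no self-affinity
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)  # conditional p_{j|i}
    return (P_cond + P_cond.T) / (2.0 * n)       # symmetrized, sums to 1
```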

2 Derivation of the general cost function gradient for t-SNE and SNE

2.1 The t-SNE gradient

Let $D(p\|q)$ be a divergence for non-negative integrable measure functions $p = p(r)$ and $q = q(r)$ with domain $V$, and let $\xi, \eta \in \Xi$ be distributed according to $\Pi_\Xi$ [2]. Further, let $r(\xi,\eta): \Xi \times \Xi \to \mathbb{R}$ with the distribution $\Pi_r = \varphi(r, \Pi_\Xi)$. Moreover, the constraints $p(r) \le 1$ and $q(r) \le 1$ hold for all $r \in V$. We denote such measure functions as positive measures. The weight of the functional $p$ is defined as

$$W(p) = \int_V p(r)\, dr. \qquad (2.1)$$

Positive measures $p$ with weight $W(p) = 1$ are denoted as (probability) density functions. Let us define

$$r = \|\xi - \eta\|^2. \qquad (2.2)$$

For t-SNE, $q$ is obtained by means of a Student t-distribution, such that

$$q = q(r(\xi,\eta)) = \frac{(1 + r(\xi,\eta))^{-1}}{\int\!\!\int (1 + r(\xi',\eta'))^{-1}\, d\xi'\, d\eta'},$$

which we will abbreviate below, for reasons of clarity, as

$$q(r) = \frac{(1+r)^{-1}}{\int\!\!\int (1+r')^{-1}\, d\xi'\, d\eta'} = f(r)\, I^{-1}.$$

Now let us consider the derivative of $D$ with respect to $\xi$:

$$\frac{\partial D}{\partial \xi} = \frac{\partial D(p, q(r(\xi',\eta')))}{\partial \xi}
= \int\!\!\int \frac{\partial D}{\partial r(\xi',\eta')}\, \frac{\partial r(\xi',\eta')}{\partial \xi}\, d\xi'\, d\eta'
= \int\!\!\int \frac{\partial D}{\partial r(\xi',\eta')} \left[ 2\delta_{\xi,\xi'}(\xi' - \eta') - 2\delta_{\eta',\xi}(\xi' - \eta') \right] d\xi'\, d\eta'$$

$$= 2\int \frac{\partial D}{\partial r(\xi,\eta)}\, (\xi - \eta)\, d\eta + 2\int \frac{\partial D}{\partial r(\xi',\xi)}\, (\xi - \xi')\, d\xi'
= 4\int \frac{\partial D}{\partial r(\xi,\eta)}\, (\xi - \eta)\, d\eta, \qquad (2.3)$$

using the symmetry $r(\xi,\eta) = r(\eta,\xi)$ in the last step. We now have to consider $\frac{\partial D}{\partial r(\xi,\eta)}$. Again using the chain rule for functional derivatives, we get

$$\frac{\partial D}{\partial r(\xi,\eta)} = \int\!\!\int \frac{\delta D}{\delta q(r(\xi',\eta'))}\, \frac{\partial q(r(\xi',\eta'))}{\partial r(\xi,\eta)}\, d\xi'\, d\eta' \qquad (2.4)$$

$$= \int \frac{\delta D}{\delta q(r')}\, \frac{\partial q(r')}{\partial r}\, \Pi_r\, dr'. \qquad (2.5)$$

Thereby,

$$\frac{\partial q(r')}{\partial r} = \frac{\delta f(r')}{\delta r}\, I^{-1} - f(r')\, I^{-2}\, \frac{\delta I}{\delta r}$$

holds, with

$$\frac{\delta f(r')}{\delta r} = -\delta_{r,r'}\, (1+r')^{-2} \quad \text{and} \quad \frac{\delta I}{\delta r} = -(1+r)^{-2}.$$

So we obtain

$$\frac{\partial q(r')}{\partial r} = -\frac{\delta_{r,r'}\, (1+r')^{-2}}{I} + \frac{f(r')\, (1+r)^{-2}}{I^2}
= -\frac{\delta_{r,r'}\, (1+r')^{-1} f(r')}{I} + \frac{f(r')}{I}\, \frac{f(r)}{I}\, (1+r)^{-1}$$

$$= -\delta_{r,r'}\, (1+r)^{-1}\, q(r') + q(r')\, q(r)\, (1+r)^{-1}
= -(1+r)^{-1}\, q(r') \left( \delta_{r,r'} - q(r) \right).$$

Substituting these results in eq. (2.5), we get

$$\frac{\partial D}{\partial r} = \int \frac{\delta D}{\delta q(r')}\, \frac{\partial q(r')}{\partial r}\, \Pi_r\, dr'
= -\int \frac{\delta D}{\delta q(r')}\, (1+r)^{-1} q(r') \left( \delta_{r,r'} - q(r) \right) \Pi_r\, dr'$$

$$= -(1+r)^{-1} q(r) \left( \frac{\delta D}{\delta q}(r) - \int \frac{\delta D}{\delta q}(r')\, q(r')\, \Pi_r\, dr' \right).$$

Finally, we can collect all partial results and get

$$\frac{\partial D}{\partial \xi} = 4 \int \frac{\partial D}{\partial r}\, (\xi - \eta)\, d\eta
= -4 \int (1+r)^{-1} q(r) \left( \frac{\delta D}{\delta q}(r) - \int \frac{\delta D}{\delta q}(r')\, q(r')\, \Pi_r\, dr' \right) (\xi - \eta)\, d\eta. \qquad (2.6)$$

We now have the obvious advantage that we can derive $\frac{\partial D}{\partial \xi}$ for several divergences $D(p\|q)$ directly from (2.6), if the Fréchet derivative $\frac{\delta D}{\delta q}(r)$ of $D$ with respect to $q(r)$ is known. For the most important classes of divergences, including the Kullback-Leibler, Rényi and Tsallis divergences, these Fréchet derivatives can be found in [16].
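To illustrate the structure of Eq. (2.6), the following sketch evaluates its discrete counterpart for a finite embedding Y, with the Fréchet derivative supplied as a callable acting elementwise on the off-diagonal entries of p and q. This is only a sketch under our own discretization and naming assumptions, not an implementation taken from the literature.

```python
import numpy as np

def low_dim_q_tsne(Y):
    """Student-t similarities q_ij = (1 + r_ij)^(-1) / sum_kl (1 + r_kl)^(-1)."""
    r = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    w = 1.0 / (1.0 + r)
    np.fill_diagonal(w, 0.0)
    return w / w.sum(), r

def tsne_gradient(Y, P, frechet_dq):
    """Discrete counterpart of Eq. (2.6):
    dD/dy_i = -4 sum_j (1+r_ij)^(-1) q_ij (F_ij - sum_kl F_kl q_kl) (y_i - y_j),
    where F_ij = (dD/dq)(p_ij, q_ij) is given by the callable `frechet_dq`.
    """
    Q, r = low_dim_q_tsne(Y)
    mask = ~np.eye(Y.shape[0], dtype=bool)       # skip the zero diagonal
    F = np.zeros_like(Q)
    F[mask] = frechet_dq(P[mask], Q[mask])
    centred = F - np.sum(F * Q)                  # F minus its q-weighted mean
    W = (1.0 / (1.0 + r)) * Q * centred
    diff = Y[:, None, :] - Y[None, :, :]
    return -4.0 * np.sum(W[:, :, None] * diff, axis=1)
```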

2.2 The SNE gradient

In symmetric SNE, the pairwise similarities in the low-dimensional map are given by [9]

$$q_{SNE} = q_{SNE}(r(\xi,\eta)) = \frac{\exp(-r(\xi,\eta))}{\int\!\!\int \exp(-r(\xi',\eta'))\, d\xi'\, d\eta'},$$

which we will abbreviate below, for reasons of clarity, as

$$q_{SNE}(r) = \frac{\exp(-r)}{\int\!\!\int \exp(-r')\, d\xi'\, d\eta'} = g(r)\, J^{-1}.$$

Consequently, if we consider $\frac{\partial D}{\partial \xi}$, we can use the results from above for t-SNE. The only term that differs is the derivative of $q_{SNE}(r')$ with respect to $r$. Therefore we get

$$\frac{\partial q_{SNE}(r')}{\partial r} = \frac{\delta g(r')}{\delta r}\, J^{-1} - g(r')\, J^{-2}\, \frac{\delta J}{\delta r}$$

with

$$\frac{\delta g(r')}{\delta r} = -\delta_{r,r'}\, \exp(-r') \quad \text{and} \quad \frac{\delta J}{\delta r} = -\exp(-r),$$

which leads to

$$\frac{\partial q_{SNE}(r')}{\partial r} = -\frac{\delta_{r,r'}\, \exp(-r')}{J} + \frac{g(r')\, \exp(-r)}{J^2}
= -\delta_{r,r'}\, q_{SNE}(r') + q_{SNE}(r')\, q_{SNE}(r)
= -q_{SNE}(r') \left( \delta_{r,r'} - q_{SNE}(r) \right).$$

Substituting these results in eq. (2.5), we get

$$\frac{\partial D}{\partial r} = \int \frac{\delta D}{\delta q_{SNE}(r')}\, \frac{\partial q_{SNE}(r')}{\partial r}\, \Pi_r\, dr'
= -q_{SNE}(r) \left( \frac{\delta D}{\delta q_{SNE}}(r) - \int \frac{\delta D}{\delta q_{SNE}}(r')\, q_{SNE}(r')\, \Pi_r\, dr' \right).$$

Finally, substituting this result in eq. (2.3), we obtain

$$\frac{\partial D}{\partial \xi} = -4 \int q_{SNE}(r) \left( \frac{\delta D}{\delta q_{SNE}}(r) - \int \frac{\delta D}{\delta q_{SNE}}(r')\, q_{SNE}(r')\, \Pi_r\, dr' \right) (\xi - \eta)\, d\eta \qquad (2.7)$$

as the general formulation of the SNE cost function gradient, which, again, utilizes the Fréchet derivatives of the applied divergences as above for t-SNE.
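The same discretization applies to Eq. (2.7); only the model for q changes and the Student-t weight $(1+r)^{-1}$ disappears. Again a sketch with our own names, reusing the conventions of the t-SNE snippet above.

```python
import numpy as np

def low_dim_q_sne(Y):
    """Gaussian similarities q_ij = exp(-r_ij) / sum_kl exp(-r_kl) of symmetric SNE."""
    r = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    w = np.exp(-r)
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

def sne_gradient(Y, P, frechet_dq):
    """Discrete counterpart of Eq. (2.7); identical in structure to the t-SNE
    gradient except that the factor (1+r)^(-1) is absent."""
    Q = low_dim_q_sne(Y)
    mask = ~np.eye(Y.shape[0], dtype=bool)
    F = np.zeros_like(Q)
    F[mask] = frechet_dq(P[mask], Q[mask])
    W = Q * (F - np.sum(F * Q))
    diff = Y[:, None, :] - Y[None, :, :]
    return -4.0 * np.sum(W[:, :, None] * diff, axis=1)
```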

3 t-SNE gradients for various divergences

In this section we explain the t-SNE gradients for various divergences. There exists a large variety of different divergences, which can be collected into several classes according to their mathematical properties and structural behavior. Here we follow the classification proposed in [2]. For this purpose, we plug the Fréchet derivatives of these divergences or divergence families into the general gradient formula (2.6) for t-SNE. Clearly, one can convey these results easily to the general SNE gradient (2.7) in complete analogy, owing to its structural similarity to the t-SNE formula (2.6).

A technical remark should be given here: in the following we will abbreviate $p(r)$ by $p$ and $q(r)$ by $q$. Further, because the integration variable $r$ is a function $r = r(\xi,\eta)$, an integration requires the weighting according to the distribution $\Pi_r$. Thus, the integration formally has to be carried out according to the differential $d\Pi_r(r)$ (Stieltjes integral). We shorthand this simply as $dr$, but keep this fact in mind; by this convention, we will drop the distribution $\Pi_r$ if it is clear from the context.

3.1 Kullback-Leibler divergence and other Bregman divergences

As a first example we show that we obtain the same result as VAN DER MAATEN and HINTON in [9] for the Kullback-Leibler divergence

$$D_{KL}(p\|q) = \int p \log\left(\frac{p}{q}\right) dr.$$

The Fréchet derivative of $D_{KL}$ with respect to $q$ is given by

$$\frac{\delta D_{KL}}{\delta q} = -\frac{p}{q}.$$

From eq. (2.6) we see that

$$\frac{\partial D_{KL}}{\partial \xi} = 4 \int (1+r)^{-1} q \left( \frac{p}{q} - \int p\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta. \qquad (3.1)$$

Since the integral $I_p = \int p\, \Pi_r\, dr$ in (3.1) can be written as a double integral over all pairs of data points, $I_p = \int\!\!\int p\, d\xi\, d\eta$, we see from (1.1) that $I_p$ equals 1. So (3.1) simplifies to

$$\frac{\partial D_{KL}}{\partial \xi} = 4 \int (1+r)^{-1} q \left( \frac{p}{q} - 1 \right) (\xi - \eta)\, d\eta
= 4 \int (1+r)^{-1} (p - q)\, (\xi - \eta)\, d\eta. \qquad (3.2)$$

This formula is exactly the differential form of the discrete version as proposed for t-SNE in [9].
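As a sanity check on this correspondence (our own illustration, reusing the sketches given after Eq. (1.1) and Eq. (2.6)), plugging the Fréchet derivative $-p/q$ into the general gradient reproduces the classical discrete t-SNE gradient of Eq. (3.2):

```python
import numpy as np

def frechet_kl(p, q):
    """Fréchet derivative of D_KL(p||q) with respect to q: -p/q."""
    return -p / q

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
Y = rng.normal(size=(50, 2))
P = high_dim_affinities(X)              # sketch from the introduction
grad = tsne_gradient(Y, P, frechet_kl)  # sketch after Eq. (2.6)

# direct discrete form of Eq. (3.2): 4 sum_j (1+r_ij)^(-1) (p_ij - q_ij)(y_i - y_j)
Q, r = low_dim_q_tsne(Y)
direct = 4.0 * np.sum(((P - Q) / (1.0 + r))[:, :, None]
                      * (Y[:, None, :] - Y[None, :, :]), axis=1)
assert np.allclose(grad, direct)
```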

The Kullback-Leibler divergence used in original SNE and t-SNE belongs to the more general class of Bregman divergences [1]. Another famous representative of this class of divergences is the Itakura-Saito divergence $D_{IS}$ [6], defined as

$$D_{IS}(p\|q) = \int \left[ \frac{p}{q} - \log\left(\frac{p}{q}\right) - 1 \right] dr$$

with the Fréchet derivative

$$\frac{\delta D_{IS}}{\delta q} = \frac{1}{q^2}\,(q - p).$$

For the calculation of the gradient $\frac{\partial D_{IS}}{\partial \xi}$ we substitute the Fréchet derivative in eq. (2.6) and obtain

$$\frac{\partial D_{IS}}{\partial \xi} = -4 \int (1+r)^{-1} q \left( \frac{q-p}{q^2} - \int \frac{q-p}{q^2}\, q\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta$$

$$= 4 \int (1+r)^{-1} \left( \left( \frac{p}{q} - 1 \right) + q \int \left( 1 - \frac{p}{q} \right) \Pi_r\, dr \right) (\xi - \eta)\, d\eta. \qquad (3.3)$$

One more representative of the class of Bregman divergences is the norm-like divergence¹ $D_\theta$ with parameter $\theta$ [10]:

$$D_\theta(p\|q) = \int \left[ p^\theta + (\theta - 1)\, q^\theta - \theta\, p\, q^{\theta-1} \right] dr.$$

The Fréchet derivative of $D_\theta$ with respect to $q$ is given by

$$\frac{\delta D_\theta}{\delta q} = \theta\, (1 - \theta)\, (p - q)\, q^{\theta - 2}.$$

Again, we are interested in the gradient $\frac{\partial D_\theta}{\partial \xi}$, which is

$$\frac{\partial D_\theta}{\partial \xi} = 4\, \theta\, (\theta - 1) \int (1+r)^{-1} \left( (p - q)\, q^{\theta-1} - q \int (p - q)\, q^{\theta-1}\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta. \qquad (3.4)$$

The last example of Bregman divergences we handle in this paper is the class of β-divergences [2],[4], defined as

$$D_\beta(p\|q) = \int \left[ p \cdot \frac{p^{\beta-1} - q^{\beta-1}}{\beta - 1} - \frac{p^\beta - q^\beta}{\beta} \right] dr.$$

We use eq. (2.6) and insert the Fréchet derivative of the β-divergences, given by

$$\frac{\delta D_\beta}{\delta q} = q^{\beta-2}\, (q - p).$$

Thereby the gradient $\frac{\partial D_\beta}{\partial \xi}$ reads as

$$\frac{\partial D_\beta}{\partial \xi} = 4 \int (1+r)^{-1} \left( q^{\beta-1} (p - q) - q \int q^{\beta-1} (p - q)\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta. \qquad (3.5)$$

¹ We note that the norm-like divergence was denoted as η-divergence in previous work [16]. We substitute the parameter η with θ in this paper to avoid confusion in notation.
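The corresponding Fréchet derivatives can be passed to the general gradient sketch given after Eq. (2.6) without further changes; the helpers below are our own illustrative wrappers.

```python
def frechet_itakura_saito(p, q):
    """dD_IS/dq = (q - p) / q^2."""
    return (q - p) / q ** 2

def make_frechet_norm_like(theta):
    """dD_theta/dq = theta (1 - theta) (p - q) q^(theta - 2)."""
    return lambda p, q: theta * (1.0 - theta) * (p - q) * q ** (theta - 2.0)

def make_frechet_beta(beta):
    """dD_beta/dq = q^(beta - 2) (q - p)."""
    return lambda p, q: q ** (beta - 2.0) * (q - p)

# e.g. grad = tsne_gradient(Y, P, make_frechet_beta(2.0))
# (beta = 2 turns D_beta into half the squared L2 distance between p and q)
```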

3.2 Csiszár's f-divergences

Next we will consider some divergences belonging to the class of Csiszár's f-divergences [3],[8],[14]. A famous example is the Hellinger divergence [8], defined as

$$D_H(p\|q) = \int \left( \sqrt{p} - \sqrt{q} \right)^2 dr.$$

With the Fréchet derivative

$$\frac{\delta D_H}{\delta q} = 1 - \sqrt{\frac{p}{q}},$$

the gradient of $D_H$ with respect to $\xi$ is

$$\frac{\partial D_H}{\partial \xi} = -4 \int (1+r)^{-1} q \left( 1 - \sqrt{\frac{p}{q}} - \int \left( 1 - \sqrt{\frac{p}{q}} \right) q\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta$$

$$= 4 \int (1+r)^{-1} \left( \sqrt{p\, q} - q \int \sqrt{p\, q}\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta. \qquad (3.6)$$

The α-divergence

$$D_\alpha(p\|q) = \frac{1}{\alpha(\alpha-1)} \int \left[ p^\alpha q^{1-\alpha} - \alpha\, p + (\alpha - 1)\, q \right] dr$$

defines an important subclass of f-divergences [2]. With the Fréchet derivative

$$\frac{\delta D_\alpha}{\delta q} = \frac{1}{\alpha} \left( 1 - p^\alpha q^{-\alpha} \right)$$

it can be handled as follows:

$$\frac{\partial D_\alpha}{\partial \xi} = -\frac{4}{\alpha} \int (1+r)^{-1} q \left( 1 - p^\alpha q^{-\alpha} - \int \left( 1 - p^\alpha q^{-\alpha} \right) q\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta$$

$$= \frac{4}{\alpha} \int (1+r)^{-1} \left( p^\alpha q^{1-\alpha} - q \int p^\alpha q^{1-\alpha}\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta. \qquad (3.7)$$
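Again, only the Fréchet derivative changes; a short sketch with our own naming, for use with the general gradient given after Eq. (2.6):

```python
import numpy as np

def frechet_hellinger(p, q):
    """dD_H/dq = 1 - sqrt(p/q)."""
    return 1.0 - np.sqrt(p / q)

def make_frechet_alpha(alpha):
    """dD_alpha/dq = (1 - (p/q)^alpha) / alpha, for alpha not in {0, 1}."""
    return lambda p, q: (1.0 - (p / q) ** alpha) / alpha

# e.g. grad = tsne_gradient(Y, P, make_frechet_alpha(0.5))
```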

A widely applied divergence, closely related to the α-divergences, is the Tsallis divergence [15], defined as

$$D^T_\alpha(p\|q) = \frac{1}{1-\alpha} \left( 1 - \int p^\alpha q^{1-\alpha}\, dr \right).$$

Utilizing the Fréchet derivative of $D^T_\alpha$ with respect to $q$, that is

$$\frac{\delta D^T_\alpha}{\delta q} = -\left( \frac{p}{q} \right)^\alpha,$$

we can now compute the gradient of $D^T_\alpha$ with respect to $\xi$ from eq. (2.6):

$$\frac{\partial D^T_\alpha}{\partial \xi} = -4 \int (1+r)^{-1} q \left( -\left(\frac{p}{q}\right)^\alpha + \int \left(\frac{p}{q}\right)^\alpha q\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta$$

$$= 4 \int (1+r)^{-1} \left( p^\alpha q^{1-\alpha} - q \int p^\alpha q^{1-\alpha}\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta, \qquad (3.8)$$

which is also clear from eq. (3.7), since the Tsallis divergence is a rescaled version of the α-divergence for probability densities.

Now, as a last example from the class of Csiszár's f-divergences, we consider the Rényi divergences [12],[13], which are also closely related to the α-divergences and are defined as

$$D^R_\alpha(p\|q) = \frac{1}{\alpha - 1} \log\left( \int p^\alpha q^{1-\alpha}\, dr \right)$$

with the corresponding Fréchet derivative

$$\frac{\delta D^R_\alpha}{\delta q} = -\frac{p^\alpha q^{-\alpha}}{\int p^\alpha q^{1-\alpha}\, dr}.$$

Hence,

$$\frac{\partial D^R_\alpha}{\partial \xi} = \frac{4}{\int p^\alpha q^{1-\alpha}\, dr} \int (1+r)^{-1} \left( p^\alpha q^{1-\alpha} - q \int p^\alpha q^{1-\alpha}\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta$$

$$= 4 \int (1+r)^{-1} \left( \frac{p^\alpha q^{1-\alpha}}{\int p^\alpha q^{1-\alpha}\, dr} - q \right) (\xi - \eta)\, d\eta. \qquad (3.9)$$
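For the Tsallis and Rényi cases the sketch below follows the same pattern; note that the Rényi derivative is not purely elementwise, since it carries the global normalization $\int p^\alpha q^{1-\alpha}\, dr$, which here becomes a sum over all pairs handed to the callable. As before, the names are our own.

```python
import numpy as np

def make_frechet_tsallis(alpha):
    """dD^T_alpha/dq = -(p/q)^alpha."""
    return lambda p, q: -((p / q) ** alpha)

def make_frechet_renyi(alpha):
    """dD^R_alpha/dq = -p^alpha q^(-alpha) / sum(p^alpha q^(1-alpha))."""
    def frechet(p, q):
        s = np.sum(p ** alpha * q ** (1.0 - alpha))  # discrete analogue of the integral
        return -(p ** alpha) * q ** (-alpha) / s
    return frechet
```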

3.3 γ-divergences

A class of divergences which is very robust with respect to outliers is the class of γ-divergences [5], defined as

$$D_\gamma(p\|q) = \log\left[ \frac{\left( \int p^{\gamma+1}\, dr \right)^{\frac{1}{\gamma(\gamma+1)}} \left( \int q^{\gamma+1}\, dr \right)^{\frac{1}{\gamma+1}}}{\left( \int p\, q^\gamma\, dr \right)^{\frac{1}{\gamma}}} \right].$$

The Fréchet derivative of $D_\gamma(p\|q)$ with respect to $q$ is

$$\frac{\delta D_\gamma}{\delta q} = \frac{q^\gamma}{\int q^{\gamma+1}\, dr} - \frac{p\, q^{\gamma-1}}{\int p\, q^\gamma\, dr}
= \frac{q^\gamma}{Q_\gamma} - \frac{p\, q^{\gamma-1}}{V_\gamma}
= \frac{q^\gamma V_\gamma - p\, q^{\gamma-1} Q_\gamma}{Q_\gamma V_\gamma}$$

with the abbreviations $Q_\gamma = \int q^{\gamma+1}\, dr$ and $V_\gamma = \int p\, q^\gamma\, dr$.

Once again, we use eq. (2.6) to calculate the gradient of $D_\gamma$ with respect to $\xi$:

$$\frac{\partial D_\gamma}{\partial \xi} = -\frac{4}{Q_\gamma V_\gamma} \int (1+r)^{-1} q \left( q^\gamma V_\gamma - p\, q^{\gamma-1} Q_\gamma - \int \left( q^\gamma V_\gamma - p\, q^{\gamma-1} Q_\gamma \right) q\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta$$

$$= -\frac{4}{Q_\gamma V_\gamma} \int (1+r)^{-1} q \left( q^\gamma V_\gamma - p\, q^{\gamma-1} Q_\gamma - V_\gamma \int q^{\gamma+1}\, \Pi_r\, dr + Q_\gamma \int p\, q^\gamma\, \Pi_r\, dr \right) (\xi - \eta)\, d\eta$$

$$= -\frac{4}{Q_\gamma V_\gamma} \int (1+r)^{-1} q \left( q^\gamma V_\gamma - p\, q^{\gamma-1} Q_\gamma - V_\gamma Q_\gamma + Q_\gamma V_\gamma \right) (\xi - \eta)\, d\eta$$

$$= 4 \int (1+r)^{-1} \left( \frac{p\, q^\gamma}{\int p\, q^\gamma\, dr} - \frac{q^{\gamma+1}}{\int q^{\gamma+1}\, dr} \right) (\xi - \eta)\, d\eta. \qquad (3.10)$$

For the special choice γ = 1 the γ-divergence becomes the Cauchy-Schwarz divergence [11],[7]:

$$D_{CS}(p\|q) = \frac{1}{2} \log\left( \int q^2\, dr \int p^2\, dr \right) - \log \int p\, q\, dr,$$

and the gradient $\frac{\partial D_{CS}}{\partial \xi}$ for t-SNE can be directly deduced from eq. (3.10):

$$\frac{\partial D_{CS}}{\partial \xi} = 4 \int (1+r)^{-1} \left( \frac{p\, q}{\int p\, q\, dr} - \frac{q^2}{\int q^2\, dr} \right) (\xi - \eta)\, d\eta. \qquad (3.11)$$

Moreover, similar derivations can be made for any other divergence, since one only needs to calculate the Fréchet derivative of the divergence and apply it to (2.6).
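Since the q-weighted correction term vanishes for the γ-divergences, the gradient (3.10) can also be coded directly; the following sketch (our naming, γ > 0 assumed) covers the Cauchy-Schwarz case via γ = 1.

```python
import numpy as np

def tsne_gradient_gamma(Y, P, gamma=1.0):
    """Direct discrete form of Eq. (3.10); gamma = 1 gives Eq. (3.11):
    dD/dy_i = 4 sum_j (1+r_ij)^(-1)
              (p q^g / sum(p q^g) - q^(g+1) / sum(q^(g+1))) (y_i - y_j)."""
    r = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    w = 1.0 / (1.0 + r)
    np.fill_diagonal(w, 0.0)          # the diagonal contributes nothing anyway
    Q = w / w.sum()
    A = P * Q ** gamma                # p q^gamma
    B = Q ** (gamma + 1.0)            # q^(gamma+1)
    M = w * (A / A.sum() - B / B.sum())
    diff = Y[:, None, :] - Y[None, :, :]
    return 4.0 * np.sum(M[:, :, None] * diff, axis=1)
```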

4 Conclusion

In this article we provide the mathematical foundation for the generalization of t-SNE and the symmetric variant of SNE. This framework enables the application of any divergence as cost function for the gradient descent. For this purpose, we first deduced the gradient for t-SNE in a completely general form. As a result of that derivation we obtained a tool that enables us to utilize the Fréchet derivative of any divergence. Thereafter we gave the abstract gradient also for SNE. Finally, we calculated the concrete gradients for a wide range of important divergences. These results are summarized in Table 1.

Table 1: Divergence families and their t-SNE gradients

Kullback-Leibler divergence
  $D_{KL}(p\|q) = \int p \log\left(\frac{p}{q}\right) dr$
  $\frac{\partial D_{KL}}{\partial \xi} = 4 \int (1+r)^{-1} (p - q)(\xi - \eta)\, d\eta$

Itakura-Saito divergence
  $D_{IS}(p\|q) = \int \left[ \frac{p}{q} - \log\left(\frac{p}{q}\right) - 1 \right] dr$
  $\frac{\partial D_{IS}}{\partial \xi} = 4 \int (1+r)^{-1} \left( \left(\frac{p}{q} - 1\right) + q \int \left(1 - \frac{p}{q}\right) \Pi_r\, dr \right)(\xi - \eta)\, d\eta$

norm-like divergence
  $D_\theta(p\|q) = \int \left[ p^\theta + (\theta-1)\, q^\theta - \theta\, p\, q^{\theta-1} \right] dr$
  $\frac{\partial D_\theta}{\partial \xi} = 4\theta(\theta-1) \int (1+r)^{-1} \left( (p-q)\, q^{\theta-1} - q \int (p-q)\, q^{\theta-1}\, \Pi_r\, dr \right)(\xi - \eta)\, d\eta$

β-divergence
  $D_\beta(p\|q) = \int \left[ p \cdot \frac{p^{\beta-1} - q^{\beta-1}}{\beta-1} - \frac{p^\beta - q^\beta}{\beta} \right] dr$
  $\frac{\partial D_\beta}{\partial \xi} = 4 \int (1+r)^{-1} \left( q^{\beta-1}(p-q) - q \int q^{\beta-1}(p-q)\, \Pi_r\, dr \right)(\xi - \eta)\, d\eta$

Hellinger divergence
  $D_H(p\|q) = \int (\sqrt{p} - \sqrt{q})^2\, dr$
  $\frac{\partial D_H}{\partial \xi} = 4 \int (1+r)^{-1} \left( \sqrt{p q} - q \int \sqrt{p q}\, \Pi_r\, dr \right)(\xi - \eta)\, d\eta$

α-divergence
  $D_\alpha(p\|q) = \frac{1}{\alpha(\alpha-1)} \int \left[ p^\alpha q^{1-\alpha} - \alpha p + (\alpha-1) q \right] dr$
  $\frac{\partial D_\alpha}{\partial \xi} = \frac{4}{\alpha} \int (1+r)^{-1} \left( p^\alpha q^{1-\alpha} - q \int p^\alpha q^{1-\alpha}\, \Pi_r\, dr \right)(\xi - \eta)\, d\eta$

Tsallis divergence
  $D^T_\alpha(p\|q) = \frac{1}{1-\alpha} \left( 1 - \int p^\alpha q^{1-\alpha}\, dr \right)$
  $\frac{\partial D^T_\alpha}{\partial \xi} = 4 \int (1+r)^{-1} \left( p^\alpha q^{1-\alpha} - q \int p^\alpha q^{1-\alpha}\, \Pi_r\, dr \right)(\xi - \eta)\, d\eta$

Rényi divergence
  $D^R_\alpha(p\|q) = \frac{1}{\alpha-1} \log\left( \int p^\alpha q^{1-\alpha}\, dr \right)$
  $\frac{\partial D^R_\alpha}{\partial \xi} = 4 \int (1+r)^{-1} \left( \frac{p^\alpha q^{1-\alpha}}{\int p^\alpha q^{1-\alpha}\, dr} - q \right)(\xi - \eta)\, d\eta$

γ-divergence
  $D_\gamma(p\|q) = \log\left[ \frac{(\int p^{\gamma+1} dr)^{\frac{1}{\gamma(\gamma+1)}} (\int q^{\gamma+1} dr)^{\frac{1}{\gamma+1}}}{(\int p\, q^\gamma dr)^{\frac{1}{\gamma}}} \right]$
  $\frac{\partial D_\gamma}{\partial \xi} = 4 \int (1+r)^{-1} \left( \frac{p\, q^\gamma}{\int p\, q^\gamma\, dr} - \frac{q^{\gamma+1}}{\int q^{\gamma+1}\, dr} \right)(\xi - \eta)\, d\eta$

Cauchy-Schwarz divergence
  $D_{CS}(p\|q) = \frac{1}{2}\log\left( \int q^2 dr \int p^2 dr \right) - \log \int p\, q\, dr$
  $\frac{\partial D_{CS}}{\partial \xi} = 4 \int (1+r)^{-1} \left( \frac{p\, q}{\int p\, q\, dr} - \frac{q^2}{\int q^2\, dr} \right)(\xi - \eta)\, d\eta$

References

[1] L. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.

[2] A. Cichocki, R. Zdunek, A. Phan, and S.-I. Amari. Nonnegative Matrix and Tensor Factorizations. Wiley, Chichester, 2009.

[3] I. Csiszár. Information-type measures of differences of probability distributions and indirect observations. Studia Sci. Math. Hungarica, 2:299-318, 1967.

[4] S. Eguchi and Y. Kano. Robustifying maximum likelihood estimation. Technical Report 802, Institute of Statistical Mathematics, Tokyo, 2001.

[5] H. Fujisawa and S. Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99:2053-2081, 2008.

[6] F. Itakura and S. Saito. Analysis synthesis telephony based on the maximum likelihood method. In Proc. of the 6th International Congress on Acoustics, volume C, pages 17-20, Tokyo, 1968.

[7] R. Jenssen, J. Principe, D. Erdogmus, and T. Eltoft. The Cauchy-Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6):614-629, 2006.

[8] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.

[9] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.

[10] F. Nielsen and R. Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882-2903, 2009.

[11] J. C. Principe, J. W. Fisher III, and D. Xu. Information theoretic learning. In S. Haykin, editor, Unsupervised Adaptive Filtering. Wiley, New York, NY, 2000.

[12] A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1961.

[13] A. Rényi. Probability Theory. North-Holland Publishing Company, Amsterdam, 1970.

[14] I. Taneja and P. Kumar. Relative information of type s, Csiszár's f-divergence, and information inequalities. Information Sciences, 166:105-125, 2004.

[15] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479-487, 1988.

[16] T. Villmann and S. Haase. Mathematical aspects of divergence based vector quantization using Fréchet-derivatives (extended and revised version). Machine Learning Reports, 4(MLR-01-2010):1-35, 2010. ISSN: 1865-3960, http://www.uni-leipzig.de/~compint/mlr/mlr_01_2010.pdf.

[17] T. Villmann, S. Haase, F.-M. Schleif, B. Hammer, and M. Biehl. The mathematics of divergence based online learning in vector quantization. In F. Schwenker and N. El Gayar, editors, Artificial Neural Networks in Pattern Recognition, Proc. of the 4th IAPR Workshop (ANNPR 2010), volume 5998 of LNAI, pages 108-119, Berlin, 2010. Springer.


Impressum

Machine Learning Reports, ISSN: 1865-3960

Publisher/Editors: Prof. Dr. rer. nat. Thomas Villmann & Dr. rer. nat. Frank-Michael Schleif
University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
http://www.mni.hs-mittweida.de/

Copyright & Licence: Copyright of the articles remains with the authors. Requests regarding the content of the articles should be addressed to the authors. All articles are reviewed by at least two researchers in the respective field.

Acknowledgments: We would like to thank the reviewers for their time and patience.