Stochastic Neighbor Embedding (SNE) for Dimension Reduction and Visualization using arbitrary Divergences


Kerstin Bunte a,b, Sven Haase c, Michael Biehl a, Thomas Villmann c

a Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, P.O. Box 7, 97AK Groningen - The Netherlands
b University of Bielefeld - CITEC Center of Excellence, D-335 Bielefeld - Germany
c Department of Mathematics, University of Applied Sciences Mittweida - Germany

Email address: kerstin.bunte@googl .com (Kerstin Bunte), URL: kbunte (Kerstin Bunte)

Preprint submitted to Neurocomputing.

Abstract

We present a systematic approach to the mathematical treatment of the t-distributed Stochastic Neighbor Embedding (t-SNE) and the Stochastic Neighbor Embedding (SNE) method. This allows an easy adaptation of the methods or exchange of their respective modules. In particular, the divergence which measures the difference between probability distributions in the original and the embedding space can be treated independently from other components like, e.g., the similarity of data points or the data distribution. We focus on the extension to different divergences and propose a general framework based on the consideration of Fréchet derivatives. This way the general approach can be adapted to user-specific needs.

Keywords: Dimension Reduction, Visualization, Divergence optimization, Nonlinear embedding, Stochastic neighbor embedding.

1. Introduction

Various dimension reduction techniques have been introduced based on the aim of preserving specific properties of the original data. The spectrum ranges from linear projections of the original data, such as Principal Component Analysis (PCA) or classical Multidimensional Scaling (MDS) (), to a variety of locally linear and nonlinear approaches, such as Isomap (, 3), Locally Linear Embedding (LLE) (), Local Linear Coordination (LLC) (5), or charting (, 7). Other methods aim at the preservation of the classification accuracy in lower dimensions and incorporate the available label information for the embedding, e.g. Linear Discriminant Analysis (LDA) () and generalizations thereof (9), extensions of the Self-Organizing Map (SOM) () incorporating class labels (), and Limited Rank Matrix Learning Vector Quantization (LiRaM LVQ) (, 3).

For a comprehensive review on nonlinear dimensionality reduction methods, we refer to ().

Recently, the Stochastic Neighbor Embedding (SNE) (5) and extensions thereof have become popular for visualization. SNE approximates the probability distribution in the high-dimensional space, defined by neighboring points, with their probability distribution in a lower-dimensional space. In () the authors proposed a technique called t-SNE, which is a variation of SNE considering a particular statistical model assumption for data distributions. The similarity of the distributions is quantified in terms of the Kullback-Leibler divergence. In (7) it is argued that the preservation of shift-invariant similarities as employed by SNE and its variants is superior in comparison to distance preservation as performed by many traditional dimension reduction techniques.

Functional metrics like Sobolev distances, kernel-based dissimilarity measures and divergences have attracted attention recently for the processing of data showing a functional structure. These dissimilarity measures were for example investigated as alternatives to the most common choice, the Euclidean distance (, 9,,, ). The application of divergences for Vector Quantization and Learning Vector Quantization schemes has been investigated in (3, ). This work builds on (5), where the Self-Organized Neighbor Embedding (SONE), which can be seen as a hybrid between the Self-Organizing Map (SOM) and SNE, has been extended to the use of arbitrary divergences.

In this contribution, we formulate a mathematical framework based on Fréchet derivatives which allows us to generalize the concept of SNE and t-SNE to arbitrary divergences. This leads to a new dimension reduction and visualization scheme, which can be adapted to the user-specific requirements in an actual problem. We summarize the general classes of divergences following the scheme introduced by () and extended in (3). The mathematical framework for functional derivatives of continuous divergences is given by the functional-analytic generalization of common derivatives, known as Fréchet derivatives (7, ). It is the generalization of partial derivatives for the discrete variants of the divergences. We introduce a general mathematical framework for the extension of SNE and t-SNE to arbitrary divergences. The different classes of divergences are characterized and for various examples the Fréchet derivatives are identified. We demonstrate the proposed framework for the example case of the Gamma divergence. The behavior of different divergences stemming from the identified divergence families is shown on several examples in the image analysis domain.

2. Review of SNE and t-SNE

Generally, dimensionality reduction methods convert a high-dimensional data set {x_i}_{i=1}^n ∈ IR^N into low-dimensional data {ξ_i}_{i=1}^n ∈ IR^M. A probabilistic approach to visualize the structure of complex data sets, preserving neighbor similarities, is Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis (5). SNE converts high-dimensional Euclidean distances between data points into probabilities that represent similarities. The conditional probability p_{j|i} that a data point x_i would pick x_j as its neighbor is given by

p_{j|i} = exp(−‖x_i − x_j‖² / (2σ_i²)) / Σ_{k≠i} exp(−‖x_i − x_k‖² / (2σ_i²)),   (1)

with p_{i|i} = 0. The variance σ_i of the Gaussian centered around x_i is determined by a binary search procedure (). The density of the data is likely to vary: in dense regions a smaller value of σ_i is more appropriate than in sparse regions. Let P_i be the conditional probability distribution over all other data points given point x_i. This distribution has an entropy which increases as σ_i increases. SNE performs a binary search for the value of σ_i which produces a P_i with a fixed perplexity specified by the user. The perplexity is defined as

perpl(P_i) = 2^{H(P_i)},   (2)

where H(P_i) is the Shannon entropy of P_i measured in bits: H(P_i) = −Σ_j p_{j|i} log₂ p_{j|i}. It can be interpreted as a smooth measure of the effective number of neighbors, and typical values range between 5 and 50, dependent on the size of the data set. The low-dimensional counterparts ξ_i and ξ_j of the high-dimensional data points x_i and x_j are modeled by similar probabilities

q_{j|i} = exp(−‖ξ_i − ξ_j‖²) / Σ_{k≠i} exp(−‖ξ_i − ξ_k‖²),   (3)

with again q_{i|i} = 0. SNE tries to find a low-dimensional data representation which minimizes the mismatch between the conditional probabilities p_{j|i} and q_{j|i}. As a measure of mismatch the Kullback-Leibler divergence D_KL is used, such that the cost function of SNE is given by

C = Σ_i D_KL(P_i ‖ Q_i) = Σ_i Σ_j p_{j|i} log (p_{j|i} / q_{j|i}),   (4)

where Q_i is defined, similarly to P_i, as the conditional probability distribution over all other points given ξ_i. The cost function is not symmetric and focuses on retaining the local structure of the data in the mapping. Large costs appear for mapping nearby data points widely separated in the embedding, but there is only a small cost for mapping widely separated data points close together. The minimization of the cost function Eq. (4) is performed using a gradient descent approach. For details we refer to (5).

The so-called crowding problem may be observed in SNE and other local techniques, like for example Sammon mapping (). The (even very small) attractive forces might crush moderately dissimilar points together in the center of the map.
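To make the perplexity calibration concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2); it is not taken from the paper, and the function name, tolerances and search bounds are our own choices. For each point the width σ_i is tuned by binary search over β_i = 1/(2σ_i²) until the perplexity of P_i matches a user-chosen target.

```python
import numpy as np

def conditional_p(X, target_perplexity=30.0, tol=1e-5, max_iter=50):
    """Compute p_{j|i} as in Eq. (1); each sigma_i is tuned by binary search
    so that the perplexity of P_i (Eq. (2)) matches the target."""
    n = X.shape[0]
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    P = np.zeros((n, n))
    for i in range(n):
        lo, hi = 1e-20, 1e20      # search bounds for beta_i = 1 / (2 sigma_i^2)
        beta = 1.0
        d_i = np.delete(D[i], i)
        for _ in range(max_iter):
            p = np.exp(-d_i * beta)
            p /= np.sum(p)
            H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy in bits
            if abs(2.0 ** H - target_perplexity) < tol:
                break
            if 2.0 ** H > target_perplexity:      # too many effective neighbors: increase beta
                lo = beta
                beta = beta * 2.0 if hi > 1e19 else (lo + hi) / 2.0
            else:                                 # too few effective neighbors: decrease beta
                hi = beta
                beta = (lo + hi) / 2.0
        P[i, np.arange(n) != i] = p
    return P
```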

To address this problem, van der Maaten and Hinton presented in () a technique called t-SNE, which is a variation of SNE considering another statistical model assumption for the data distribution. Instead of using the conditional probabilities p_{j|i} and q_{j|i}, the joint probability distributions P and Q are used to optimize a symmetric version of SNE with the cost function

C = D_KL(P ‖ Q) = Σ_i Σ_j p_ij log (p_ij / q_ij)   (5)

with p_ii = q_ii = 0. Here, the pairwise similarities in the high-dimensional space are defined by the conditional probabilities

p_ij = (p_{j|i} + p_{i|j}) / (2n)   (6)

and the low-dimensional similarities are given by

q_ij = (1 + ‖ξ_i − ξ_j‖²)^{−1} / Σ_{k≠l} (1 + ‖ξ_k − ξ_l‖²)^{−1}.   (7)

The application of the heavy-tailed Student t-distribution with one degree of freedom allows moderate distances in the high-dimensional space to be modeled by much larger distances in the embedding. Therefore, the unwanted attractive forces between map points that represent moderately dissimilar data points are eliminated. See () for further details.
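For reference, a small sketch of the discrete quantities of Eqs. (6) and (7), again under the assumption of a NumPy setting and with our own function names: the symmetrized joint probabilities p_ij and the Student-t similarities q_ij.

```python
import numpy as np

def joint_p(P_conditional):
    """Symmetrized joint probabilities of Eq. (6): p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_conditional.shape[0]
    return (P_conditional + P_conditional.T) / (2.0 * n)

def joint_q_student_t(Xi):
    """Low-dimensional similarities of Eq. (7), based on a Student t-distribution with
    one degree of freedom: q_ij = (1 + ||xi_i - xi_j||^2)^-1 / sum_{k != l} (1 + ||xi_k - xi_l||^2)^-1."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    num = 1.0 / (1.0 + r)
    np.fill_diagonal(num, 0.0)   # q_ii = 0
    return num / np.sum(num)
```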

3. A generalized framework

In this article we provide the mathematical framework for the generalization of t-SNE and SNE with respect to the use of arbitrary divergences in the cost function for the gradient descent. We generalize the definitions towards continuous measures in the high-dimensional space X = {x, y} and a low-dimensional space E = {ξ, ζ} ⊆ IR^M. The pairwise similarities in the high-dimensional original data space are set to

p = p_xy = (p_{y|x} + p_{x|y}) / (2 ∫ dy)   (8)

with conditional probabilities

p_{y|x} = exp(−‖x − y‖² / (2σ_x²)) / ∫ exp(−‖x − y′‖² / (2σ_x²)) dy′.

3.1. The generalized t-SNE gradient

Let D(p ‖ q) be a divergence for non-negative integrable measure functions p = p(r) and q = q(r) with a domain V, and let ξ, ζ ∈ E be distributed according to Π_E (). Further, let r(ξ, ζ) : E × E → IR with the distribution Π_r = φ(r, Π_E). Let us use the squared Euclidean distance in the low-dimensional space:

r = r(ξ, ζ) = ‖ξ − ζ‖².   (9)

For t-SNE, q is obtained by means of a Student t-distribution, such that

q(r(ξ, ζ′)) = (1 + r(ξ, ζ′))^{−1} / ∫∫ (1 + r(ξ, ζ))^{−1} dξ dζ,   (10)

which we will abbreviate below, for reasons of clarity, as

q(r′) = (1 + r′)^{−1} / ∫∫ (1 + r)^{−1} dξ dζ = f(r′) / I.   (11)

Now let us consider the derivative of D with respect to ξ:

∂D/∂ξ = ∂D(p, q(r(ξ, ζ)))/∂ξ = ∫∫ (δD/δr) (∂r/∂ξ) dξ′ dζ′ = ∫∫ (δD/δr(ξ′, ζ′)) (∂r(ξ′, ζ′)/∂ξ) dξ′ dζ′ = 2 ∫ (δD/δr(ξ, ζ)) (ξ − ζ) dζ.   (12)

We now have to consider δD/δr(ξ, ζ). Again, using the chain rule for functional derivatives we get

δD/δr(ξ, ζ) = ∫∫ [δD/δq(r(ξ′, ζ′))] [δq(r(ξ′, ζ′))/δr(ξ, ζ)] dξ′ dζ′ = ∫ [δD/δq(r′)] [δq(r′)/δr] Π_r dr′,   (13)

where for δq(r′)/δr the following holds, with δf(r′)/δr = −δ_{r,r′} (1 + r)^{−2} and δI/δr = −(1 + r)^{−2}:

δq(r′)/δr = [δf(r′)/δr] / I − [f(r′)/I²] δI/δr
= −δ_{r,r′} (1 + r)^{−2} / I + [f(r′)/I] (1 + r)^{−2} / I
= [q(r)/(1 + r)] [q(r′) − δ_{r,r′}]
= −[q(r)/(1 + r)] (δ_{r,r′} − q(r′)).

Substituting these results in Eq. (13), we get

δD/δr = ∫ [δD/δq(r′)] [δq(r′)/δr] Π_r dr′
= −[q(r)/(1 + r)] ∫ [δD/δq(r′)] (δ_{r,r′} − q(r′)) Π_r dr′
= −[q(r)/(1 + r)] [ δD/δq(r) − ∫ [δD/δq(r′)] q(r′) Π_r dr′ ].

Finally, collecting all terms, we get

∂D/∂ξ = 2 ∫ (δD/δr) (ξ − ζ) dζ = −2 ∫ [q(r)/(1 + r)] [ δD/δq(r) − ∫ [δD/δq(r′)] q(r′) Π_r dr′ ] (ξ − ζ) dζ.   (14)

We now have the obvious advantage that we can derive ∂D/∂ξ for several divergences D(p ‖ q) directly from Eq. (14), if the Fréchet derivative δD/δq(r) of D with respect to q(r) is known. The concept of Fréchet derivatives and explicit formulas for different divergences are given in Section 6.
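As an illustration of how Eq. (14) acts as a plug-in scheme, the following sketch evaluates a discrete analogue of the generalized t-SNE gradient for a user-supplied, element-wise Fréchet derivative δD/δq. The prefactor 4 and the sign convention are our assumption, chosen so that the Kullback-Leibler case reproduces the standard discrete t-SNE gradient (cf. Section 7); function names are ours, not the authors'.

```python
import numpy as np

def generalized_tsne_gradient(Xi, P, frechet_dq):
    """Discrete sketch of Eq. (14):
    grad_i = -4 * sum_j q_ij / (1 + r_ij) * (dD/dq(q_ij) - sum_kl dD/dq(q_kl) q_kl) * (xi_i - xi_j).
    `frechet_dq(P, Q)` must return the element-wise Fréchet derivative dD/dq evaluated at q_ij."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + r)
    np.fill_diagonal(num, 0.0)
    Q = num / np.sum(num)                 # Student-t similarities, Eq. (7)
    dD = frechet_dq(P, Q)                 # element-wise dD/dq(q_ij)
    const = np.sum(dD * Q)                # discrete analogue of the integral term in Eq. (14)
    W = Q * num * (dD - const)            # q_ij / (1 + r_ij) * [dD/dq - const]
    # assumed factor 4: 2 from dr/dxi plus 2 from the symmetric double sum over pairs
    return -4.0 * (np.sum(W, axis=1)[:, None] * Xi - W @ Xi)

def kl_frechet(P, Q, eps=1e-12):
    """Plugging in dD_KL/dq = -p/q recovers the familiar t-SNE gradient
    4 * sum_j (p_ij - q_ij)(xi_i - xi_j) / (1 + r_ij)."""
    return -P / (Q + eps)
```

Exchanging `kl_frechet` for any other Fréchet derivative from Section 6 changes the embedding objective without touching the rest of the optimization.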

3.2. The generalized SNE gradient

In symmetric SNE, the pairwise similarities in the low-dimensional map are given by ()

q_SNE = q_SNE(r(ξ, ζ′)) = exp(−r(ξ, ζ′)) / ∫∫ exp(−r(ξ, ζ)) dξ dζ,

which we will abbreviate below, for reasons of clarity, as

q_SNE(r′) = exp(−r′) / ∫∫ exp(−r) dξ dζ = g(r′) / J,   (15)

with g(r′) = exp(−r′) and J representing the integral in the denominator. Consequently, if we consider ∂D/∂ξ, we can use the results from above for t-SNE. The only term that differs is the derivative of q_SNE(r′) with respect to r. With

δg(r′)/δr = −δ_{r,r′} exp(−r) and δJ/δr = −exp(−r)

we get

δq_SNE(r′)/δr = [δg(r′)/δr] / J − [g(r′)/J²] δJ/δr
= −δ_{r,r′} exp(−r)/J + [g(r′)/J] exp(−r)/J
= −δ_{r,r′} q_SNE(r) + q_SNE(r′) q_SNE(r)
= −q_SNE(r) (δ_{r,r′} − q_SNE(r′)).

Substituting these results in Eq. (13), we get

δD/δr = ∫ [δD/δq_SNE(r′)] [δq_SNE(r′)/δr] Π_r dr′
= −q_SNE(r) ∫ [δD/δq_SNE(r′)] (δ_{r,r′} − q_SNE(r′)) Π_r dr′
= −q_SNE(r) [ δD/δq_SNE(r) − ∫ [δD/δq_SNE(r′)] q_SNE(r′) Π_r dr′ ].

Finally, substituting this result in Eq. (12), we obtain

∂D/∂ξ = 2 ∫ (δD/δr) (ξ − ζ) dζ = −2 ∫ q_SNE(r) (ξ − ζ) [ δD/δq_SNE(r) − ∫ [δD/δq_SNE(r′)] q_SNE(r′) Π_r dr′ ] dζ   (16)

as the general formulation of the SNE cost-function gradient, which uses the Fréchet derivatives of the applied divergences just as above for t-SNE.
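For completeness, a corresponding discrete sketch of the SNE variant, Eq. (16): the only changes with respect to the t-SNE sketch above are the Gaussian similarity model of Eq. (15) and the missing (1 + r)^{-1} weight. The prefactor and naming are again our own assumptions.

```python
import numpy as np

def generalized_sne_gradient(Xi, P, frechet_dq):
    """Discrete sketch of Eq. (16) with Gaussian similarities q_SNE of Eq. (15)."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)
    num = np.exp(-r)
    np.fill_diagonal(num, 0.0)
    Q = num / np.sum(num)                       # q_SNE, Eq. (15)
    dD = frechet_dq(P, Q)                       # element-wise dD/dq(q_ij)
    W = Q * (dD - np.sum(dD * Q))               # q_SNE(r) * [dD/dq - integral term]
    return -4.0 * (np.sum(W, axis=1)[:, None] * Xi - W @ Xi)
```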

4. Specifications of Divergences

Divergences are derived from simple component-wise errors, e.g. the Euclidean and Minkowski metrics (). These frequently used metrics are intuitive and they are optimal estimators in case of Gaussian noise or error. However, if the observations are corrupted not only by Gaussian noise but also by outliers, estimators based on these metrics can be strongly biased. They also suffer from the curse of dimensionality, which means that observations become equidistant in terms of the Euclidean distance for high-dimensional data. In many applications like pattern matching, image analysis, statistical learning, etc., the noise is not necessarily Gaussian and information divergences are used. Employing generalized divergences might provide a compromise between efficiency and robustness and/or a compromise between mean squared error and bias.

Divergences are functionals D(p ‖ q) designed as dissimilarity measures between two non-negative integrable functions p and q (). In practice, p usually corresponds to the observed data and q denotes the estimated or expected data. We assume p(r) and q(r) are positive measures defined on r in the domain V. The weight of the functional p is defined as

W(p) = ∫_V p(r) dr.   (17)

Positive measures with the additional constraint W(p) = 1 can be interpreted as probability density functions. Generally speaking, divergences measure a quasi-distance or directed difference, while we are mostly interested in separable measures, which satisfy the condition

D(p ‖ q) > 0 for p ≢ q and D(p ‖ q) = 0 iff p ≡ q.   (18)

In contrast to a metric, divergences may be non-symmetric, D(p ‖ q) ≠ D(q ‖ p), and do not necessarily satisfy the triangle inequality D(p ‖ q) ≤ D(p ‖ z) + D(z ‖ q). Following () one can distinguish at least three main families of divergences with the same consistent properties: Bregman divergences, Csiszár's f-divergences and γ-divergences. Note that all these families contain the Kullback-Leibler (KL) divergence as a special case, so the KL divergence can be seen as the non-empty intersection between the sets of divergences. In general we assume p and q to be positive measures. In case they are normalized we refer to them as probability densities. We review some basic properties of divergences in the following sections. For detailed information we refer to (, 9). An overview of the families of divergences, examples and their relationship to each other can be found in Figure 1. Some important properties are summarized in Tables 1 and 2. We review the families of divergences and some examples in the following sections.

Figure 1: Overview of the families of divergences and their relationship to each other. The shortcut Prob. denotes the special case of probability densities. For the sake of clarity we show the most important relations only and do not claim completeness. (The diagram relates the Bregman, Csiszár-f and Gamma families and members such as the generalized Kullback-Leibler, Itakura-Saito, Eta-, Beta-, Euclidean, Alpha-, Rényi, Tsallis, Hellinger and Cauchy-Schwarz divergences.)

Table 1: Divergences and their properties. The example divergences inherit the properties of the divergence family and sometimes show additional properties, stated individually. The shortcut (pd) denotes that the divergence is defined only for probability densities.

- Bregman divergence D_B^Φ(p‖q) = Φ(p) − Φ(q) − ∫ (δΦ(q)/δq)(p − q) dr; generalized entropy H_Φ(p) = −Φ(p); properties: convexity in p, linearity in Φ, invariance under affine transformations of Φ, three-point property, generalized Pythagorean theorem.
  - generalized Kullback-Leibler [Φ(u) = ∫ (u log u − u) dr]: related to the Shannon entropy H_S(p) = −∫ p ln(p) dr.
  - Itakura-Saito [Φ(u) = −∫ log u dr]: related to the Burg entropy H_B(p) = ∫ log(p) dr; scale invariant.
  - Eta-divergence [Φ(u) = ∫ u^η dr, η > 1].
  - Beta-divergence [Φ(u) = (u^β − β u + β − 1)/(β(β − 1))]: related to the Tsallis entropy; scaling D_β(cp‖cq) = c^β D_β(p‖q); a rescaled Eta-divergence.
  - Euclidean [Φ(u) = u²]: symmetric.
- Gamma divergence: related to the Rényi entropy; scale invariant, D_γ(c₁p‖c₂q) = D_γ(p‖q); modified Pythagorean theorem.
  - Cauchy-Schwarz (γ = 1): symmetric; based on the Cauchy-Schwarz inequality.

Table 2: Divergences and their properties (continued).

- generalized Csiszár-f divergence D_f^G(p‖q) = c_f ∫ (p − q) dr + ∫ q f(p/q) dr; generalized entropy H_f(p) = −∫ f(p) dr; properties: convexity in both p and q, scaling c D_f = D_{cf} for c > 0, invariance under bijective transformations, symmetrization via f_sym(u) = f(u) + f*(u), upper bound.
- Csiszár f-divergence (pd) D_f(p‖q) = ∫ q f(p/q) dr; same properties as above; bounded.
  - Alpha divergence [f(u) = (u^α − α u + α − 1)/(α(α − 1)), u = p/q]: related to the Tsallis entropy; convexity in both p and q; scaling D_α(cp‖cq) = c D_α(p‖q); duality D_α(p‖q) = D_{1−α}(q‖p); continuity in α.
  - Hellinger (pd) [f(u) = (√u − 1)²]: symmetric.
  - Tsallis (pd): related to the Tsallis entropy; a rescaled Alpha divergence.

4.1. Bregman Divergences

A Bregman divergence is defined as a pseudo-distance between two positive measures p and q: D_B(p‖q) : L × L → IR⁺. Let Φ be a strictly convex real-valued function with the domain of the Lebesgue-integrable functions L, twice continuously Fréchet-differentiable (). Then the Bregman divergence can be defined by

D_B^Φ(p‖q) = Φ(p) − Φ(q) − ∫ [δΦ(q)/δq] (p − q) dr,   (19)

where δΦ(q)/δq is the Fréchet derivative of Φ with respect to q (3). Well-known fundamental properties of the Bregman divergences are ():

Convexity. A Bregman divergence is always convex in its first argument but not necessarily in its second.

Non-negativity. D_B^Φ(p‖q) ≥ 0 and D_B^Φ(p‖q) = 0 iff p ≡ q.   (20)

Linearity. Bregman divergences are linear in the generating function Φ. Any positive linear combination of Bregman divergences is also a Bregman divergence: D_B^{c₁Φ₁+c₂Φ₂}(·‖·) = c₁ D_B^{Φ₁}(·‖·) + c₂ D_B^{Φ₂}(·‖·), c₁, c₂ > 0.

Invariance. A Bregman divergence is invariant under affine transformations. Thus, D_B^Γ(p‖q) = D_B^Φ(p‖q) is valid for any affine transformation

Γ(q) = Φ(q) + Ψ_g[q] + c   (21)

with linear operator

Ψ_g[q] = [δΓ(g)/δg](q) − [δΦ(g)/δg](q)   (22)

for positive measures g and q and scalar c.

Three-point property. For any triple p, q, g of positive measures the property holds:

D_B^Φ(p‖g) = D_B^Φ(p‖q) + D_B^Φ(q‖g) + ∫ (p − q) [δΦ(q)/δq − δΦ(g)/δg] dr.   (23)

Generalized Pythagorean theorem. Let P_Ω(q) = argmin_{ω∈Ω} D_B^Φ(ω‖q) be the Bregman projection onto the convex set Ω and p ∈ Ω. The inequality

D_B^Φ(p‖q) ≥ D_B^Φ(p‖P_Ω(q)) + D_B^Φ(P_Ω(q)‖q)   (24)

is known as the generalized Pythagorean theorem. If Ω is an affine set it holds with equality.

Optimality. In (3) an optimality property is stated. Given a set S of positive measures p with mean μ = E[S] and μ ∈ S, the mean μ is the unique minimizer of E_{p∈S}[D(p‖q)] if D is a Bregman divergence. This property favors the Bregman divergences for optimization and clustering problems (3, 3, 33, 3, 35).

The Bregman divergences include many prominent dissimilarity measures like (, 3, 3):

The generalized Kullback-Leibler (or I-) divergence for positive measures p and q:

D_GKL(p‖q) = ∫ p log(p/q) dr − ∫ (p − q) dr   (25)

using the generating function

Φ(f) = ∫ (f log f − f) dr.   (26)

Some 3-dimensional isosurfaces for the generalized Kullback-Leibler divergence with respect to different reference points can be found in the first column of Figures 2 and 4. For probability densities p and q, Eq. (25) simplifies to the Kullback-Leibler divergence (37, 3):

D_KL(p‖q) = ∫ p log(p/q) dr,   (27)

which is related to the Shannon entropy (39). Equidistance contours for 3-dimensional probability densities using the Kullback-Leibler divergence with respect to different reference points are displayed in the first row of Figures 3 and 5.

The Itakura-Saito divergence ():

D_IS(p‖q) = ∫ [p/q − log(p/q) − 1] dr   (28)

is based on the Burg entropy, which also serves as the generating function:

Φ(f) = −∫ log(f) dr.   (29)

The Itakura-Saito divergence was originally presented as a measure of the quality of fits between two spectra and became a standard measure in the speech and image processing community due to the good perceptual properties of the reconstructed signals. It is known as the negative cross-Burg entropy and fulfills the scale-invariance property D_IS(c·p ‖ c·q) = D_IS(p‖q), which implies that the same relative weight is given to low and high components of p, see () for details.

The Eta-divergence is also known as norm-like divergence ():

D_η(p‖q) = ∫ [p^η + (η − 1) q^η − η p q^{η−1}] dr   (30)

with generating function

Φ(f) = ∫ f^η dr for η > 1.   (31)

In the case η = 2 the Eta-divergence becomes the squared Euclidean distance with generating function Φ(f) = ∫ f² dr.

The Beta-divergence ():

D_β(p‖q) = ∫ p (p^{β−1} − q^{β−1}) / (β − 1) dr − ∫ (p^β − q^β) / β dr   (32)

with β ≠ 0 and β ≠ 1 and the generating function

Φ(f) = (f^β − β f + β − 1) / (β (β − 1)).   (33)

For specific values of β the divergence becomes:
β → 1: generalized Kullback-Leibler divergence, Eq. (25)
β → 0: Itakura-Saito divergence, Eq. (28)
β = 2: Euclidean distance (up to a constant factor).
Furthermore, the Beta-divergence is equivalent to the density power divergence (3, 3, ) and a rescaled version of the Eta-divergence.
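For experimentation with the Bregman examples above on discrete positive measures such as histograms, a small NumPy sketch may look as follows; the formulas follow Eqs. (25), (28), (30) and (32), while the regularization constants and default parameters are our own choices.

```python
import numpy as np

def gen_kl(p, q, eps=1e-12):
    """Generalized Kullback-Leibler divergence, Eq. (25), for positive measures."""
    return np.sum(p * np.log((p + eps) / (q + eps)) - (p - q))

def itakura_saito(p, q, eps=1e-12):
    """Itakura-Saito divergence, Eq. (28)."""
    ratio = (p + eps) / (q + eps)
    return np.sum(ratio - np.log(ratio) - 1.0)

def eta_divergence(p, q, eta=2.0):
    """Eta- (norm-like) divergence, Eq. (30); eta = 2 gives the squared Euclidean distance."""
    return np.sum(p**eta + (eta - 1.0) * q**eta - eta * p * q**(eta - 1.0))

def beta_divergence(p, q, beta=0.5, eps=1e-12):
    """Beta-divergence, Eq. (32), for beta not in {0, 1}."""
    p, q = p + eps, q + eps
    return np.sum(p * (p**(beta - 1) - q**(beta - 1)) / (beta - 1)
                  - (p**beta - q**beta) / beta)
```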

4.2. Csiszár-f Divergences

Csiszár-f divergences are connected with the ratio test in the Pearson-Neyman style hypothesis testing and are in many ways natural concerning distributions and statistics. We denote by F the class of convex, real-valued, continuous functions f satisfying f(1) = 0, with

F = {g | g : [0, ∞) → IR, g convex}.   (34)

For a function f ∈ F the Csiszár f-divergence is given by

D_f(p‖q) = ∫ q f(p/q) dr   (35)

with the definitions 0·f(0/0) = 0 and 0·f(a/0) = lim_{ε→0} ε f(a/ε) = a lim_{u→∞} f(u)/u (5, , 7, ). The f-divergence can be interpreted as an average of the likelihood ratio p/q, describing the change rate of p with respect to q, weighted by the determining function f. For a general f, which does not have to be convex, with f′(1) = c_f ≠ 0, this form is not invariant and we have to use the generalized f-divergence

D_f^G(p‖q) = c_f ∫ (p − q) dr + ∫ q f(p/q) dr.   (36)

For the special case of probability densities p and q the first term vanishes and the original form of the f-divergence is obtained. Some basic properties of the Csiszár f-divergence are (9, ):

Non-negativity. D_f(p‖q) ≥ 0, where the equal sign holds iff p ≡ q; this follows from Jensen's inequality.

Generalized entropy. It corresponds to a generalized f-entropy of the form

H_f(p) = −∫ f(p(r)) dr.   (37)

Strict convexity. The f-divergence is convex in both arguments p and q:

D_f(t p₁ + (1 − t) p₂ ‖ t q₁ + (1 − t) q₂) ≤ t D_f(p₁‖q₁) + (1 − t) D_f(p₂‖q₂), t ∈ [0, 1].   (38)

Scalability. c D_f(p‖q) = D_{cf}(p‖q) for any positive constant c > 0.

Invariance. D_f(p‖q) is invariant with respect to a linear shift regarding the function f: e.g. D_f(p‖q) = D_f̃(p‖q) iff f̃(u) = f(u) + c (u − 1) for any constant c ∈ IR.

Symmetry. For f, f* ∈ F, where f*(u) = u f(1/u) denotes the conjugate function of f, the relation D_f(p‖q) = D_{f*}(q‖p) is valid. It is possible to construct a symmetric Csiszár f-divergence with f_sym(u) = f(u) + f*(u) as determining function.

Upper bound. The f-divergence is bounded by

D_f(p‖q) ≤ lim_{u→0⁺} {f(u) + f*(u)} with u = p/q.   (39)

The existence of this limit for probability densities p and q was shown by Liese and Vajda in (5). Villmann and Haase showed that these bounds still hold for positive measures p and q (3).

Monotonicity. The f-divergence is monotonic with respect to coarse-graining of the underlying domain of the positive measures p and q, which is similar to the monotonicity of the Fisher metric (7).

Some well-known examples of f-divergences are:

The subset of Alpha divergences ()

D_α(p‖q) = 1/(α(α−1)) ∫ [p^α q^{1−α} − α p + (α − 1) q] dr   (40)

is based on the determining function

f(u) = (u^α − α u + α − 1) / (α(α − 1)) with u = p/q   (41)

and α ∈ IR \ {0, 1}. For specific values of α the divergence becomes ():
α → 1: generalized Kullback-Leibler divergence, Eq. (25)
α → 0: reverse Kullback-Leibler divergence
α = −1: Neyman Chi-square distance
α = 2: Pearson Chi-square distance.
For α ≤ 0 the divergence is zero-forcing, i.e. p(r) = 0 enforces q(r) = 0. On the other hand, for α ≥ 1 it is zero-avoiding, i.e. q(r) > 0 whenever p(r) > 0. For α ≥ 1, q(r) covers p(r) completely and the Alpha divergence is called inclusive in this case. Furthermore, the Beta-divergences can be generated from the Alpha divergences by applying a nonlinear transformation (, 3).

The generalized Rényi divergence (5, )

D_GR^α(p‖q) = 1/(α(α−1)) log ( ∫ [p^α q^{1−α} − α p + (α − 1) q] dr + 1 ),   (42)

α ∈ IR \ {0, 1}, is closely related to the Alpha divergence. For the special case of probability densities the generalized Rényi divergence reduces to the Rényi divergence (5, 53)

D_R^α(p‖q) = 1/(α−1) log ( ∫ p^α q^{1−α} dr ),   (43)

which is based on the Rényi entropy.

The Tsallis divergence

D_T^α(p‖q) = 1/(1−α) ( 1 − ∫ p^α q^{1−α} dr )   (44)

for α ≠ 1 is a widely applied divergence for probability densities p and q based on the Tsallis entropy. It is also a rescaled version of the Alpha divergence. In the limit α → 1 it converges to the Kullback-Leibler divergence, Eq. (27).

The Hellinger divergence ()

D_H(p‖q) = ∫ (√p − √q)² dr   (45)

with generating function f(u) = (√u − 1)² for u = p/q is defined for probability densities p and q.
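Analogously, a sketch of the Csiszár-f examples of Eqs. (40) and (43)-(45) for discrete vectors; the Tsallis form is written as the usual rescaling of the Alpha divergence, and the default parameter values are ours.

```python
import numpy as np

def alpha_divergence(p, q, alpha=2.0):
    """Alpha divergence, Eq. (40), for alpha not in {0, 1}."""
    return np.sum(p**alpha * q**(1 - alpha) - alpha * p + (alpha - 1) * q) / (alpha * (alpha - 1))

def renyi_divergence(p, q, alpha=0.5):
    """Rényi divergence, Eq. (43), for probability densities."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def tsallis_divergence(p, q, alpha=0.5):
    """Tsallis divergence, Eq. (44), for probability densities."""
    return (1.0 - np.sum(p**alpha * q**(1 - alpha))) / (1.0 - alpha)

def hellinger_divergence(p, q):
    """Hellinger divergence, Eq. (45), for probability densities."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
```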

Figure 2: Isosurfaces of some example divergences, including the plane of probability densities, with respect to the reference point (.3, .3, .3). The first column shows Bregman divergences, the second Csiszár-f divergences and the last column shows the Gamma divergence for different values of γ.

Figure 3: Equidistance lines of some example divergences for probability densities with respect to the reference point (.3, .3, .3). The columns show Bregman divergences, Csiszár-f divergences and Gamma divergences. (Panels include the Kullback-Leibler, Itakura-Saito, Eta-, Beta-, Alpha-, Rényi, Tsallis, Hellinger and Gamma divergences for several parameter values.)

Figure 4: Isosurfaces of some example divergences with respect to the reference point (.5, ., .3). The cutoffs show the equidistance lines for this plane. The first column shows Bregman divergences, the second Csiszár-f divergences and the last column shows the Gamma divergence for different values of γ.

Figure 5: Equidistance lines of some example divergences for probability densities with respect to the reference point (.5, ., .3). The columns show Bregman divergences, Csiszár-f divergences and Gamma divergences. (Panels include the Kullback-Leibler, Itakura-Saito, Eta-, Beta-, Alpha-, Rényi, Tsallis, Hellinger and Gamma divergences for several parameter values.)

4.3. Gamma Divergence

The Gamma divergence is very robust with respect to outliers (5) and was proposed by Fujisawa and Eguchi:

D_γ(p‖q) = log { [∫ p^{γ+1} dr]^{1/(γ(γ+1))} [∫ q^{γ+1} dr]^{1/(γ+1)} / [∫ p q^γ dr]^{1/γ} }.   (46)

It is robust for γ ∈ [0, 1]. In the limit γ → 0 it becomes the Kullback-Leibler divergence D_KL(p‖q) for probability densities. For γ = 1 it becomes the Cauchy-Schwarz divergence

D_CS(p‖q) = ½ log ( ∫ q² dr ∫ p² dr ) − log ( ∫ p q dr ),   (47)

which is based on the quadratic Rényi entropy. The Cauchy-Schwarz divergence is symmetric and was introduced considering the Cauchy-Schwarz inequality for norms. It is frequently applied for Parzen window estimation and is especially suitable for spectral clustering as well as related graph cut problems (55, 5, 57, 3). Some isosurfaces of the Gamma divergence for different values of γ are shown in the last column of Figs. 2 and 4. The equidistance lines for the special case of probability densities can be found in the last column of Figs. 3 and 5. The Gamma divergence displays some nice properties (, 3):

Invariance. D_γ(p‖q) is invariant under scalar multiplication with positive constants:

D_γ(p‖q) = D_γ(c₁ p ‖ c₂ q), c₁, c₂ > 0.   (48)

In case of positive measures the equation D_γ(p‖q) = 0 holds only if p = c·q with c > 0. For probability densities c = 1 is required.

Pythagorean relation. As for Bregman divergences, a modified Pythagorean relation between positive measures can be stated for special choices of p, q, ρ. Let p be a distortion of q, defined as a convex combination with a positive distortion measure φ(r):

p_ε(r) = (1 − ε) q(r) + ε φ(r).   (49)

A positive measure g is denoted as φ-consistent if ν_g = (∫ φ(r) g(r)^α dr)^{1/α} is sufficiently small for large α > 0. If two positive measures q and ρ are φ-consistent with respect to a distortion measure φ, then the Pythagorean relation approximately holds for q, ρ and the distortion p_ε of q:

Δ(p_ε, q, ρ) = D_γ(p_ε‖ρ) − D_γ(p_ε‖q) − D_γ(q‖ρ) = O(ε ν^γ) with ν = max{ν_q, ν_ρ}.   (50)

This property implies the robustness of D_γ with respect to distortions.
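A discrete sketch of the Gamma divergence, Eq. (46), and its Cauchy-Schwarz special case, Eq. (47); the small additive constant guarding against zero entries is our own choice and not part of the definition.

```python
import numpy as np

def gamma_divergence(p, q, gamma=0.5, eps=1e-12):
    """Gamma divergence, Eq. (46); robust for gamma in (0, 1]."""
    p, q = p + eps, q + eps
    log_p = np.log(np.sum(p**(gamma + 1))) / (gamma * (gamma + 1))
    log_q = np.log(np.sum(q**(gamma + 1))) / (gamma + 1)
    log_pq = np.log(np.sum(p * q**gamma)) / gamma
    return log_p + log_q - log_pq

def cauchy_schwarz_divergence(p, q, eps=1e-12):
    """Cauchy-Schwarz divergence, Eq. (47): the Gamma divergence at gamma = 1."""
    p, q = p + eps, q + eps
    return 0.5 * np.log(np.sum(p**2)) + 0.5 * np.log(np.sum(q**2)) - np.log(np.sum(p * q))
```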

5. Discussion of Divergences

In this section we examine and compare some of the introduced divergences by means of controlled experiments. We investigate the behavior of different divergences for the comparison of images containing an increasing level of (non-linear) noise. Therefore, we compute the histograms of gray-value images taken from the Berkeley segmentation data set and of noisy versions of them.

Figure 6: Histograms of intensity values in an example picture. The original image "moon" (top row) together with its histogram is shown on the left side. The following pictures contain noise in the form of a linear monotonically increasing transformation of gray values following Eq. (51) using l = 1, 2, ..., 9, corresponding to the Noise Levels 1 till 9.

5.1. Linearly monotonically increasing noise

In the first experiment the noisy image I′ is obtained by adding a linear monotonically increasing transformation of gray values to the image I:

I′(x, y) = I(x, y) · [l · (I(x, y) − I_min) + 1],   (51)

where l denotes the level of noise and I_min corresponds to the minimal intensity in the original image. Figure 6 shows a picture (in the following referred to as "moon") with different levels of noise following Eq. (51), together with the gray-value histograms. The noise level ranges from l = 1 to l = 9. Some dissimilarity matrices comparing the ten histograms with different divergence measures are shown in Figure 7. The intuitively ideal dissimilarity matrix in this case is a symmetric band matrix, shown in the middle of the top row. Some divergences, like the generalized Rényi divergence, show numerical instabilities. Others show quite similar behavior, e.g. Itakura-Saito, Alpha divergences and the Beta-divergence with β = .5, but they do not exhibit the desired band structure. For the original image and low noise levels (images 1-5) the Beta-divergence with β = .5, the Alpha divergence with α = .5 and also the generalized KL divergence show a bit of the desired band structure. Ignoring the last column and last row (the extreme case) in the dissimilarity matrix, the Eta-divergence shows a good approximation of a band matrix. The Gamma divergence is observed to be quite robust in this case and also exhibits a visible band structure for suitable values of γ. In the special case γ = 1 the Gamma divergence equals the Cauchy-Schwarz divergence and is symmetric. Another symmetric example is the Alpha divergence with α = .5.
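The first experiment can be mimicked with a few lines of NumPy; the noise transformation implements Eq. (51), while the histogram binning and the divergence interface are our own assumptions rather than the exact setup of the paper.

```python
import numpy as np

def add_level_noise(I, level):
    """Gray-value distortion of Eq. (51): I'(x, y) = I(x, y) * (level * (I(x, y) - I_min) + 1)."""
    return I * (level * (I - I.min()) + 1.0)

def histogram(I, bins=256):
    """Normalized gray-value histogram of an image (binning is an assumption)."""
    h, _ = np.histogram(I.ravel(), bins=bins, density=True)
    return h

def dissimilarity_matrix(histograms, divergence, **kwargs):
    """Pairwise dissimilarities between histograms under a chosen divergence."""
    n = len(histograms)
    return np.array([[divergence(histograms[i], histograms[j], **kwargs)
                      for j in range(n)] for i in range(n)])
```

For instance, `dissimilarity_matrix([histogram(add_level_noise(I, l)) for l in range(10)], gamma_divergence, gamma=1.0)` produces matrices of the kind shown in Figure 7.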

Figure 7: Matrices of pairwise dissimilarities of the ten histograms shown in Figure 6 using different divergences (panels include the generalized Kullback-Leibler, Itakura-Saito, generalized Rényi, Eta-, Alpha-, Beta- and Gamma divergences for several parameter values). The ideal dissimilarity matrix for this example is a band matrix, shown in the middle of the top row. Some divergences (marked with an asterisk in the title) show numerical instabilities in case of zeros in the signals; in those cases a small constant c was added to all histograms to prevent the degeneration. Other divergences, like e.g. the Gamma divergence, are more robust. The Eta-divergence (ignoring the extreme cases) and the Gamma divergence exhibit more of the desired band structure for this example compared to other choices.

Figure 8: Histograms of intensity values in an example picture. The original image "dolphins" (top row) together with its histogram is shown on the left side. The following pictures contain noise in the form of a linear monotonically increasing transformation of gray values following Eq. (51) using l = [.1, .2, ..., .9], corresponding to the Noise Levels 1 till 9.

As a second example we take a picture of a group of dolphins and add some noise (following Eq. (51)) using the levels l = [.1, .2, ..., .9]. The resulting histograms of gray values for the different noise levels are shown in Figure 8. As above, we compute the matrices of pairwise similarities between the histograms using different divergences. The results can be found in Figure 9. In this example the Eta-divergence, especially with η = .5, is a good approximation of the ideal dissimilarity matrix shown in the middle of the top row. The best symmetric choice is the Gamma divergence with γ = 1 (Cauchy-Schwarz). Furthermore, dependent on the value of γ one can choose between a better resolution (local) and a better preservation of the hierarchy of the histograms (global). Some other divergences, e.g. the generalized KL and Itakura-Saito, show very poor approximations of the desired dissimilarity for this example.

5.2. Additive uniform noise

In the second experiment the noisy image I′ is obtained by adding uniform noise to the image I:

I′(x, y) = I(x, y) + U(0, l),   (52)

where U(0, l) denotes a scalar value drawn from the uniform distribution in the interval [0, l]. Figure 10 shows the picture of dolphins with different levels of additive uniform noise following Eq. (52), together with the more and more flattened gray-value histograms. The noise level again ranges over nine increasing values of l. Some dissimilarity matrices pairwise comparing the ten images with different divergence measures are shown in Figure 11.

Figure 9: Matrices of pairwise dissimilarities of the ten histograms shown in Figure 8 using different divergences. The ideal dissimilarity matrix for this example is a band matrix, shown in the middle of the top row. Some divergences (marked with an asterisk in the title) show numerical instabilities in case of zeros in the signals; in those cases a small constant c was added to all histograms to prevent the degeneration. The Eta-divergence, especially with η = .5, shows a good approximation of the desired band structure for this example. The Gamma divergence with γ = 1 (Cauchy-Schwarz) is the best symmetric choice in this case.

Figure 10: Histograms of intensity values in an example picture. The original image "dolphins" (top row) together with its histogram is shown on the left side. The following pictures contain additive uniform noise following Eq. (52) using nine increasing values of l, corresponding to the Noise Levels 1 till 9.

Some divergences, like the generalized Rényi, Itakura-Saito and some Alpha- and Beta-divergences, fail to approximate the desired band structure in the pairwise dissimilarity matrix. Others, like the Gamma-, Eta- and some Alpha- and Beta-divergences, are nearly ideal for this example. The Kullback-Leibler divergence is nearly perfect if the original image is ignored.

6. The Fréchet Derivative

In this section we introduce the concept of Fréchet derivatives used for the generalization to arbitrary divergences. Suppose V and W are Banach spaces and U ⊆ V is an open subset of V. The function f : U → W is called Fréchet differentiable at r ∈ U if there exists a bounded linear operator A_r : V → W such that for h ∈ U

lim_{‖h‖_V → 0} ‖f(r + h) − f(r) − A_r(h)‖_W / ‖h‖_V = 0.   (53)

This general definition can be used for functions L : B → IR, defined as mappings from a functional Banach space B to IR. Further, let B be equipped with a norm ‖·‖ and let f, h ∈ B be two functionals. The Fréchet derivative δL[f]/δf of L at the point f (i.e. in a function f) in the direction h is formally defined as:

lim_{ε→0} (1/ε) (L[f + εh] − L[f]) =: [δL[f]/δf] [h].   (54)

The Fréchet derivative in finite-dimensional spaces reduces to the usual partial derivative. Thus, it is a generalization of the directional derivatives. Following (3) we introduce the functional derivatives of divergences in the next paragraphs. An overview is given in Table 3.

Figure 11: Dissimilarity matrices comparing the ten histograms shown in Figure 10 using different divergences. The ideal dissimilarity matrix for this example is a band matrix, shown in the middle of the top row. Some divergences (marked with an asterisk in the title) show numerical instabilities in case of zeros in the signals; in those cases a small constant c was added to all histograms to prevent the degeneration. In this example the Eta-, Beta-, Gamma and the Alpha divergences with α = .5 show good approximations of the ideal band structure. Ignoring the original image, also KL is nearly perfect. Other divergences like Itakura-Saito and generalized Rényi fail in this example.

6.1. Fréchet Derivatives: Bregman Divergences

The Fréchet derivative of D_B^Φ, Eq. (19), with respect to q is formally given by

δD_B^Φ(p‖q)/δq = −δΦ(q)/δq − δ[(δΦ(q)/δq)(p − q)]/δq

with

δ[(δΦ(q)/δq)(p − q)]/δq = (δ²[Φ(q)]/δq²)(p − q) − δΦ(q)/δq.

For the generalized Kullback-Leibler divergence, Eq. (25), this simplifies to

δD_GKL(p‖q)/δq = −p/q + 1,   (55)

whereas for the Kullback-Leibler divergence, Eq. (27), in the special case of probability densities it reads

δD_KL(p‖q)/δq = −p/q.   (56)

For the Itakura-Saito divergence, Eq. (28), we get

δD_IS(p‖q)/δq = (q − p)/q²   (57)

and for the Eta-divergence, Eq. (30), the Fréchet derivative is

δD_η(p‖q)/δq = η(η − 1) q^{η−2} (q − p).   (58)

In the case of η = 2 it reduces to the derivative of the squared Euclidean distance, 2(q − p). The Fréchet derivative for the subset of Beta-divergences, Eq. (32), is given by

δD_β(p‖q)/δq = −p q^{β−2} + q^{β−1}   (59)
= q^{β−2} (q − p).   (60)

6.2. Fréchet Derivatives: Csiszár-f Divergences

For the Csiszár f-divergences, Eq. (35), the Fréchet derivative is

δD_f(p‖q)/δq = f(p/q) + q [δf(u)/δu] [δu/δq]   (61)
= f(p/q) − (p/q) f′(p/q),   (62)

with u = p/q. For the set of Alpha divergences, Eq. (40), we get

δD_α(p‖q)/δq = (1/α) (1 − p^α q^{−α}).   (63)

The related generalized Rényi divergence, Eq. (42), yields

δD_GR^α(p‖q)/δq = (1 − p^α q^{−α}) / ( α ( ∫ [p^α q^{1−α} − α p + (α − 1) q] dr + 1 ) ),   (64)

which reduces, in the case of the Rényi divergence for probability densities, to

δD_R^α(p‖q)/δq = −p^α q^{−α} / ∫ p^α q^{1−α} dr.   (65)

For the Tsallis divergence, Eq. (44), the Fréchet derivative reads

δD_T^α(p‖q)/δq = −p^α q^{−α}   (66)

and for the well-known Hellinger divergence, Eq. (45), the derivative is

δD_H(p‖q)/δq = 1 − √(p/q).   (67)

6.3. Fréchet Derivatives: Gamma Divergences

The Fréchet derivative of the Gamma divergence, Eq. (46), can be written as

δD_γ(p‖q)/δq = q^γ / ∫ q^{γ+1} dr − p q^{γ−1} / ∫ p q^γ dr.   (68)

Considering the important special case γ = 1, i.e. the Cauchy-Schwarz divergence, Eq. (47),

δD_CS(p‖q)/δq = q / ∫ q² dr − p / ∫ p q dr.   (69)
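In the discrete case the Fréchet derivatives above reduce to ordinary partial derivatives, which makes them easy to check numerically; a minimal sketch for the generalized Kullback-Leibler case, Eq. (55), with our own test values:

```python
import numpy as np

def frechet_gkl(p, q):
    """Fréchet derivative of the generalized KL divergence, Eq. (55): -p/q + 1."""
    return -p / q + 1.0

def d_gkl(p, q):
    """Discrete generalized KL divergence, Eq. (25)."""
    return np.sum(p * np.log(p / q) - (p - q))

# Finite-difference check: perturb a single component q_k.
rng = np.random.default_rng(0)
p = rng.random(5) + 0.1
q = rng.random(5) + 0.1
eps, k = 1e-6, 2
q_plus = q.copy()
q_plus[k] += eps
numeric = (d_gkl(p, q_plus) - d_gkl(p, q)) / eps
print(numeric, frechet_gkl(p, q)[k])   # both approximately -p_k/q_k + 1
```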

Table 3: Divergences and their Fréchet derivatives (with respect to q).

- Bregman divergence: D_B^Φ(p‖q) = Φ(p) − Φ(q) − ∫ (δΦ(q)/δq)(p − q) dr; δD_B^Φ/δq = −δΦ(q)/δq − δ[(δΦ(q)/δq)(p − q)]/δq
- gen. Kullback-Leibler: D_GKL(p‖q) = ∫ p log(p/q) dr − ∫ (p − q) dr; δD_GKL/δq = −p/q + 1
- Kullback-Leibler: D_KL(p‖q) = ∫ p log(p/q) dr; δD_KL/δq = −p/q
- Itakura-Saito: D_IS(p‖q) = ∫ [p/q − log(p/q) − 1] dr; δD_IS/δq = (q − p)/q²
- Eta-divergence: D_η(p‖q) = ∫ [p^η + (η − 1) q^η − η p q^{η−1}] dr; δD_η/δq = η(η − 1) q^{η−2} (q − p)
- Beta-divergence: D_β(p‖q) = ∫ p (p^{β−1} − q^{β−1})/(β − 1) dr − ∫ (p^β − q^β)/β dr; δD_β/δq = q^{β−2} (q − p)
- gen. Csiszár-f: D_f^G(p‖q) = c_f ∫ (p − q) dr + ∫ q f(p/q) dr, c_f = f′(1); δD_f^G/δq = f(p/q) − (p/q) f′(p/q) − c_f
- Csiszár-f divergence: D_f(p‖q) = ∫ q f(p/q) dr; δD_f/δq = f(p/q) − (p/q) f′(p/q)
- Alpha divergence: D_α(p‖q) = 1/(α(α−1)) ∫ [p^α q^{1−α} − α p + (α − 1) q] dr; δD_α/δq = (1 − p^α q^{−α})/α
- gen. Rényi: D_GR^α(p‖q) = 1/(α(α−1)) log(∫ [p^α q^{1−α} − α p + (α − 1) q] dr + 1); δD_GR^α/δq = (1 − p^α q^{−α}) / (α (∫ [p^α q^{1−α} − α p + (α − 1) q] dr + 1))
- Rényi: D_R^α(p‖q) = 1/(α−1) log(∫ p^α q^{1−α} dr); δD_R^α/δq = −p^α q^{−α} / ∫ p^α q^{1−α} dr
- Tsallis: D_T^α(p‖q) = 1/(1−α) (1 − ∫ p^α q^{1−α} dr); δD_T^α/δq = −p^α q^{−α}
- Hellinger: D_H(p‖q) = ∫ (√p − √q)² dr; δD_H/δq = 1 − √(p/q)
- Gamma: D_γ(p‖q) = log[(∫ p^{γ+1} dr)^{1/(γ(γ+1))} (∫ q^{γ+1} dr)^{1/(γ+1)} / (∫ p q^γ dr)^{1/γ}]; δD_γ/δq = q^γ/∫ q^{γ+1} dr − p q^{γ−1}/∫ p q^γ dr
- Cauchy-Schwarz: D_CS(p‖q) = ½ log(∫ q² dr ∫ p² dr) − log(∫ p q dr); δD_CS/δq = q/∫ q² dr − p/∫ p q dr

7. t-SNE gradients for various divergences

In this section we explain the t-SNE gradients for various divergences. There exists a large variety of divergences which can be collected into several classes according to their mathematical properties and structural behavior. Here we follow the classification proposed in (). For this purpose, we plug the corresponding Fréchet derivatives into the general gradient Eq. (14) for t-SNE. Clearly, one can convey these results easily to the general SNE gradient Eq. (16) in complete analogy, because of its structural similarity to the t-SNE formula Eq. (14).

A technical remark should be made here: in the following we will abbreviate p(r) by p and p(r′) by p′. Further, because the integration variable r is a function r = r(ξ, ζ), an integration requires the weighting according to the distribution Π_r. Thus, the integration formally has to be carried out according to the differential dΠ_r(r) (Stieltjes integral). We abbreviate this by dr but keep this fact in mind, i.e. by this convention we will drop the distribution Π_r if it is clear from the context.

7.1. Bregman divergences

In the following we provide the gradients for some examples of Bregman divergences introduced in Section 4.1. As a first example we show that we obtain the same result as van der Maaten and Hinton in () for the Kullback-Leibler divergence, Eq. (27). The Fréchet derivative of D_KL with respect to q is given in Eq. (56). From Eq. (14) we see that

∂D_KL/∂ξ = 2 ∫ [q (ξ − ζ)/(1 + r)] ( p/q − ∫ (p′/q′) q′ Π_r dr′ ) dζ.   (70)

Since the integral I = ∫ p′ Π_r dr′ in Eq. (70) can be written as a double integral over all pairs of data points, I = ∫∫ p dξ′ dζ′, we see from Eq. (8) that the integral I equals 1. So Eq. (70) simplifies to

∂D_KL/∂ξ = 2 ∫ [q/(1 + r)] ( p/q − 1 ) (ξ − ζ) dζ = 2 ∫ (1 + r)^{−1} (p − q) (ξ − ζ) dζ.   (71)

This is exactly the differential form of the discrete version as proposed for t-SNE in ().

The Kullback-Leibler divergence used in original SNE and t-SNE belongs to the more general class of Bregman divergences (3). Another representative of this class of divergences is the Itakura-Saito divergence D_IS, Eq. (28), with the Fréchet derivative Eq. (57). For the calculation of the gradient ∂D_IS/∂ξ we substitute the Fréchet derivative in Eq. (14) and obtain

∂D_IS/∂ξ = −2 ∫ [q/(1 + r)] ( (q − p)/q² − ∫ [(q′ − p′)/q′] Π_r dr′ ) (ξ − ζ) dζ
= 2 ∫ [(ξ − ζ)/(1 + r)] [ p/q − 1 + q ( 1 − ∫ (p′/q′) Π_r dr′ ) ] dζ.   (72)
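The equivalence between Eq. (71) and the standard discrete t-SNE gradient can be checked numerically with the following self-contained sketch (our own naming, prefactor convention and random test data); both routines should produce the same gradient up to floating-point error.

```python
import numpy as np

def standard_tsne_grad(Xi, P):
    """Standard discrete t-SNE gradient: 4 * sum_j (p_ij - q_ij)(xi_i - xi_j) / (1 + r_ij)."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + r)
    np.fill_diagonal(num, 0.0)
    Q = num / np.sum(num)
    W = (P - Q) * num
    return 4.0 * (np.sum(W, axis=1)[:, None] * Xi - W @ Xi)

def general_grad_kl(Xi, P):
    """Same gradient via the generalized form of Eq. (14) with dD_KL/dq = -p/q."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + r)
    np.fill_diagonal(num, 0.0)
    Q = num / np.sum(num)
    dD = -P / (Q + 1e-12)
    W = Q * num * (dD - np.sum(dD * Q))
    return -4.0 * (np.sum(W, axis=1)[:, None] * Xi - W @ Xi)

rng = np.random.default_rng(1)
Xi = rng.standard_normal((6, 2))
P = rng.random((6, 6))
np.fill_diagonal(P, 0.0)
P = (P + P.T) / np.sum(P + P.T)   # symmetric, normalized joint probabilities
print(np.allclose(standard_tsne_grad(Xi, P), general_grad_kl(Xi, P)))  # expected: True
```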

One more Bregman divergence is the norm-like or Eta-divergence, Eq. (30). The Fréchet derivative of D_η with respect to q is given in Eq. (58). Again, we are interested in the gradient ∂D_η/∂ξ, which is

∂D_η/∂ξ = 2 η(η − 1) ∫ [(ξ − ζ)/(1 + r)] ( (p − q) q^{η−1} − q ∫ (p′ − q′) q′^{η−1} Π_r dr′ ) dζ.   (73)

The last example of Bregman divergences we handle in this paper is the class of Beta-divergences defined in Eq. (32). We use Eq. (14) and insert the Fréchet derivative of the Beta-divergences, given by Eq. (60). Thereby the gradient ∂D_β/∂ξ reads as

∂D_β/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] ( q^{β−1} (p − q) − q ∫ q′^{β−1} (p′ − q′) Π_r dr′ ) dζ.   (74)

7.2. Csiszár's f-divergences

Now we consider some divergences belonging to the class of Csiszár's f-divergences (see Section 4.2). A well-known example is the Hellinger divergence defined in Eq. (45), with the Fréchet derivative Eq. (67). The gradient of D_H with respect to ξ is

∂D_H/∂ξ = −2 ∫ [q/(1 + r)] ( (1 − √(p/q)) − ∫ (1 − √(p′/q′)) q′ Π_r dr′ ) (ξ − ζ) dζ
= 2 ∫ [(ξ − ζ)/(1 + r)] ( √(p q) − q ∫ √(p′ q′) Π_r dr′ ) dζ.   (75)

For the Alpha divergence, see Eqs. (40) and (63), we get

∂D_α/∂ξ = (2/α) ∫ [(ξ − ζ)/(1 + r)] ( p^α q^{1−α} − q ∫ p′^α q′^{1−α} Π_r dr′ ) dζ.   (76)

For the Tsallis divergence, Eqs. (44) and (66), we get

∂D_T^α/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] ( p^α q^{1−α} − q ∫ p′^α q′^{1−α} Π_r dr′ ) dζ,   (77)

which is also clear from Eq. (76), since the Tsallis divergence is a rescaled version of the Alpha divergence for probability densities. For the Rényi divergences, Eqs. (43) and (65), the derivative reads

∂D_R^α/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] ( p^α q^{1−α} / ∫ p^α q^{1−α} dr − q ) dζ.   (78)

Table 4: Divergences and their t-SNE gradients. For discrete data {x_i}_{i=1}^n ∈ IR^N and embeddings {ξ_i}_{i=1}^n ∈ IR^M, the discrete gradients are obtained from the functional gradients by the substitutions p → p_ij = p_{x_i x_j}, q → q_ij = q_{ξ_i ξ_j}, r → r_ij = ‖ξ_i − ξ_j‖², ∫ (·) Π_r dr′ → Σ_{kl} (·) and 2 ∫ (·) dζ → 4 Σ_j (·); e.g. for the Kullback-Leibler divergence ∂D_KL/∂ξ_i = 4 Σ_j (ξ_i − ξ_j)(p_ij − q_ij)/(1 + ‖ξ_i − ξ_j‖²).

- Kullback-Leibler, Eq. (27): ∂D_KL/∂ξ = 2 ∫ (ξ − ζ)(p − q)/(1 + r) dζ
- Itakura-Saito, Eq. (28): ∂D_IS/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p/q − 1 + q (1 − ∫ p′/q′ Π_r dr′)] dζ
- Eta-divergence, Eq. (30): ∂D_η/∂ξ = 2 η(η − 1) ∫ [(ξ − ζ)/(1 + r)] [(p − q) q^{η−1} − q ∫ (p′ − q′) q′^{η−1} Π_r dr′] dζ
- Beta-divergence, Eq. (32): ∂D_β/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [q^{β−1} (p − q) − q ∫ q′^{β−1} (p′ − q′) Π_r dr′] dζ
- Alpha divergence, Eq. (40): ∂D_α/∂ξ = (2/α) ∫ [(ξ − ζ)/(1 + r)] [p^α q^{1−α} − q ∫ p′^α q′^{1−α} Π_r dr′] dζ
- Rényi divergence, Eq. (43): ∂D_R^α/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p^α q^{1−α}/∫ p^α q^{1−α} dr − q] dζ
- Tsallis divergence, Eq. (44): ∂D_T^α/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p^α q^{1−α} − q ∫ p′^α q′^{1−α} Π_r dr′] dζ
- Hellinger divergence, Eq. (45): ∂D_H/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [√(p q) − q ∫ √(p′ q′) Π_r dr′] dζ
- Gamma divergence, Eq. (46): ∂D_γ/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p q^γ/∫ p q^γ dr − q^{γ+1}/∫ q^{γ+1} dr] dζ
- Cauchy-Schwarz, Eq. (47): ∂D_CS/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p q/∫ p q dr − q²/∫ q² dr] dζ

7.3. Gamma divergences

The Fréchet derivative of D_γ(p‖q) with respect to q, given in Eq. (68), can be rewritten as

δD_γ(p‖q)/δq = q^γ/Q_γ − p q^{γ−1}/V_γ = (q^γ V_γ − p q^{γ−1} Q_γ) / (Q_γ V_γ),

with Q_γ = ∫ q^{γ+1} dr and V_γ = ∫ p q^γ dr. Once again, we use Eq. (14) to calculate the gradient of D_γ with respect to ξ. Noting that

∫ [δD_γ/δq(r′)] q(r′) Π_r dr′ = ∫ q′^{γ+1} Π_r dr′ / Q_γ − ∫ p′ q′^γ Π_r dr′ / V_γ = 1 − 1 = 0,

the gradient becomes

∂D_γ/∂ξ = −2 ∫ [q (ξ − ζ)/(1 + r)] ( q^γ/Q_γ − p q^{γ−1}/V_γ ) dζ = 2 ∫ [(ξ − ζ)/(1 + r)] ( p q^γ / ∫ p q^γ dr − q^{γ+1} / ∫ q^{γ+1} dr ) dζ.   (79)

For the special choice γ = 1 the Gamma divergence becomes the Cauchy-Schwarz divergence, Eq. (47), and the gradient ∂D_CS/∂ξ for t-SNE can be directly derived from Eq. (79):

∂D_CS/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] ( p q / ∫ p q dr − q² / ∫ q² dr ) dζ.   (80)

Moreover, similar derivations can be made for any other divergence, since one only needs to calculate the Fréchet derivative of the divergence and apply it to Eq. (14).

8. Demonstration of different divergences

In this section we demonstrate the use of different divergences in the t-SNE method on the basis of the Olivetti faces data set (publicly available from roweis/data.html) and the COIL-20 data set (5). In the experiments we compare one divergence from each of the three main families: Kullback-Leibler, Rényi and Gamma as examples for Bregman, Csiszár-f and Gamma divergences. For the Gamma divergence we include the special case of Cauchy-Schwarz in the choice of the parameter γ, and the Rényi divergence is closely related to the Alpha divergence as shown in (). The Olivetti data set consists of intensity-value pictures of 40 individuals with small variations in viewpoint, large variation in expression and occasional addition of glasses. The data set contains 400 images (10 per person) of size 64 × 64. The COIL-20 data set contains images of 20 different objects viewed from 72 equally spaced orientations. In total we have 1,440 images of 128 × 128 = 16,384 pixels. As suggested in (), we preprocessed the data by extracting the mean and reducing the dimension to 30 using PCA, with a successive transformation to unit-variance features. For the experiments we constructed a set of independent random initializations, which we reused in the algorithm with different divergences and values of the divergence parameter. To compare the different embeddings we use the one near-


More information

Bregman Divergences. Barnabás Póczos. RLAI Tea Talk UofA, Edmonton. Aug 5, 2008

Bregman Divergences. Barnabás Póczos. RLAI Tea Talk UofA, Edmonton. Aug 5, 2008 Bregman Divergences Barnabás Póczos RLAI Tea Talk UofA, Edmonton Aug 5, 2008 Contents Bregman Divergences Bregman Matrix Divergences Relation to Exponential Family Applications Definition Properties Generalization

More information

Mid Term-1 : Practice problems

Mid Term-1 : Practice problems Mid Term-1 : Practice problems These problems are meant only to provide practice; they do not necessarily reflect the difficulty level of the problems in the exam. The actual exam problems are likely to

More information

Basic Properties of Metric and Normed Spaces

Basic Properties of Metric and Normed Spaces Basic Properties of Metric and Normed Spaces Computational and Metric Geometry Instructor: Yury Makarychev The second part of this course is about metric geometry. We will study metric spaces, low distortion

More information

MAP Examples. Sargur Srihari

MAP Examples. Sargur Srihari MAP Examples Sargur srihari@cedar.buffalo.edu 1 Potts Model CRF for OCR Topics Image segmentation based on energy minimization 2 Examples of MAP Many interesting examples of MAP inference are instances

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Information Geometric view of Belief Propagation

Information Geometric view of Belief Propagation Information Geometric view of Belief Propagation Yunshu Liu 2013-10-17 References: [1]. Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, Stochastic reasoning, Free energy and Information Geometry, Neural

More information

On the Chi square and higher-order Chi distances for approximating f-divergences

On the Chi square and higher-order Chi distances for approximating f-divergences On the Chi square and higher-order Chi distances for approximating f-divergences Frank Nielsen, Senior Member, IEEE and Richard Nock, Nonmember Abstract We report closed-form formula for calculating the

More information

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Machine Learning Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 47 Table of contents 1 Introduction

More information

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization

More information

Variable selection and feature construction using methods related to information theory

Variable selection and feature construction using methods related to information theory Outline Variable selection and feature construction using methods related to information theory Kari 1 1 Intelligent Systems Lab, Motorola, Tempe, AZ IJCNN 2007 Outline Outline 1 Information Theory and

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

Convexity/Concavity of Renyi Entropy and α-mutual Information

Convexity/Concavity of Renyi Entropy and α-mutual Information Convexity/Concavity of Renyi Entropy and -Mutual Information Siu-Wai Ho Institute for Telecommunications Research University of South Australia Adelaide, SA 5095, Australia Email: siuwai.ho@unisa.edu.au

More information

VECTOR-QUANTIZATION BY DENSITY MATCHING IN THE MINIMUM KULLBACK-LEIBLER DIVERGENCE SENSE

VECTOR-QUANTIZATION BY DENSITY MATCHING IN THE MINIMUM KULLBACK-LEIBLER DIVERGENCE SENSE VECTOR-QUATIZATIO BY DESITY ATCHIG I THE IIU KULLBACK-LEIBLER DIVERGECE SESE Anant Hegde, Deniz Erdogmus, Tue Lehn-Schioler 2, Yadunandana. Rao, Jose C. Principe CEL, Electrical & Computer Engineering

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo September 6, 2011 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Online Nonnegative Matrix Factorization with General Divergences

Online Nonnegative Matrix Factorization with General Divergences Online Nonnegative Matrix Factorization with General Divergences Vincent Y. F. Tan (ECE, Mathematics, NUS) Joint work with Renbo Zhao (NUS) and Huan Xu (GeorgiaTech) IWCT, Shanghai Jiaotong University

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Remarks on Extremization Problems Related To Young s Inequality

Remarks on Extremization Problems Related To Young s Inequality Remarks on Extremization Problems Related To Young s Inequality Michael Christ University of California, Berkeley University of Wisconsin May 18, 2016 Part 1: Introduction Young s convolution inequality

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Non-Negative Matrix Factorization with Quasi-Newton Optimization

Non-Negative Matrix Factorization with Quasi-Newton Optimization Non-Negative Matrix Factorization with Quasi-Newton Optimization Rafal ZDUNEK, Andrzej CICHOCKI Laboratory for Advanced Brain Signal Processing BSI, RIKEN, Wako-shi, JAPAN Abstract. Non-negative matrix

More information

Applications of Information Geometry to Hypothesis Testing and Signal Detection

Applications of Information Geometry to Hypothesis Testing and Signal Detection CMCAA 2016 Applications of Information Geometry to Hypothesis Testing and Signal Detection Yongqiang Cheng National University of Defense Technology July 2016 Outline 1. Principles of Information Geometry

More information

Relative Loss Bounds for Multidimensional Regression Problems

Relative Loss Bounds for Multidimensional Regression Problems Relative Loss Bounds for Multidimensional Regression Problems Jyrki Kivinen and Manfred Warmuth Presented by: Arindam Banerjee A Single Neuron For a training example (x, y), x R d, y [0, 1] learning solves

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

Chemometrics: Classification of spectra

Chemometrics: Classification of spectra Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture

More information

Why is Deep Learning so effective?

Why is Deep Learning so effective? Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Spazi vettoriali e misure di similaritá

Spazi vettoriali e misure di similaritá Spazi vettoriali e misure di similaritá R. Basili Corso di Web Mining e Retrieval a.a. 2009-10 March 25, 2010 Outline Outline Spazi vettoriali a valori reali Operazioni tra vettori Indipendenza Lineare

More information

Feature selection and extraction Spectral domain quality estimation Alternatives

Feature selection and extraction Spectral domain quality estimation Alternatives Feature selection and extraction Error estimation Maa-57.3210 Data Classification and Modelling in Remote Sensing Markus Törmä markus.torma@tkk.fi Measurements Preprocessing: Remove random and systematic

More information

Bayes spaces: use of improper priors and distances between densities

Bayes spaces: use of improper priors and distances between densities Bayes spaces: use of improper priors and distances between densities J. J. Egozcue 1, V. Pawlowsky-Glahn 2, R. Tolosana-Delgado 1, M. I. Ortego 1 and G. van den Boogaart 3 1 Universidad Politécnica de

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Olga Kouropteva, Oleg Okun, Matti Pietikäinen Machine Vision Group, Infotech Oulu and

More information

Advances in Manifold Learning Presented by: Naku Nak l Verm r a June 10, 2008

Advances in Manifold Learning Presented by: Naku Nak l Verm r a June 10, 2008 Advances in Manifold Learning Presented by: Nakul Verma June 10, 008 Outline Motivation Manifolds Manifold Learning Random projection of manifolds for dimension reduction Introduction to random projections

More information

The Skorokhod reflection problem for functions with discontinuities (contractive case)

The Skorokhod reflection problem for functions with discontinuities (contractive case) The Skorokhod reflection problem for functions with discontinuities (contractive case) TAKIS KONSTANTOPOULOS Univ. of Texas at Austin Revised March 1999 Abstract Basic properties of the Skorokhod reflection

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS Michael A. Lexa and Don H. Johnson Rice University Department of Electrical and Computer Engineering Houston, TX 775-892 amlexa@rice.edu,

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Jensen-Shannon Divergence and Hilbert space embedding

Jensen-Shannon Divergence and Hilbert space embedding Jensen-Shannon Divergence and Hilbert space embedding Bent Fuglede and Flemming Topsøe University of Copenhagen, Department of Mathematics Consider the set M+ 1 (A) of probability distributions where A

More information

PDEs in Image Processing, Tutorials

PDEs in Image Processing, Tutorials PDEs in Image Processing, Tutorials Markus Grasmair Vienna, Winter Term 2010 2011 Direct Methods Let X be a topological space and R: X R {+ } some functional. following definitions: The mapping R is lower

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

A note on the σ-algebra of cylinder sets and all that

A note on the σ-algebra of cylinder sets and all that A note on the σ-algebra of cylinder sets and all that José Luis Silva CCM, Univ. da Madeira, P-9000 Funchal Madeira BiBoS, Univ. of Bielefeld, Germany (luis@dragoeiro.uma.pt) September 1999 Abstract In

More information

Dimension Reduction. David M. Blei. April 23, 2012

Dimension Reduction. David M. Blei. April 23, 2012 Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do

More information

Multidimensional scaling (MDS)

Multidimensional scaling (MDS) Multidimensional scaling (MDS) Just like SOM and principal curves or surfaces, MDS aims to map data points in R p to a lower-dimensional coordinate system. However, MSD approaches the problem somewhat

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Multivariate class labeling in Robust Soft LVQ

Multivariate class labeling in Robust Soft LVQ Multivariate class labeling in Robust Soft LVQ Petra Schneider, Tina Geweniger 2, Frank-Michael Schleif 3, Michael Biehl 4 and Thomas Villmann 2 - School of Clinical and Experimental Medicine - University

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered

More information

Partial cubes: structures, characterizations, and constructions

Partial cubes: structures, characterizations, and constructions Partial cubes: structures, characterizations, and constructions Sergei Ovchinnikov San Francisco State University, Mathematics Department, 1600 Holloway Ave., San Francisco, CA 94132 Abstract Partial cubes

More information

Supervised locally linear embedding

Supervised locally linear embedding Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,

More information

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1 CSCI5654 (Linear Programming, Fall 2013) Lectures 10-12 Lectures 10,11 Slide# 1 Today s Lecture 1. Introduction to norms: L 1,L 2,L. 2. Casting absolute value and max operators. 3. Norm minimization problems.

More information

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad 40.850: athematical Foundation of Big Data Analysis Spring 206 Lecture 8: inimax Lower Bounds: LeCam, Fano, and Assouad Lecturer: Fang Han arch 07 Disclaimer: These notes have not been subjected to the

More information

Divergence based Learning Vector Quantization

Divergence based Learning Vector Quantization Divergence based Learning Vector Quantization E. Mwebaze 1,2, P. Schneider 2, F.-M. Schleif 3, S. Haase 4, T. Villmann 4, M. Biehl 2 1 Faculty of Computing & IT, Makerere Univ., P.O. Box 7062, Kampala,

More information

A PLANAR SOBOLEV EXTENSION THEOREM FOR PIECEWISE LINEAR HOMEOMORPHISMS

A PLANAR SOBOLEV EXTENSION THEOREM FOR PIECEWISE LINEAR HOMEOMORPHISMS A PLANAR SOBOLEV EXTENSION THEOREM FOR PIECEWISE LINEAR HOMEOMORPHISMS EMANUELA RADICI Abstract. We prove that a planar piecewise linear homeomorphism ϕ defined on the boundary of the square can be extended

More information

ZOBECNĚNÉ PHI-DIVERGENCE A EM-ALGORITMUS V AKUSTICKÉ EMISI

ZOBECNĚNÉ PHI-DIVERGENCE A EM-ALGORITMUS V AKUSTICKÉ EMISI ZOBECNĚNÉ PHI-DIVERGENCE A EM-ALGORITMUS V AKUSTICKÉ EMISI Jan Tláskal a Václav Kůs Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague 11.11.2010 Concepts What the

More information

Randomized Quantization and Optimal Design with a Marginal Constraint

Randomized Quantization and Optimal Design with a Marginal Constraint Randomized Quantization and Optimal Design with a Marginal Constraint Naci Saldi, Tamás Linder, Serdar Yüksel Department of Mathematics and Statistics, Queen s University, Kingston, ON, Canada Email: {nsaldi,linder,yuksel}@mast.queensu.ca

More information