Stochastic Neighbor Embedding (SNE) for Dimension Reduction and Visualization using arbitrary Divergences


Kerstin Bunte a,b, Sven Haase c, Michael Biehl a, Thomas Villmann c

a Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, P.O. Box 7, 97AK Groningen - The Netherlands
b University of Bielefeld - CITEC Center of Excellence, D-335 Bielefeld - Germany
c Department of Mathematics, University of Applied Sciences Mittweida - Germany

Email address: kerstin.bunte@googl .com (Kerstin Bunte), URL: kbunte (Kerstin Bunte)

Preprint submitted to Neurocomputing.

Abstract

We present a systematic approach to the mathematical treatment of the t-distributed Stochastic Neighbor Embedding (t-SNE) and the Stochastic Neighbor Embedding (SNE) method. This allows an easy adaptation of the methods or exchange of their respective modules. In particular, the divergence which measures the difference between probability distributions in the original and the embedding space can be treated independently from other components like, e.g., the similarity of data points or the data distribution. We focus on the extension to different divergences and propose a general framework based on the consideration of Fréchet derivatives. This way the general approach can be adapted to user-specific needs.

Keywords: Dimension Reduction, Visualization, Divergence optimization, Nonlinear embedding, Stochastic neighbor embedding.

1. Introduction

Various dimension reduction techniques have been introduced based on the aim of preserving specific properties of the original data. The spectrum ranges from linear projections of the original data, such as Principal Component Analysis (PCA) or classical Multidimensional Scaling (MDS) (), to a variety of locally linear and nonlinear approaches, such as Isomap (, 3), Locally Linear Embedding (LLE) (), Local Linear Coordination (LLC) (5), or charting (, 7). Other methods aim at the preservation of the classification accuracy in lower dimensions and incorporate the available label information for the embedding, e.g. Linear Discriminant Analysis (LDA) () and generalizations thereof (9), extensions of the Self-Organizing Map (SOM) () incorporating class labels (), and Limited Rank Matrix Learning Vector Quantization (LiRaM LVQ) (, 3).

For a comprehensive review on nonlinear dimensionality reduction methods, we refer to ().

Recently, the Stochastic Neighbor Embedding (SNE) (5) and extensions thereof have become popular for visualization. SNE approximates the probability distribution in the high-dimensional space, defined by neighboring points, with their probability distribution in a lower-dimensional space. In () the authors proposed a technique called t-SNE, which is a variation of SNE considering a particular statistical model assumption for data distributions. The similarity of the distributions is quantified in terms of the Kullback-Leibler divergence. In (7) it is argued that the preservation of shift-invariant similarities as employed by SNE and its variants is superior in comparison to distance preservation as performed by many traditional dimension reduction techniques.

Functional metrics like Sobolev distances, kernel-based dissimilarity measures and divergences have attracted attention recently for the processing of data showing a functional structure. These dissimilarity measures were for example investigated as alternatives to the most common choice, the Euclidean distance (, 9,,, ). The application of divergences for Vector Quantization and Learning Vector Quantization schemes has been investigated in (3, ). This work builds on (5), where the Self-Organized Neighbor Embedding (SONE), which can be seen as a hybrid between the Self-Organizing Map (SOM) and SNE, has been extended to the use of arbitrary divergences.

In this contribution, we formulate a mathematical framework based on Fréchet derivatives which allows us to generalize the concept of SNE and t-SNE to arbitrary divergences. This leads to a new dimension reduction and visualization scheme, which can be adapted to the user-specific requirements in an actual problem. We summarize the general classes of divergences following the scheme introduced by () and extended in (3). The mathematical framework for functional derivatives of continuous divergences is given by the functional-analytic generalization of common derivatives, known as Fréchet derivatives (7, ). It is the generalization of partial derivatives for the discrete variants of the divergences. We introduce a general mathematical framework for the extension of SNE and t-SNE to arbitrary divergences. The different classes of divergences are characterized and for various examples the Fréchet derivatives are identified. We demonstrate the proposed framework for the example case of the Gamma divergence. The behavior of different divergences stemming from the identified divergence families is shown on several examples in the image analysis domain.

2. Review of SNE and t-SNE

Generally, dimensionality reduction methods convert a high-dimensional data set {x_i}_{i=1}^n ∈ IR^N into low-dimensional data {ξ_i}_{i=1}^n ∈ IR^M. A probabilistic approach to visualize the structure of complex data sets, preserving neighbor similarities, is Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis (5). SNE converts high-dimensional Euclidean distances between data points into probabilities that represent similarities. The conditional probability p_{j|i} that a data point x_i would pick x_j as its neighbor is given by

p_{j|i} = exp(−‖x_i − x_j‖² / (2σ_i²)) / Σ_{k≠i} exp(−‖x_i − x_k‖² / (2σ_i²)),   (1)

with p_{i|i} = 0. The variance σ_i of the Gaussian centered around x_i is determined by a binary search procedure (). The density of the data is likely to vary: in dense regions a smaller value of σ_i is more appropriate than in sparse regions. Let P_i be the conditional probability distribution over all other data points given point x_i. This distribution has an entropy which increases as σ_i increases. SNE performs a binary search for the value of σ_i which produces a P_i with a fixed perplexity specified by the user. The perplexity is defined as

perpl(P_i) = 2^{H(P_i)},   (2)

where H(P_i) is the Shannon entropy of P_i measured in bits: H(P_i) = −Σ_j p_{j|i} log₂ p_{j|i}. It can be interpreted as a smooth measure of the effective number of neighbors, and typical values range between 5 and 50, dependent on the size of the data set. The low-dimensional counterparts ξ_i and ξ_j of the high-dimensional data points x_i and x_j are modeled by similar probabilities

q_{j|i} = exp(−‖ξ_i − ξ_j‖²) / Σ_{k≠i} exp(−‖ξ_i − ξ_k‖²),   (3)

with again q_{i|i} = 0. SNE tries to find a low-dimensional data representation which minimizes the mismatch between the conditional probabilities p_{j|i} and q_{j|i}. As a measure of mismatch the Kullback-Leibler divergence D_KL is used, such that the cost function of SNE is given by

C = Σ_i D_KL(P_i ‖ Q_i) = Σ_i Σ_j p_{j|i} log (p_{j|i} / q_{j|i}),   (4)

where Q_i is defined, similarly to P_i, as the conditional probability distribution over all other points given ξ_i. The cost function is not symmetric and focuses on retaining the local structure of the data in the mapping. Large costs appear for mapping nearby data points widely separated in the embedding, but there is only a small cost for mapping widely separated data points close together. The minimization of the cost function Eq. (4) is performed using a gradient descent approach. For details we refer to (5).

The so-called crowding problem may be observed in SNE and other local techniques, like for example Sammon mapping (). The (even very small) attractive forces might crush moderately dissimilar points together in the center of the map.
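To make the perplexity calibration concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2); it is not taken from the paper, and the function name, tolerances and search bounds are our own choices. For each point the width σ_i is tuned by binary search over β_i = 1/(2σ_i²) until the perplexity of P_i matches a user-chosen target.

```python
import numpy as np

def conditional_p(X, target_perplexity=30.0, tol=1e-5, max_iter=50):
    """Compute p_{j|i} as in Eq. (1); each sigma_i is tuned by binary search
    so that the perplexity of P_i (Eq. (2)) matches the target."""
    n = X.shape[0]
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    P = np.zeros((n, n))
    for i in range(n):
        lo, hi = 1e-20, 1e20      # search bounds for beta_i = 1 / (2 sigma_i^2)
        beta = 1.0
        d_i = np.delete(D[i], i)
        for _ in range(max_iter):
            p = np.exp(-d_i * beta)
            p /= np.sum(p)
            H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy in bits
            if abs(2.0 ** H - target_perplexity) < tol:
                break
            if 2.0 ** H > target_perplexity:      # too many effective neighbors: increase beta
                lo = beta
                beta = beta * 2.0 if hi > 1e19 else (lo + hi) / 2.0
            else:                                 # too few effective neighbors: decrease beta
                hi = beta
                beta = (lo + hi) / 2.0
        P[i, np.arange(n) != i] = p
    return P
```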

To address this problem, van der Maaten and Hinton presented in () a technique called t-SNE, which is a variation of SNE considering another statistical model assumption for the data distribution. Instead of using the conditional probabilities p_{j|i} and q_{j|i}, the joint probability distributions P and Q are used to optimize a symmetric version of SNE with the cost function

C = D_KL(P ‖ Q) = Σ_i Σ_j p_ij log (p_ij / q_ij)   (5)

with p_ii = q_ii = 0. Here, the pairwise similarities in the high-dimensional space are defined by the conditional probabilities

p_ij = (p_{j|i} + p_{i|j}) / (2n)   (6)

and the low-dimensional similarities are given by

q_ij = (1 + ‖ξ_i − ξ_j‖²)^{−1} / Σ_{k≠l} (1 + ‖ξ_k − ξ_l‖²)^{−1}.   (7)

The application of the heavy-tailed Student t-distribution with one degree of freedom allows moderate distances in the high-dimensional space to be modeled by much larger distances in the embedding. Therefore, the unwanted attractive forces between map points that represent moderately dissimilar data points are eliminated. See () for further details.
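For reference, a small sketch of the discrete quantities of Eqs. (6) and (7), again under the assumption of a NumPy setting and with our own function names: the symmetrized joint probabilities p_ij and the Student-t similarities q_ij.

```python
import numpy as np

def joint_p(P_conditional):
    """Symmetrized joint probabilities of Eq. (6): p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_conditional.shape[0]
    return (P_conditional + P_conditional.T) / (2.0 * n)

def joint_q_student_t(Xi):
    """Low-dimensional similarities of Eq. (7), based on a Student t-distribution with
    one degree of freedom: q_ij = (1 + ||xi_i - xi_j||^2)^-1 / sum_{k != l} (1 + ||xi_k - xi_l||^2)^-1."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    num = 1.0 / (1.0 + r)
    np.fill_diagonal(num, 0.0)   # q_ii = 0
    return num / np.sum(num)
```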

3. A generalized framework

In this article we provide the mathematical framework for the generalization of t-SNE and SNE with respect to the use of arbitrary divergences in the cost function for the gradient descent. We generalize the definitions towards continuous measures in the high-dimensional space X = {x, y} and a low-dimensional space E = {ξ, ζ} ⊆ IR^M. The pairwise similarities in the high-dimensional original data space are set to

p = p_xy = (p_{y|x} + p_{x|y}) / (2 ∫ dy)   (8)

with conditional probabilities

p_{y|x} = exp(−‖x − y‖² / (2σ_x²)) / ∫ exp(−‖x − y′‖² / (2σ_x²)) dy′.

3.1. The generalized t-SNE gradient

Let D(p ‖ q) be a divergence for non-negative integrable measure functions p = p(r) and q = q(r) with a domain V, and let ξ, ζ ∈ E be distributed according to Π_E (). Further, let r(ξ, ζ) : E × E → IR with the distribution Π_r = φ(r, Π_E). Let us use the squared Euclidean distance in the low-dimensional space:

r = r(ξ, ζ) = ‖ξ − ζ‖².   (9)

For t-SNE, q is obtained by means of a Student t-distribution, such that

q(r(ξ, ζ′)) = (1 + r(ξ, ζ′))^{−1} / ∫∫ (1 + r(ξ, ζ))^{−1} dξ dζ,   (10)

which we will abbreviate below, for reasons of clarity, as

q(r′) = (1 + r′)^{−1} / ∫∫ (1 + r)^{−1} dξ dζ = f(r′) / I.   (11)

Now let us consider the derivative of D with respect to ξ:

∂D/∂ξ = ∂D(p, q(r(ξ, ζ)))/∂ξ = ∫∫ (δD/δr) (∂r/∂ξ) dξ′ dζ′ = ∫∫ (δD/δr(ξ′, ζ′)) (∂r(ξ′, ζ′)/∂ξ) dξ′ dζ′ = 2 ∫ (δD/δr(ξ, ζ)) (ξ − ζ) dζ.   (12)

We now have to consider δD/δr(ξ, ζ). Again, using the chain rule for functional derivatives we get

δD/δr(ξ, ζ) = ∫∫ [δD/δq(r(ξ′, ζ′))] [δq(r(ξ′, ζ′))/δr(ξ, ζ)] dξ′ dζ′ = ∫ [δD/δq(r′)] [δq(r′)/δr] Π_r dr′,   (13)

where for δq(r′)/δr the following holds, with δf(r′)/δr = −δ_{r,r′} (1 + r)^{−2} and δI/δr = −(1 + r)^{−2}:

δq(r′)/δr = [δf(r′)/δr] / I − [f(r′)/I²] δI/δr
= −δ_{r,r′} (1 + r)^{−2} / I + [f(r′)/I] (1 + r)^{−2} / I
= [q(r)/(1 + r)] [q(r′) − δ_{r,r′}]
= −[q(r)/(1 + r)] (δ_{r,r′} − q(r′)).

Substituting these results in Eq. (13), we get

δD/δr = ∫ [δD/δq(r′)] [δq(r′)/δr] Π_r dr′
= −[q(r)/(1 + r)] ∫ [δD/δq(r′)] (δ_{r,r′} − q(r′)) Π_r dr′
= −[q(r)/(1 + r)] [ δD/δq(r) − ∫ [δD/δq(r′)] q(r′) Π_r dr′ ].

Finally, collecting all terms, we get

∂D/∂ξ = 2 ∫ (δD/δr) (ξ − ζ) dζ = −2 ∫ [q(r)/(1 + r)] [ δD/δq(r) − ∫ [δD/δq(r′)] q(r′) Π_r dr′ ] (ξ − ζ) dζ.   (14)

We now have the obvious advantage that we can derive ∂D/∂ξ for several divergences D(p ‖ q) directly from Eq. (14), if the Fréchet derivative δD/δq(r) of D with respect to q(r) is known. The concept of Fréchet derivatives and explicit formulas for different divergences are given in Section 6.
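As an illustration of how Eq. (14) acts as a plug-in scheme, the following sketch evaluates a discrete analogue of the generalized t-SNE gradient for a user-supplied, element-wise Fréchet derivative δD/δq. The prefactor 4 and the sign convention are our assumption, chosen so that the Kullback-Leibler case reproduces the standard discrete t-SNE gradient (cf. Section 7); function names are ours, not the authors'.

```python
import numpy as np

def generalized_tsne_gradient(Xi, P, frechet_dq):
    """Discrete sketch of Eq. (14):
    grad_i = -4 * sum_j q_ij / (1 + r_ij) * (dD/dq(q_ij) - sum_kl dD/dq(q_kl) q_kl) * (xi_i - xi_j).
    `frechet_dq(P, Q)` must return the element-wise Fréchet derivative dD/dq evaluated at q_ij."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + r)
    np.fill_diagonal(num, 0.0)
    Q = num / np.sum(num)                 # Student-t similarities, Eq. (7)
    dD = frechet_dq(P, Q)                 # element-wise dD/dq(q_ij)
    const = np.sum(dD * Q)                # discrete analogue of the integral term in Eq. (14)
    W = Q * num * (dD - const)            # q_ij / (1 + r_ij) * [dD/dq - const]
    # assumed factor 4: 2 from dr/dxi plus 2 from the symmetric double sum over pairs
    return -4.0 * (np.sum(W, axis=1)[:, None] * Xi - W @ Xi)

def kl_frechet(P, Q, eps=1e-12):
    """Plugging in dD_KL/dq = -p/q recovers the familiar t-SNE gradient
    4 * sum_j (p_ij - q_ij)(xi_i - xi_j) / (1 + r_ij)."""
    return -P / (Q + eps)
```

Exchanging `kl_frechet` for any other Fréchet derivative from Section 6 changes the embedding objective without touching the rest of the optimization.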

3.2. The generalized SNE gradient

In symmetric SNE, the pairwise similarities in the low-dimensional map are given by ()

q_SNE = q_SNE(r(ξ, ζ′)) = exp(−r(ξ, ζ′)) / ∫∫ exp(−r(ξ, ζ)) dξ dζ,

which we will abbreviate below, for reasons of clarity, as

q_SNE(r′) = exp(−r′) / ∫∫ exp(−r) dξ dζ = g(r′) / J,   (15)

with g(r′) = exp(−r′) and J representing the integral in the denominator. Consequently, if we consider ∂D/∂ξ, we can use the results from above for t-SNE. The only term that differs is the derivative of q_SNE(r′) with respect to r. With

δg(r′)/δr = −δ_{r,r′} exp(−r) and δJ/δr = −exp(−r)

we get

δq_SNE(r′)/δr = [δg(r′)/δr] / J − [g(r′)/J²] δJ/δr
= −δ_{r,r′} exp(−r)/J + [g(r′)/J] exp(−r)/J
= −δ_{r,r′} q_SNE(r) + q_SNE(r′) q_SNE(r)
= −q_SNE(r) (δ_{r,r′} − q_SNE(r′)).

Substituting these results in Eq. (13), we get

δD/δr = ∫ [δD/δq_SNE(r′)] [δq_SNE(r′)/δr] Π_r dr′
= −q_SNE(r) ∫ [δD/δq_SNE(r′)] (δ_{r,r′} − q_SNE(r′)) Π_r dr′
= −q_SNE(r) [ δD/δq_SNE(r) − ∫ [δD/δq_SNE(r′)] q_SNE(r′) Π_r dr′ ].

Finally, substituting this result in Eq. (12), we obtain

∂D/∂ξ = 2 ∫ (δD/δr) (ξ − ζ) dζ = −2 ∫ q_SNE(r) (ξ − ζ) [ δD/δq_SNE(r) − ∫ [δD/δq_SNE(r′)] q_SNE(r′) Π_r dr′ ] dζ   (16)

as the general formulation of the SNE cost-function gradient, which uses the Fréchet derivatives of the applied divergences just as above for t-SNE.
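For completeness, a corresponding discrete sketch of the SNE variant, Eq. (16): the only changes with respect to the t-SNE sketch above are the Gaussian similarity model of Eq. (15) and the missing (1 + r)^{-1} weight. The prefactor and naming are again our own assumptions.

```python
import numpy as np

def generalized_sne_gradient(Xi, P, frechet_dq):
    """Discrete sketch of Eq. (16) with Gaussian similarities q_SNE of Eq. (15)."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)
    num = np.exp(-r)
    np.fill_diagonal(num, 0.0)
    Q = num / np.sum(num)                       # q_SNE, Eq. (15)
    dD = frechet_dq(P, Q)                       # element-wise dD/dq(q_ij)
    W = Q * (dD - np.sum(dD * Q))               # q_SNE(r) * [dD/dq - integral term]
    return -4.0 * (np.sum(W, axis=1)[:, None] * Xi - W @ Xi)
```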

4. Specifications of Divergences

Divergences are derived from simple component-wise errors, e.g. the Euclidean and Minkowski metrics (). These frequently used metrics are intuitive and they are optimal estimators in case of Gaussian noise or error. However, if the observations are corrupted not only by Gaussian noise but also by outliers, estimators based on these metrics can be strongly biased. They also suffer from the curse of dimensionality, which means that observations become equidistant in terms of the Euclidean distance for high-dimensional data. In many applications like pattern matching, image analysis, statistical learning, etc., the noise is not necessarily Gaussian and information divergences are used. Employing generalized divergences might provide a compromise between efficiency and robustness and/or a compromise between mean squared error and bias.

Divergences are functionals D(p ‖ q) designed as dissimilarity measures between two non-negative integrable functions p and q (). In practice, p usually corresponds to the observed data and q denotes the estimated or expected data. We assume p(r) and q(r) are positive measures defined on r in the domain V. The weight of the functional p is defined as

W(p) = ∫_V p(r) dr.   (17)

Positive measures with the additional constraint W(p) = 1 can be interpreted as probability density functions. Generally speaking, divergences measure a quasi-distance or directed difference, while we are mostly interested in separable measures, which satisfy the condition

D(p ‖ q) > 0 for p ≢ q and D(p ‖ q) = 0 iff p ≡ q.   (18)

In contrast to a metric, divergences may be non-symmetric, D(p ‖ q) ≠ D(q ‖ p), and do not necessarily satisfy the triangle inequality D(p ‖ q) ≤ D(p ‖ z) + D(z ‖ q). Following () one can distinguish at least three main families of divergences with the same consistent properties: Bregman divergences, Csiszár's f-divergences and γ-divergences. Note that all these families contain the Kullback-Leibler (KL) divergence as a special case, so the KL divergence can be seen as the non-empty intersection between the sets of divergences. In general we assume p and q to be positive measures. In case they are normalized we refer to them as probability densities. We review some basic properties of divergences in the following sections. For detailed information we refer to (, 9). An overview of the families of divergences, examples and their relationship to each other can be found in Figure 1. Some important properties are summarized in Tables 1 and 2. We review the families of divergences and some examples in the following sections.

Figure 1: Overview of the families of divergences and their relationship to each other. The shortcut Prob. denotes the special case of probability densities. For the sake of clarity we show the most important relations only and do not claim completeness. (The diagram relates the Bregman, Csiszár-f and Gamma families and members such as the generalized Kullback-Leibler, Itakura-Saito, Eta-, Beta-, Euclidean, Alpha-, Rényi, Tsallis, Hellinger and Cauchy-Schwarz divergences.)

Table 1: Divergences and their properties. The example divergences inherit the properties of the divergence family and sometimes show additional properties, stated individually. The shortcut (pd) denotes that the divergence is defined only for probability densities.

- Bregman divergence D_B^Φ(p‖q) = Φ(p) − Φ(q) − ∫ (δΦ(q)/δq)(p − q) dr; generalized entropy H_Φ(p) = −Φ(p); properties: convexity in p, linearity in Φ, invariance under affine transformations of Φ, three-point property, generalized Pythagorean theorem.
  - generalized Kullback-Leibler [Φ(u) = ∫ (u log u − u) dr]: related to the Shannon entropy H_S(p) = −∫ p ln(p) dr.
  - Itakura-Saito [Φ(u) = −∫ log u dr]: related to the Burg entropy H_B(p) = ∫ log(p) dr; scale invariant.
  - Eta-divergence [Φ(u) = ∫ u^η dr, η > 1].
  - Beta-divergence [Φ(u) = (u^β − β u + β − 1)/(β(β − 1))]: related to the Tsallis entropy; scaling D_β(cp‖cq) = c^β D_β(p‖q); a rescaled Eta-divergence.
  - Euclidean [Φ(u) = u²]: symmetric.
- Gamma divergence: related to the Rényi entropy; scale invariant, D_γ(c₁p‖c₂q) = D_γ(p‖q); modified Pythagorean theorem.
  - Cauchy-Schwarz (γ = 1): symmetric; based on the Cauchy-Schwarz inequality.

Table 2: Divergences and their properties (continued).

- generalized Csiszár-f divergence D_f^G(p‖q) = c_f ∫ (p − q) dr + ∫ q f(p/q) dr; generalized entropy H_f(p) = −∫ f(p) dr; properties: convexity in both p and q, scaling c D_f = D_{cf} for c > 0, invariance under bijective transformations, symmetrization via f_sym(u) = f(u) + f*(u), upper bound.
- Csiszár f-divergence (pd) D_f(p‖q) = ∫ q f(p/q) dr; same properties as above; bounded.
  - Alpha divergence [f(u) = (u^α − α u + α − 1)/(α(α − 1)), u = p/q]: related to the Tsallis entropy; convexity in both p and q; scaling D_α(cp‖cq) = c D_α(p‖q); duality D_α(p‖q) = D_{1−α}(q‖p); continuity in α.
  - Hellinger (pd) [f(u) = (√u − 1)²]: symmetric.
  - Tsallis (pd): related to the Tsallis entropy; a rescaled Alpha divergence.

4.1. Bregman Divergences

A Bregman divergence is defined as a pseudo-distance between two positive measures p and q: D_B(p‖q) : L × L → IR⁺. Let Φ be a strictly convex real-valued function with the domain of the Lebesgue-integrable functions L, twice continuously Fréchet-differentiable (). Then the Bregman divergence can be defined by

D_B^Φ(p‖q) = Φ(p) − Φ(q) − ∫ [δΦ(q)/δq] (p − q) dr,   (19)

where δΦ(q)/δq is the Fréchet derivative of Φ with respect to q (3). Well-known fundamental properties of the Bregman divergences are ():

Convexity. A Bregman divergence is always convex in its first argument but not necessarily in its second.

Non-negativity. D_B^Φ(p‖q) ≥ 0 and D_B^Φ(p‖q) = 0 iff p ≡ q.   (20)

Linearity. Bregman divergences are linear in the generating function Φ. Any positive linear combination of Bregman divergences is also a Bregman divergence: D_B^{c₁Φ₁+c₂Φ₂}(·‖·) = c₁ D_B^{Φ₁}(·‖·) + c₂ D_B^{Φ₂}(·‖·), c₁, c₂ > 0.

Invariance. A Bregman divergence is invariant under affine transformations. Thus, D_B^Γ(p‖q) = D_B^Φ(p‖q) is valid for any affine transformation

Γ(q) = Φ(q) + Ψ_g[q] + c   (21)

with linear operator

Ψ_g[q] = [δΓ(g)/δg](q) − [δΦ(g)/δg](q)   (22)

for positive measures g and q and scalar c.

Three-point property. For any triple p, q, g of positive measures the property holds:

D_B^Φ(p‖g) = D_B^Φ(p‖q) + D_B^Φ(q‖g) + ∫ (p − q) [δΦ(q)/δq − δΦ(g)/δg] dr.   (23)

Generalized Pythagorean theorem. Let P_Ω(q) = argmin_{ω∈Ω} D_B^Φ(ω‖q) be the Bregman projection onto the convex set Ω and p ∈ Ω. The inequality

D_B^Φ(p‖q) ≥ D_B^Φ(p‖P_Ω(q)) + D_B^Φ(P_Ω(q)‖q)   (24)

is known as the generalized Pythagorean theorem. If Ω is an affine set it holds with equality.

Optimality. In (3) an optimality property is stated. Given a set S of positive measures p with mean μ = E[S] and μ ∈ S, the mean μ is the unique minimizer of E_{p∈S}[D(p‖q)] if D is a Bregman divergence. This property favors the Bregman divergences for optimization and clustering problems (3, 3, 33, 3, 35).

The Bregman divergences include many prominent dissimilarity measures like (, 3, 3):

The generalized Kullback-Leibler (or I-) divergence for positive measures p and q:

D_GKL(p‖q) = ∫ p log(p/q) dr − ∫ (p − q) dr   (25)

using the generating function

Φ(f) = ∫ (f log f − f) dr.   (26)

Some 3-dimensional isosurfaces for the generalized Kullback-Leibler divergence with respect to different reference points can be found in the first column of Figures 2 and 4. For probability densities p and q, Eq. (25) simplifies to the Kullback-Leibler divergence (37, 3):

D_KL(p‖q) = ∫ p log(p/q) dr,   (27)

which is related to the Shannon entropy (39). Equidistance contours for 3-dimensional probability densities using the Kullback-Leibler divergence with respect to different reference points are displayed in the first row of Figures 3 and 5.

The Itakura-Saito divergence ():

D_IS(p‖q) = ∫ [p/q − log(p/q) − 1] dr   (28)

is based on the Burg entropy, which also serves as the generating function:

Φ(f) = −∫ log(f) dr.   (29)

The Itakura-Saito divergence was originally presented as a measure of the quality of fits between two spectra and became a standard measure in the speech and image processing community due to the good perceptual properties of the reconstructed signals. It is known as the negative cross-Burg entropy and fulfills the scale-invariance property D_IS(c·p ‖ c·q) = D_IS(p‖q), which implies that the same relative weight is given to low and high components of p, see () for details.

The Eta-divergence is also known as norm-like divergence ():

D_η(p‖q) = ∫ [p^η + (η − 1) q^η − η p q^{η−1}] dr   (30)

with generating function

Φ(f) = ∫ f^η dr for η > 1.   (31)

In the case η = 2 the Eta-divergence becomes the squared Euclidean distance with generating function Φ(f) = ∫ f² dr.

The Beta-divergence ():

D_β(p‖q) = ∫ p (p^{β−1} − q^{β−1}) / (β − 1) dr − ∫ (p^β − q^β) / β dr   (32)

with β ≠ 0 and β ≠ 1 and the generating function

Φ(f) = (f^β − β f + β − 1) / (β (β − 1)).   (33)

For specific values of β the divergence becomes:
β → 1: generalized Kullback-Leibler divergence, Eq. (25)
β → 0: Itakura-Saito divergence, Eq. (28)
β = 2: Euclidean distance (up to a constant factor).
Furthermore, the Beta-divergence is equivalent to the density power divergence (3, 3, ) and a rescaled version of the Eta-divergence.
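For experimentation with the Bregman examples above on discrete positive measures such as histograms, a small NumPy sketch may look as follows; the formulas follow Eqs. (25), (28), (30) and (32), while the regularization constants and default parameters are our own choices.

```python
import numpy as np

def gen_kl(p, q, eps=1e-12):
    """Generalized Kullback-Leibler divergence, Eq. (25), for positive measures."""
    return np.sum(p * np.log((p + eps) / (q + eps)) - (p - q))

def itakura_saito(p, q, eps=1e-12):
    """Itakura-Saito divergence, Eq. (28)."""
    ratio = (p + eps) / (q + eps)
    return np.sum(ratio - np.log(ratio) - 1.0)

def eta_divergence(p, q, eta=2.0):
    """Eta- (norm-like) divergence, Eq. (30); eta = 2 gives the squared Euclidean distance."""
    return np.sum(p**eta + (eta - 1.0) * q**eta - eta * p * q**(eta - 1.0))

def beta_divergence(p, q, beta=0.5, eps=1e-12):
    """Beta-divergence, Eq. (32), for beta not in {0, 1}."""
    p, q = p + eps, q + eps
    return np.sum(p * (p**(beta - 1) - q**(beta - 1)) / (beta - 1)
                  - (p**beta - q**beta) / beta)
```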

4.2. Csiszár-f Divergences

Csiszár-f divergences are connected with the ratio test in the Pearson-Neyman style hypothesis testing and are in many ways natural concerning distributions and statistics. We denote by F the class of convex, real-valued, continuous functions f satisfying f(1) = 0, with

F = {g | g : [0, ∞) → IR, g convex}.   (34)

For a function f ∈ F the Csiszár f-divergence is given by

D_f(p‖q) = ∫ q f(p/q) dr   (35)

with the definitions 0·f(0/0) = 0 and 0·f(a/0) = lim_{ε→0} ε f(a/ε) = a lim_{u→∞} f(u)/u (5, , 7, ). The f-divergence can be interpreted as an average of the likelihood ratio p/q, describing the change rate of p with respect to q, weighted by the determining function f. For a general f, which does not have to be convex, with f′(1) = c_f ≠ 0, this form is not invariant and we have to use the generalized f-divergence

D_f^G(p‖q) = c_f ∫ (p − q) dr + ∫ q f(p/q) dr.   (36)

For the special case of probability densities p and q the first term vanishes and the original form of the f-divergence is obtained. Some basic properties of the Csiszár f-divergence are (9, ):

Non-negativity. D_f(p‖q) ≥ 0, where the equal sign holds iff p ≡ q; this follows from Jensen's inequality.

Generalized entropy. It corresponds to a generalized f-entropy of the form

H_f(p) = −∫ f(p(r)) dr.   (37)

Strict convexity. The f-divergence is convex in both arguments p and q:

D_f(t p₁ + (1 − t) p₂ ‖ t q₁ + (1 − t) q₂) ≤ t D_f(p₁‖q₁) + (1 − t) D_f(p₂‖q₂), t ∈ [0, 1].   (38)

Scalability. c D_f(p‖q) = D_{cf}(p‖q) for any positive constant c > 0.

Invariance. D_f(p‖q) is invariant with respect to a linear shift regarding the function f: e.g. D_f(p‖q) = D_f̃(p‖q) iff f̃(u) = f(u) + c (u − 1) for any constant c ∈ IR.

Symmetry. For f, f* ∈ F, where f*(u) = u f(1/u) denotes the conjugate function of f, the relation D_f(p‖q) = D_{f*}(q‖p) is valid. It is possible to construct a symmetric Csiszár f-divergence with f_sym(u) = f(u) + f*(u) as determining function.

Upper bound. The f-divergence is bounded by

D_f(p‖q) ≤ lim_{u→0⁺} {f(u) + f*(u)} with u = p/q.   (39)

The existence of this limit for probability densities p and q was shown by Liese and Vajda in (5). Villmann and Haase showed that these bounds still hold for positive measures p and q (3).

Monotonicity. The f-divergence is monotonic with respect to coarse-graining of the underlying domain of the positive measures p and q, which is similar to the monotonicity of the Fisher metric (7).

Some well-known examples of f-divergences are:

The subset of Alpha divergences ()

D_α(p‖q) = 1/(α(α−1)) ∫ [p^α q^{1−α} − α p + (α − 1) q] dr   (40)

is based on the determining function

f(u) = (u^α − α u + α − 1) / (α(α − 1)) with u = p/q   (41)

and α ∈ IR \ {0, 1}. For specific values of α the divergence becomes ():
α → 1: generalized Kullback-Leibler divergence, Eq. (25)
α → 0: reverse Kullback-Leibler divergence
α = −1: Neyman Chi-square distance
α = 2: Pearson Chi-square distance.
For α ≤ 0 the divergence is zero-forcing, i.e. p(r) = 0 enforces q(r) = 0. On the other hand, for α ≥ 1 it is zero-avoiding, i.e. q(r) > 0 whenever p(r) > 0. For α ≥ 1, q(r) covers p(r) completely and the Alpha divergence is called inclusive in this case. Furthermore, the Beta-divergences can be generated from the Alpha divergences by applying a nonlinear transformation (, 3).

The generalized Rényi divergence (5, )

D_GR^α(p‖q) = 1/(α(α−1)) log ( ∫ [p^α q^{1−α} − α p + (α − 1) q] dr + 1 ),   (42)

α ∈ IR \ {0, 1}, is closely related to the Alpha divergence. For the special case of probability densities the generalized Rényi divergence reduces to the Rényi divergence (5, 53)

D_R^α(p‖q) = 1/(α−1) log ( ∫ p^α q^{1−α} dr ),   (43)

which is based on the Rényi entropy.

The Tsallis divergence

D_T^α(p‖q) = 1/(1−α) ( 1 − ∫ p^α q^{1−α} dr )   (44)

for α ≠ 1 is a widely applied divergence for probability densities p and q based on the Tsallis entropy. It is also a rescaled version of the Alpha divergence. In the limit α → 1 it converges to the Kullback-Leibler divergence, Eq. (27).

The Hellinger divergence ()

D_H(p‖q) = ∫ (√p − √q)² dr   (45)

with generating function f(u) = (√u − 1)² for u = p/q is defined for probability densities p and q.
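Analogously, a sketch of the Csiszár-f examples of Eqs. (40) and (43)-(45) for discrete vectors; the Tsallis form is written as the usual rescaling of the Alpha divergence, and the default parameter values are ours.

```python
import numpy as np

def alpha_divergence(p, q, alpha=2.0):
    """Alpha divergence, Eq. (40), for alpha not in {0, 1}."""
    return np.sum(p**alpha * q**(1 - alpha) - alpha * p + (alpha - 1) * q) / (alpha * (alpha - 1))

def renyi_divergence(p, q, alpha=0.5):
    """Rényi divergence, Eq. (43), for probability densities."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def tsallis_divergence(p, q, alpha=0.5):
    """Tsallis divergence, Eq. (44), for probability densities."""
    return (1.0 - np.sum(p**alpha * q**(1 - alpha))) / (1.0 - alpha)

def hellinger_divergence(p, q):
    """Hellinger divergence, Eq. (45), for probability densities."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
```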

Figure 2: Isosurfaces of some example divergences, including the plane of probability densities, with respect to the reference point (.3, .3, .3). The first column shows Bregman divergences, the second Csiszár-f divergences and the last column shows the Gamma divergence for different values of γ.

Figure 3: Equidistance lines of some example divergences for probability densities with respect to the reference point (.3, .3, .3). The columns show Bregman divergences, Csiszár-f divergences and Gamma divergences. (Panels include the Kullback-Leibler, Itakura-Saito, Eta-, Beta-, Alpha-, Rényi, Tsallis, Hellinger and Gamma divergences for several parameter values.)

Figure 4: Isosurfaces of some example divergences with respect to the reference point (.5, ., .3). The cutoffs show the equidistance lines for this plane. The first column shows Bregman divergences, the second Csiszár-f divergences and the last column shows the Gamma divergence for different values of γ.

Figure 5: Equidistance lines of some example divergences for probability densities with respect to the reference point (.5, ., .3). The columns show Bregman divergences, Csiszár-f divergences and Gamma divergences. (Panels include the Kullback-Leibler, Itakura-Saito, Eta-, Beta-, Alpha-, Rényi, Tsallis, Hellinger and Gamma divergences for several parameter values.)

4.3. Gamma Divergence

The Gamma divergence is very robust with respect to outliers (5) and was proposed by Fujisawa and Eguchi:

D_γ(p‖q) = log { [∫ p^{γ+1} dr]^{1/(γ(γ+1))} [∫ q^{γ+1} dr]^{1/(γ+1)} / [∫ p q^γ dr]^{1/γ} }.   (46)

It is robust for γ ∈ [0, 1]. In the limit γ → 0 it becomes the Kullback-Leibler divergence D_KL(p‖q) for probability densities. For γ = 1 it becomes the Cauchy-Schwarz divergence

D_CS(p‖q) = ½ log ( ∫ q² dr ∫ p² dr ) − log ( ∫ p q dr ),   (47)

which is based on the quadratic Rényi entropy. The Cauchy-Schwarz divergence is symmetric and was introduced considering the Cauchy-Schwarz inequality for norms. It is frequently applied for Parzen window estimation and is especially suitable for spectral clustering as well as related graph cut problems (55, 5, 57, 3). Some isosurfaces of the Gamma divergence for different values of γ are shown in the last column of Figs. 2 and 4. The equidistance lines for the special case of probability densities can be found in the last column of Figs. 3 and 5. The Gamma divergence displays some nice properties (, 3):

Invariance. D_γ(p‖q) is invariant under scalar multiplication with positive constants:

D_γ(p‖q) = D_γ(c₁ p ‖ c₂ q), c₁, c₂ > 0.   (48)

In case of positive measures the equation D_γ(p‖q) = 0 holds only if p = c·q with c > 0. For probability densities c = 1 is required.

Pythagorean relation. As for Bregman divergences, a modified Pythagorean relation between positive measures can be stated for special choices of p, q, ρ. Let p be a distortion of q, defined as a convex combination with a positive distortion measure φ(r):

p_ε(r) = (1 − ε) q(r) + ε φ(r).   (49)

A positive measure g is denoted as φ-consistent if ν_g = (∫ φ(r) g(r)^α dr)^{1/α} is sufficiently small for large α > 0. If two positive measures q and ρ are φ-consistent with respect to a distortion measure φ, then the Pythagorean relation approximately holds for q, ρ and the distortion p_ε of q:

Δ(p_ε, q, ρ) = D_γ(p_ε‖ρ) − D_γ(p_ε‖q) − D_γ(q‖ρ) = O(ε ν^γ) with ν = max{ν_q, ν_ρ}.   (50)

This property implies the robustness of D_γ with respect to distortions.
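A discrete sketch of the Gamma divergence, Eq. (46), and its Cauchy-Schwarz special case, Eq. (47); the small additive constant guarding against zero entries is our own choice and not part of the definition.

```python
import numpy as np

def gamma_divergence(p, q, gamma=0.5, eps=1e-12):
    """Gamma divergence, Eq. (46); robust for gamma in (0, 1]."""
    p, q = p + eps, q + eps
    log_p = np.log(np.sum(p**(gamma + 1))) / (gamma * (gamma + 1))
    log_q = np.log(np.sum(q**(gamma + 1))) / (gamma + 1)
    log_pq = np.log(np.sum(p * q**gamma)) / gamma
    return log_p + log_q - log_pq

def cauchy_schwarz_divergence(p, q, eps=1e-12):
    """Cauchy-Schwarz divergence, Eq. (47): the Gamma divergence at gamma = 1."""
    p, q = p + eps, q + eps
    return 0.5 * np.log(np.sum(p**2)) + 0.5 * np.log(np.sum(q**2)) - np.log(np.sum(p * q))
```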

5. Discussion of Divergences

In this section we examine and compare some of the introduced divergences by means of controlled experiments. We investigate the behavior of different divergences for the comparison of images containing an increasing level of (non-linear) noise. Therefore, we compute the histograms of gray-value images taken from the Berkeley segmentation data set and of noisy versions of them.

Figure 6: Histograms of intensity values in an example picture. The original image "moon" (top row) together with its histogram is shown on the left side. The following pictures contain noise in the form of a linear monotonically increasing transformation of gray values following Eq. (51) using l = 1, 2, ..., 9, corresponding to the Noise Levels 1 till 9.

5.1. Linearly monotonically increasing noise

In the first experiment the noisy image I′ is obtained by adding a linear monotonically increasing transformation of gray values to the image I:

I′(x, y) = I(x, y) · [l · (I(x, y) − I_min) + 1],   (51)

where l denotes the level of noise and I_min corresponds to the minimal intensity in the original image. Figure 6 shows a picture (in the following referred to as "moon") with different levels of noise following Eq. (51), together with the gray-value histograms. The noise level ranges from l = 1 to l = 9. Some dissimilarity matrices comparing the ten histograms with different divergence measures are shown in Figure 7. The intuitively ideal dissimilarity matrix in this case is a symmetric band matrix, shown in the middle of the top row. Some divergences, like the generalized Rényi divergence, show numerical instabilities. Others show quite similar behavior, e.g. Itakura-Saito, Alpha divergences and the Beta-divergence with β = .5, but they do not exhibit the desired band structure. For the original image and low noise levels (images 1-5) the Beta-divergence with β = .5, the Alpha divergence with α = .5 and also the generalized KL divergence show a bit of the desired band structure. Ignoring the last column and last row (the extreme case) in the dissimilarity matrix, the Eta-divergence shows a good approximation of a band matrix. The Gamma divergence is observed to be quite robust in this case and also exhibits a visible band structure for suitable values of γ. In the special case γ = 1 the Gamma divergence equals the Cauchy-Schwarz divergence and is symmetric. Another symmetric example is the Alpha divergence with α = .5.
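The first experiment can be mimicked with a few lines of NumPy; the noise transformation implements Eq. (51), while the histogram binning and the divergence interface are our own assumptions rather than the exact setup of the paper.

```python
import numpy as np

def add_level_noise(I, level):
    """Gray-value distortion of Eq. (51): I'(x, y) = I(x, y) * (level * (I(x, y) - I_min) + 1)."""
    return I * (level * (I - I.min()) + 1.0)

def histogram(I, bins=256):
    """Normalized gray-value histogram of an image (binning is an assumption)."""
    h, _ = np.histogram(I.ravel(), bins=bins, density=True)
    return h

def dissimilarity_matrix(histograms, divergence, **kwargs):
    """Pairwise dissimilarities between histograms under a chosen divergence."""
    n = len(histograms)
    return np.array([[divergence(histograms[i], histograms[j], **kwargs)
                      for j in range(n)] for i in range(n)])
```

For instance, `dissimilarity_matrix([histogram(add_level_noise(I, l)) for l in range(10)], gamma_divergence, gamma=1.0)` produces matrices of the kind shown in Figure 7.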

Figure 7: Matrices of pairwise dissimilarities of the ten histograms shown in Figure 6 using different divergences (panels include the generalized Kullback-Leibler, Itakura-Saito, generalized Rényi, Eta-, Alpha-, Beta- and Gamma divergences for several parameter values). The ideal dissimilarity matrix for this example is a band matrix, shown in the middle of the top row. Some divergences (marked with an asterisk in the title) show numerical instabilities in case of zeros in the signals; in those cases a small constant c was added to all histograms to prevent the degeneration. Other divergences, like e.g. the Gamma divergence, are more robust. The Eta-divergence (ignoring the extreme cases) and the Gamma divergence exhibit more of the desired band structure for this example compared to other choices.

Figure 8: Histograms of intensity values in an example picture. The original image "dolphins" (top row) together with its histogram is shown on the left side. The following pictures contain noise in the form of a linear monotonically increasing transformation of gray values following Eq. (51) using l = [.1, .2, ..., .9], corresponding to the Noise Levels 1 till 9.

As a second example we take a picture of a group of dolphins and add some noise (following Eq. (51)) using the levels l = [.1, .2, ..., .9]. The resulting histograms of gray values for the different noise levels are shown in Figure 8. As above, we compute the matrices of pairwise similarities between the histograms using different divergences. The results can be found in Figure 9. In this example the Eta-divergence, especially with η = .5, is a good approximation of the ideal dissimilarity matrix shown in the middle of the top row. The best symmetric choice is the Gamma divergence with γ = 1 (Cauchy-Schwarz). Furthermore, dependent on the value of γ one can choose between a better resolution (local) and a better preservation of the hierarchy of the histograms (global). Some other divergences, e.g. the generalized KL and Itakura-Saito, show very poor approximations of the desired dissimilarity for this example.

5.2. Additive uniform noise

In the second experiment the noisy image I′ is obtained by adding uniform noise to the image I:

I′(x, y) = I(x, y) + U(0, l),   (52)

where U(0, l) denotes a scalar value drawn from the uniform distribution in the interval [0, l]. Figure 10 shows the picture of dolphins with different levels of additive uniform noise following Eq. (52), together with the more and more flattened gray-value histograms. The noise level again ranges over nine increasing values of l. Some dissimilarity matrices pairwise comparing the ten images with different divergence measures are shown in Figure 11.

Figure 9: Matrices of pairwise dissimilarities of the ten histograms shown in Figure 8 using different divergences. The ideal dissimilarity matrix for this example is a band matrix, shown in the middle of the top row. Some divergences (marked with an asterisk in the title) show numerical instabilities in case of zeros in the signals; in those cases a small constant c was added to all histograms to prevent the degeneration. The Eta-divergence, especially with η = .5, shows a good approximation of the desired band structure for this example. The Gamma divergence with γ = 1 (Cauchy-Schwarz) is the best symmetric choice in this case.

Figure 10: Histograms of intensity values in an example picture. The original image "dolphins" (top row) together with its histogram is shown on the left side. The following pictures contain additive uniform noise following Eq. (52) using nine increasing values of l, corresponding to the Noise Levels 1 till 9.

Some divergences, like the generalized Rényi, Itakura-Saito and some Alpha- and Beta-divergences, fail to approximate the desired band structure in the pairwise dissimilarity matrix. Others, like the Gamma-, Eta- and some Alpha- and Beta-divergences, are nearly ideal for this example. The Kullback-Leibler divergence is nearly perfect if the original image is ignored.

6. The Fréchet Derivative

In this section we introduce the concept of Fréchet derivatives used for the generalization to arbitrary divergences. Suppose V and W are Banach spaces and U ⊆ V is an open subset of V. The function f : U → W is called Fréchet differentiable at r ∈ U if there exists a bounded linear operator A_r : V → W such that for h ∈ U

lim_{‖h‖_V → 0} ‖f(r + h) − f(r) − A_r(h)‖_W / ‖h‖_V = 0.   (53)

This general definition can be used for functions L : B → IR, defined as mappings from a functional Banach space B to IR. Further, let B be equipped with a norm ‖·‖ and let f, h ∈ B be two functionals. The Fréchet derivative δL[f]/δf of L at the point f (i.e. in a function f) in the direction h is formally defined as:

lim_{ε→0} (1/ε) (L[f + εh] − L[f]) =: [δL[f]/δf] [h].   (54)

The Fréchet derivative in finite-dimensional spaces reduces to the usual partial derivative. Thus, it is a generalization of the directional derivatives. Following (3) we introduce the functional derivatives of divergences in the next paragraphs. An overview is given in Table 3.

Figure 11: Dissimilarity matrices comparing the ten histograms shown in Figure 10 using different divergences. The ideal dissimilarity matrix for this example is a band matrix, shown in the middle of the top row. Some divergences (marked with an asterisk in the title) show numerical instabilities in case of zeros in the signals; in those cases a small constant c was added to all histograms to prevent the degeneration. In this example the Eta-, Beta-, Gamma and the Alpha divergences with α = .5 show good approximations of the ideal band structure. Ignoring the original image, also KL is nearly perfect. Other divergences like Itakura-Saito and generalized Rényi fail in this example.

6.1. Fréchet Derivatives: Bregman Divergences

The Fréchet derivative of D_B^Φ, Eq. (19), with respect to q is formally given by

δD_B^Φ(p‖q)/δq = −δΦ(q)/δq − δ[(δΦ(q)/δq)(p − q)]/δq

with

δ[(δΦ(q)/δq)(p − q)]/δq = (δ²[Φ(q)]/δq²)(p − q) − δΦ(q)/δq.

For the generalized Kullback-Leibler divergence, Eq. (25), this simplifies to

δD_GKL(p‖q)/δq = −p/q + 1,   (55)

whereas for the Kullback-Leibler divergence, Eq. (27), in the special case of probability densities it reads

δD_KL(p‖q)/δq = −p/q.   (56)

For the Itakura-Saito divergence, Eq. (28), we get

δD_IS(p‖q)/δq = (q − p)/q²   (57)

and for the Eta-divergence, Eq. (30), the Fréchet derivative is

δD_η(p‖q)/δq = η(η − 1) q^{η−2} (q − p).   (58)

In the case of η = 2 it reduces to the derivative of the squared Euclidean distance, 2(q − p). The Fréchet derivative for the subset of Beta-divergences, Eq. (32), is given by

δD_β(p‖q)/δq = −p q^{β−2} + q^{β−1}   (59)
= q^{β−2} (q − p).   (60)

6.2. Fréchet Derivatives: Csiszár-f Divergences

For the Csiszár f-divergences, Eq. (35), the Fréchet derivative is

δD_f(p‖q)/δq = f(p/q) + q [δf(u)/δu] [δu/δq]   (61)
= f(p/q) − (p/q) f′(p/q),   (62)

with u = p/q. For the set of Alpha divergences, Eq. (40), we get

δD_α(p‖q)/δq = (1/α) (1 − p^α q^{−α}).   (63)

The related generalized Rényi divergence, Eq. (42), yields

δD_GR^α(p‖q)/δq = (1 − p^α q^{−α}) / ( α ( ∫ [p^α q^{1−α} − α p + (α − 1) q] dr + 1 ) ),   (64)

which reduces, in the case of the Rényi divergence for probability densities, to

δD_R^α(p‖q)/δq = −p^α q^{−α} / ∫ p^α q^{1−α} dr.   (65)

For the Tsallis divergence, Eq. (44), the Fréchet derivative reads

δD_T^α(p‖q)/δq = −p^α q^{−α}   (66)

and for the well-known Hellinger divergence, Eq. (45), the derivative is

δD_H(p‖q)/δq = 1 − √(p/q).   (67)

6.3. Fréchet Derivatives: Gamma Divergences

The Fréchet derivative of the Gamma divergence, Eq. (46), can be written as

δD_γ(p‖q)/δq = q^γ / ∫ q^{γ+1} dr − p q^{γ−1} / ∫ p q^γ dr.   (68)

Considering the important special case γ = 1, i.e. the Cauchy-Schwarz divergence, Eq. (47),

δD_CS(p‖q)/δq = q / ∫ q² dr − p / ∫ p q dr.   (69)
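In the discrete case the Fréchet derivatives above reduce to ordinary partial derivatives, which makes them easy to check numerically; a minimal sketch for the generalized Kullback-Leibler case, Eq. (55), with our own test values:

```python
import numpy as np

def frechet_gkl(p, q):
    """Fréchet derivative of the generalized KL divergence, Eq. (55): -p/q + 1."""
    return -p / q + 1.0

def d_gkl(p, q):
    """Discrete generalized KL divergence, Eq. (25)."""
    return np.sum(p * np.log(p / q) - (p - q))

# Finite-difference check: perturb a single component q_k.
rng = np.random.default_rng(0)
p = rng.random(5) + 0.1
q = rng.random(5) + 0.1
eps, k = 1e-6, 2
q_plus = q.copy()
q_plus[k] += eps
numeric = (d_gkl(p, q_plus) - d_gkl(p, q)) / eps
print(numeric, frechet_gkl(p, q)[k])   # both approximately -p_k/q_k + 1
```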

Table 3: Divergences and their Fréchet derivatives (with respect to q).

- Bregman divergence: D_B^Φ(p‖q) = Φ(p) − Φ(q) − ∫ (δΦ(q)/δq)(p − q) dr; δD_B^Φ/δq = −δΦ(q)/δq − δ[(δΦ(q)/δq)(p − q)]/δq
- gen. Kullback-Leibler: D_GKL(p‖q) = ∫ p log(p/q) dr − ∫ (p − q) dr; δD_GKL/δq = −p/q + 1
- Kullback-Leibler: D_KL(p‖q) = ∫ p log(p/q) dr; δD_KL/δq = −p/q
- Itakura-Saito: D_IS(p‖q) = ∫ [p/q − log(p/q) − 1] dr; δD_IS/δq = (q − p)/q²
- Eta-divergence: D_η(p‖q) = ∫ [p^η + (η − 1) q^η − η p q^{η−1}] dr; δD_η/δq = η(η − 1) q^{η−2} (q − p)
- Beta-divergence: D_β(p‖q) = ∫ p (p^{β−1} − q^{β−1})/(β − 1) dr − ∫ (p^β − q^β)/β dr; δD_β/δq = q^{β−2} (q − p)
- gen. Csiszár-f: D_f^G(p‖q) = c_f ∫ (p − q) dr + ∫ q f(p/q) dr, c_f = f′(1); δD_f^G/δq = f(p/q) − (p/q) f′(p/q) − c_f
- Csiszár-f divergence: D_f(p‖q) = ∫ q f(p/q) dr; δD_f/δq = f(p/q) − (p/q) f′(p/q)
- Alpha divergence: D_α(p‖q) = 1/(α(α−1)) ∫ [p^α q^{1−α} − α p + (α − 1) q] dr; δD_α/δq = (1 − p^α q^{−α})/α
- gen. Rényi: D_GR^α(p‖q) = 1/(α(α−1)) log(∫ [p^α q^{1−α} − α p + (α − 1) q] dr + 1); δD_GR^α/δq = (1 − p^α q^{−α}) / (α (∫ [p^α q^{1−α} − α p + (α − 1) q] dr + 1))
- Rényi: D_R^α(p‖q) = 1/(α−1) log(∫ p^α q^{1−α} dr); δD_R^α/δq = −p^α q^{−α} / ∫ p^α q^{1−α} dr
- Tsallis: D_T^α(p‖q) = 1/(1−α) (1 − ∫ p^α q^{1−α} dr); δD_T^α/δq = −p^α q^{−α}
- Hellinger: D_H(p‖q) = ∫ (√p − √q)² dr; δD_H/δq = 1 − √(p/q)
- Gamma: D_γ(p‖q) = log[(∫ p^{γ+1} dr)^{1/(γ(γ+1))} (∫ q^{γ+1} dr)^{1/(γ+1)} / (∫ p q^γ dr)^{1/γ}]; δD_γ/δq = q^γ/∫ q^{γ+1} dr − p q^{γ−1}/∫ p q^γ dr
- Cauchy-Schwarz: D_CS(p‖q) = ½ log(∫ q² dr ∫ p² dr) − log(∫ p q dr); δD_CS/δq = q/∫ q² dr − p/∫ p q dr

7. t-SNE gradients for various divergences

In this section we explain the t-SNE gradients for various divergences. There exists a large variety of divergences which can be collected into several classes according to their mathematical properties and structural behavior. Here we follow the classification proposed in (). For this purpose, we plug the corresponding Fréchet derivatives into the general gradient Eq. (14) for t-SNE. Clearly, one can convey these results easily to the general SNE gradient Eq. (16) in complete analogy, because of its structural similarity to the t-SNE formula Eq. (14).

A technical remark should be made here: in the following we will abbreviate p(r) by p and p(r′) by p′. Further, because the integration variable r is a function r = r(ξ, ζ), an integration requires the weighting according to the distribution Π_r. Thus, the integration formally has to be carried out according to the differential dΠ_r(r) (Stieltjes integral). We abbreviate this by dr but keep this fact in mind, i.e. by this convention we will drop the distribution Π_r if it is clear from the context.

7.1. Bregman divergences

In the following we provide the gradients for some examples of Bregman divergences introduced in Section 4.1. As a first example we show that we obtain the same result as van der Maaten and Hinton in () for the Kullback-Leibler divergence, Eq. (27). The Fréchet derivative of D_KL with respect to q is given in Eq. (56). From Eq. (14) we see that

∂D_KL/∂ξ = 2 ∫ [q (ξ − ζ)/(1 + r)] ( p/q − ∫ (p′/q′) q′ Π_r dr′ ) dζ.   (70)

Since the integral I = ∫ p′ Π_r dr′ in Eq. (70) can be written as a double integral over all pairs of data points, I = ∫∫ p dξ′ dζ′, we see from Eq. (8) that the integral I equals 1. So Eq. (70) simplifies to

∂D_KL/∂ξ = 2 ∫ [q/(1 + r)] ( p/q − 1 ) (ξ − ζ) dζ = 2 ∫ (1 + r)^{−1} (p − q) (ξ − ζ) dζ.   (71)

This is exactly the differential form of the discrete version as proposed for t-SNE in ().

The Kullback-Leibler divergence used in original SNE and t-SNE belongs to the more general class of Bregman divergences (3). Another representative of this class of divergences is the Itakura-Saito divergence D_IS, Eq. (28), with the Fréchet derivative Eq. (57). For the calculation of the gradient ∂D_IS/∂ξ we substitute the Fréchet derivative in Eq. (14) and obtain

∂D_IS/∂ξ = −2 ∫ [q/(1 + r)] ( (q − p)/q² − ∫ [(q′ − p′)/q′] Π_r dr′ ) (ξ − ζ) dζ
= 2 ∫ [(ξ − ζ)/(1 + r)] [ p/q − 1 + q ( 1 − ∫ (p′/q′) Π_r dr′ ) ] dζ.   (72)
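The equivalence between Eq. (71) and the standard discrete t-SNE gradient can be checked numerically with the following self-contained sketch (our own naming, prefactor convention and random test data); both routines should produce the same gradient up to floating-point error.

```python
import numpy as np

def standard_tsne_grad(Xi, P):
    """Standard discrete t-SNE gradient: 4 * sum_j (p_ij - q_ij)(xi_i - xi_j) / (1 + r_ij)."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + r)
    np.fill_diagonal(num, 0.0)
    Q = num / np.sum(num)
    W = (P - Q) * num
    return 4.0 * (np.sum(W, axis=1)[:, None] * Xi - W @ Xi)

def general_grad_kl(Xi, P):
    """Same gradient via the generalized form of Eq. (14) with dD_KL/dq = -p/q."""
    r = np.sum((Xi[:, None, :] - Xi[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + r)
    np.fill_diagonal(num, 0.0)
    Q = num / np.sum(num)
    dD = -P / (Q + 1e-12)
    W = Q * num * (dD - np.sum(dD * Q))
    return -4.0 * (np.sum(W, axis=1)[:, None] * Xi - W @ Xi)

rng = np.random.default_rng(1)
Xi = rng.standard_normal((6, 2))
P = rng.random((6, 6))
np.fill_diagonal(P, 0.0)
P = (P + P.T) / np.sum(P + P.T)   # symmetric, normalized joint probabilities
print(np.allclose(standard_tsne_grad(Xi, P), general_grad_kl(Xi, P)))  # expected: True
```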

One more Bregman divergence is the norm-like or Eta-divergence, Eq. (30). The Fréchet derivative of D_η with respect to q is given in Eq. (58). Again, we are interested in the gradient ∂D_η/∂ξ, which is

∂D_η/∂ξ = 2 η(η − 1) ∫ [(ξ − ζ)/(1 + r)] ( (p − q) q^{η−1} − q ∫ (p′ − q′) q′^{η−1} Π_r dr′ ) dζ.   (73)

The last example of Bregman divergences we handle in this paper is the class of Beta-divergences defined in Eq. (32). We use Eq. (14) and insert the Fréchet derivative of the Beta-divergences, given by Eq. (60). Thereby the gradient ∂D_β/∂ξ reads as

∂D_β/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] ( q^{β−1} (p − q) − q ∫ q′^{β−1} (p′ − q′) Π_r dr′ ) dζ.   (74)

7.2. Csiszár's f-divergences

Now we consider some divergences belonging to the class of Csiszár's f-divergences (see Section 4.2). A well-known example is the Hellinger divergence defined in Eq. (45), with the Fréchet derivative Eq. (67). The gradient of D_H with respect to ξ is

∂D_H/∂ξ = −2 ∫ [q/(1 + r)] ( (1 − √(p/q)) − ∫ (1 − √(p′/q′)) q′ Π_r dr′ ) (ξ − ζ) dζ
= 2 ∫ [(ξ − ζ)/(1 + r)] ( √(p q) − q ∫ √(p′ q′) Π_r dr′ ) dζ.   (75)

For the Alpha divergence, see Eqs. (40) and (63), we get

∂D_α/∂ξ = (2/α) ∫ [(ξ − ζ)/(1 + r)] ( p^α q^{1−α} − q ∫ p′^α q′^{1−α} Π_r dr′ ) dζ.   (76)

For the Tsallis divergence, Eqs. (44) and (66), we get

∂D_T^α/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] ( p^α q^{1−α} − q ∫ p′^α q′^{1−α} Π_r dr′ ) dζ,   (77)

which is also clear from Eq. (76), since the Tsallis divergence is a rescaled version of the Alpha divergence for probability densities. For the Rényi divergences, Eqs. (43) and (65), the derivative reads

∂D_R^α/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] ( p^α q^{1−α} / ∫ p^α q^{1−α} dr − q ) dζ.   (78)

Table 4: Divergences and their t-SNE gradients. For discrete data {x_i}_{i=1}^n ∈ IR^N and embeddings {ξ_i}_{i=1}^n ∈ IR^M, the discrete gradients are obtained from the functional gradients by the substitutions p → p_ij = p_{x_i x_j}, q → q_ij = q_{ξ_i ξ_j}, r → r_ij = ‖ξ_i − ξ_j‖², ∫ (·) Π_r dr′ → Σ_{kl} (·) and 2 ∫ (·) dζ → 4 Σ_j (·); e.g. for the Kullback-Leibler divergence ∂D_KL/∂ξ_i = 4 Σ_j (ξ_i − ξ_j)(p_ij − q_ij)/(1 + ‖ξ_i − ξ_j‖²).

- Kullback-Leibler, Eq. (27): ∂D_KL/∂ξ = 2 ∫ (ξ − ζ)(p − q)/(1 + r) dζ
- Itakura-Saito, Eq. (28): ∂D_IS/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p/q − 1 + q (1 − ∫ p′/q′ Π_r dr′)] dζ
- Eta-divergence, Eq. (30): ∂D_η/∂ξ = 2 η(η − 1) ∫ [(ξ − ζ)/(1 + r)] [(p − q) q^{η−1} − q ∫ (p′ − q′) q′^{η−1} Π_r dr′] dζ
- Beta-divergence, Eq. (32): ∂D_β/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [q^{β−1} (p − q) − q ∫ q′^{β−1} (p′ − q′) Π_r dr′] dζ
- Alpha divergence, Eq. (40): ∂D_α/∂ξ = (2/α) ∫ [(ξ − ζ)/(1 + r)] [p^α q^{1−α} − q ∫ p′^α q′^{1−α} Π_r dr′] dζ
- Rényi divergence, Eq. (43): ∂D_R^α/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p^α q^{1−α}/∫ p^α q^{1−α} dr − q] dζ
- Tsallis divergence, Eq. (44): ∂D_T^α/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p^α q^{1−α} − q ∫ p′^α q′^{1−α} Π_r dr′] dζ
- Hellinger divergence, Eq. (45): ∂D_H/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [√(p q) − q ∫ √(p′ q′) Π_r dr′] dζ
- Gamma divergence, Eq. (46): ∂D_γ/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p q^γ/∫ p q^γ dr − q^{γ+1}/∫ q^{γ+1} dr] dζ
- Cauchy-Schwarz, Eq. (47): ∂D_CS/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] [p q/∫ p q dr − q²/∫ q² dr] dζ

7.3. Gamma divergences

The Fréchet derivative of D_γ(p‖q) with respect to q, given in Eq. (68), can be rewritten as

δD_γ(p‖q)/δq = q^γ/Q_γ − p q^{γ−1}/V_γ = (q^γ V_γ − p q^{γ−1} Q_γ) / (Q_γ V_γ),

with Q_γ = ∫ q^{γ+1} dr and V_γ = ∫ p q^γ dr. Once again, we use Eq. (14) to calculate the gradient of D_γ with respect to ξ. Noting that

∫ [δD_γ/δq(r′)] q(r′) Π_r dr′ = ∫ q′^{γ+1} Π_r dr′ / Q_γ − ∫ p′ q′^γ Π_r dr′ / V_γ = 1 − 1 = 0,

the gradient becomes

∂D_γ/∂ξ = −2 ∫ [q (ξ − ζ)/(1 + r)] ( q^γ/Q_γ − p q^{γ−1}/V_γ ) dζ = 2 ∫ [(ξ − ζ)/(1 + r)] ( p q^γ / ∫ p q^γ dr − q^{γ+1} / ∫ q^{γ+1} dr ) dζ.   (79)

For the special choice γ = 1 the Gamma divergence becomes the Cauchy-Schwarz divergence, Eq. (47), and the gradient ∂D_CS/∂ξ for t-SNE can be directly derived from Eq. (79):

∂D_CS/∂ξ = 2 ∫ [(ξ − ζ)/(1 + r)] ( p q / ∫ p q dr − q² / ∫ q² dr ) dζ.   (80)

Moreover, similar derivations can be made for any other divergence, since one only needs to calculate the Fréchet derivative of the divergence and apply it to Eq. (14).

8. Demonstration of different divergences

In this section we demonstrate the use of different divergences in the t-SNE method on the basis of the Olivetti faces data set (publicly available from roweis/data.html) and the COIL-20 data set (5). In the experiments we compare one divergence from each of the three main families: Kullback-Leibler, Rényi and Gamma as examples for Bregman, Csiszár-f and Gamma divergences. For the Gamma divergence we include the special case of Cauchy-Schwarz in the choice of the parameter γ, and the Rényi divergence is closely related to the Alpha divergence as shown in (). The Olivetti data set consists of intensity-value pictures of 40 individuals with small variations in viewpoint, large variation in expression and occasional addition of glasses. The data set contains 400 images (10 per person) of size 64 × 64. The COIL-20 data set contains images of 20 different objects viewed from 72 equally spaced orientations. In total we have 1,440 images of 128 × 128 = 16,384 pixels. As suggested in (), we preprocessed the data by extracting the mean and reducing the dimension to 30 using PCA, with a successive transformation to unit-variance features. For the experiments we constructed a set of independent random initializations, which we reused in the algorithm with different divergences and values of the divergence parameter. To compare the different embeddings we use the one near-


More information

Bregman Divergences. Barnabás Póczos. RLAI Tea Talk UofA, Edmonton. Aug 5, 2008

Bregman Divergences. Barnabás Póczos. RLAI Tea Talk UofA, Edmonton. Aug 5, 2008 Bregman Divergences Barnabás Póczos RLAI Tea Talk UofA, Edmonton Aug 5, 2008 Contents Bregman Divergences Bregman Matrix Divergences Relation to Exponential Family Applications Definition Properties Generalization

More information

Mid Term-1 : Practice problems

Mid Term-1 : Practice problems Mid Term-1 : Practice problems These problems are meant only to provide practice; they do not necessarily reflect the difficulty level of the problems in the exam. The actual exam problems are likely to

More information

Basic Properties of Metric and Normed Spaces

Basic Properties of Metric and Normed Spaces Basic Properties of Metric and Normed Spaces Computational and Metric Geometry Instructor: Yury Makarychev The second part of this course is about metric geometry. We will study metric spaces, low distortion

More information

MAP Examples. Sargur Srihari

MAP Examples. Sargur Srihari MAP Examples Sargur srihari@cedar.buffalo.edu 1 Potts Model CRF for OCR Topics Image segmentation based on energy minimization 2 Examples of MAP Many interesting examples of MAP inference are instances

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Information Geometric view of Belief Propagation

Information Geometric view of Belief Propagation Information Geometric view of Belief Propagation Yunshu Liu 2013-10-17 References: [1]. Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, Stochastic reasoning, Free energy and Information Geometry, Neural

More information

On the Chi square and higher-order Chi distances for approximating f-divergences

On the Chi square and higher-order Chi distances for approximating f-divergences On the Chi square and higher-order Chi distances for approximating f-divergences Frank Nielsen, Senior Member, IEEE and Richard Nock, Nonmember Abstract We report closed-form formula for calculating the

More information

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Machine Learning Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 47 Table of contents 1 Introduction

More information

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization

More information

Variable selection and feature construction using methods related to information theory

Variable selection and feature construction using methods related to information theory Outline Variable selection and feature construction using methods related to information theory Kari 1 1 Intelligent Systems Lab, Motorola, Tempe, AZ IJCNN 2007 Outline Outline 1 Information Theory and

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

Convexity/Concavity of Renyi Entropy and α-mutual Information

Convexity/Concavity of Renyi Entropy and α-mutual Information Convexity/Concavity of Renyi Entropy and -Mutual Information Siu-Wai Ho Institute for Telecommunications Research University of South Australia Adelaide, SA 5095, Australia Email: siuwai.ho@unisa.edu.au

More information

VECTOR-QUANTIZATION BY DENSITY MATCHING IN THE MINIMUM KULLBACK-LEIBLER DIVERGENCE SENSE

VECTOR-QUANTIZATION BY DENSITY MATCHING IN THE MINIMUM KULLBACK-LEIBLER DIVERGENCE SENSE VECTOR-QUATIZATIO BY DESITY ATCHIG I THE IIU KULLBACK-LEIBLER DIVERGECE SESE Anant Hegde, Deniz Erdogmus, Tue Lehn-Schioler 2, Yadunandana. Rao, Jose C. Principe CEL, Electrical & Computer Engineering

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo September 6, 2011 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Online Nonnegative Matrix Factorization with General Divergences

Online Nonnegative Matrix Factorization with General Divergences Online Nonnegative Matrix Factorization with General Divergences Vincent Y. F. Tan (ECE, Mathematics, NUS) Joint work with Renbo Zhao (NUS) and Huan Xu (GeorgiaTech) IWCT, Shanghai Jiaotong University

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Remarks on Extremization Problems Related To Young s Inequality

Remarks on Extremization Problems Related To Young s Inequality Remarks on Extremization Problems Related To Young s Inequality Michael Christ University of California, Berkeley University of Wisconsin May 18, 2016 Part 1: Introduction Young s convolution inequality

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Non-Negative Matrix Factorization with Quasi-Newton Optimization

Non-Negative Matrix Factorization with Quasi-Newton Optimization Non-Negative Matrix Factorization with Quasi-Newton Optimization Rafal ZDUNEK, Andrzej CICHOCKI Laboratory for Advanced Brain Signal Processing BSI, RIKEN, Wako-shi, JAPAN Abstract. Non-negative matrix

More information

Applications of Information Geometry to Hypothesis Testing and Signal Detection

Applications of Information Geometry to Hypothesis Testing and Signal Detection CMCAA 2016 Applications of Information Geometry to Hypothesis Testing and Signal Detection Yongqiang Cheng National University of Defense Technology July 2016 Outline 1. Principles of Information Geometry

More information

Relative Loss Bounds for Multidimensional Regression Problems

Relative Loss Bounds for Multidimensional Regression Problems Relative Loss Bounds for Multidimensional Regression Problems Jyrki Kivinen and Manfred Warmuth Presented by: Arindam Banerjee A Single Neuron For a training example (x, y), x R d, y [0, 1] learning solves

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

Chemometrics: Classification of spectra

Chemometrics: Classification of spectra Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture

More information

Why is Deep Learning so effective?

Why is Deep Learning so effective? Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Spazi vettoriali e misure di similaritá

Spazi vettoriali e misure di similaritá Spazi vettoriali e misure di similaritá R. Basili Corso di Web Mining e Retrieval a.a. 2009-10 March 25, 2010 Outline Outline Spazi vettoriali a valori reali Operazioni tra vettori Indipendenza Lineare

More information

Feature selection and extraction Spectral domain quality estimation Alternatives

Feature selection and extraction Spectral domain quality estimation Alternatives Feature selection and extraction Error estimation Maa-57.3210 Data Classification and Modelling in Remote Sensing Markus Törmä markus.torma@tkk.fi Measurements Preprocessing: Remove random and systematic

More information

Bayes spaces: use of improper priors and distances between densities

Bayes spaces: use of improper priors and distances between densities Bayes spaces: use of improper priors and distances between densities J. J. Egozcue 1, V. Pawlowsky-Glahn 2, R. Tolosana-Delgado 1, M. I. Ortego 1 and G. van den Boogaart 3 1 Universidad Politécnica de

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Olga Kouropteva, Oleg Okun, Matti Pietikäinen Machine Vision Group, Infotech Oulu and

More information

Advances in Manifold Learning Presented by: Naku Nak l Verm r a June 10, 2008

Advances in Manifold Learning Presented by: Naku Nak l Verm r a June 10, 2008 Advances in Manifold Learning Presented by: Nakul Verma June 10, 008 Outline Motivation Manifolds Manifold Learning Random projection of manifolds for dimension reduction Introduction to random projections

More information

The Skorokhod reflection problem for functions with discontinuities (contractive case)

The Skorokhod reflection problem for functions with discontinuities (contractive case) The Skorokhod reflection problem for functions with discontinuities (contractive case) TAKIS KONSTANTOPOULOS Univ. of Texas at Austin Revised March 1999 Abstract Basic properties of the Skorokhod reflection

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS Michael A. Lexa and Don H. Johnson Rice University Department of Electrical and Computer Engineering Houston, TX 775-892 amlexa@rice.edu,

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Jensen-Shannon Divergence and Hilbert space embedding

Jensen-Shannon Divergence and Hilbert space embedding Jensen-Shannon Divergence and Hilbert space embedding Bent Fuglede and Flemming Topsøe University of Copenhagen, Department of Mathematics Consider the set M+ 1 (A) of probability distributions where A

More information

PDEs in Image Processing, Tutorials

PDEs in Image Processing, Tutorials PDEs in Image Processing, Tutorials Markus Grasmair Vienna, Winter Term 2010 2011 Direct Methods Let X be a topological space and R: X R {+ } some functional. following definitions: The mapping R is lower

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

A note on the σ-algebra of cylinder sets and all that

A note on the σ-algebra of cylinder sets and all that A note on the σ-algebra of cylinder sets and all that José Luis Silva CCM, Univ. da Madeira, P-9000 Funchal Madeira BiBoS, Univ. of Bielefeld, Germany (luis@dragoeiro.uma.pt) September 1999 Abstract In

More information

Dimension Reduction. David M. Blei. April 23, 2012

Dimension Reduction. David M. Blei. April 23, 2012 Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do

More information

Multidimensional scaling (MDS)

Multidimensional scaling (MDS) Multidimensional scaling (MDS) Just like SOM and principal curves or surfaces, MDS aims to map data points in R p to a lower-dimensional coordinate system. However, MSD approaches the problem somewhat

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Multivariate class labeling in Robust Soft LVQ

Multivariate class labeling in Robust Soft LVQ Multivariate class labeling in Robust Soft LVQ Petra Schneider, Tina Geweniger 2, Frank-Michael Schleif 3, Michael Biehl 4 and Thomas Villmann 2 - School of Clinical and Experimental Medicine - University

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered

More information

Partial cubes: structures, characterizations, and constructions

Partial cubes: structures, characterizations, and constructions Partial cubes: structures, characterizations, and constructions Sergei Ovchinnikov San Francisco State University, Mathematics Department, 1600 Holloway Ave., San Francisco, CA 94132 Abstract Partial cubes

More information

Supervised locally linear embedding

Supervised locally linear embedding Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,

More information

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1 CSCI5654 (Linear Programming, Fall 2013) Lectures 10-12 Lectures 10,11 Slide# 1 Today s Lecture 1. Introduction to norms: L 1,L 2,L. 2. Casting absolute value and max operators. 3. Norm minimization problems.

More information

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad 40.850: athematical Foundation of Big Data Analysis Spring 206 Lecture 8: inimax Lower Bounds: LeCam, Fano, and Assouad Lecturer: Fang Han arch 07 Disclaimer: These notes have not been subjected to the

More information

Divergence based Learning Vector Quantization

Divergence based Learning Vector Quantization Divergence based Learning Vector Quantization E. Mwebaze 1,2, P. Schneider 2, F.-M. Schleif 3, S. Haase 4, T. Villmann 4, M. Biehl 2 1 Faculty of Computing & IT, Makerere Univ., P.O. Box 7062, Kampala,

More information

A PLANAR SOBOLEV EXTENSION THEOREM FOR PIECEWISE LINEAR HOMEOMORPHISMS

A PLANAR SOBOLEV EXTENSION THEOREM FOR PIECEWISE LINEAR HOMEOMORPHISMS A PLANAR SOBOLEV EXTENSION THEOREM FOR PIECEWISE LINEAR HOMEOMORPHISMS EMANUELA RADICI Abstract. We prove that a planar piecewise linear homeomorphism ϕ defined on the boundary of the square can be extended

More information

ZOBECNĚNÉ PHI-DIVERGENCE A EM-ALGORITMUS V AKUSTICKÉ EMISI

ZOBECNĚNÉ PHI-DIVERGENCE A EM-ALGORITMUS V AKUSTICKÉ EMISI ZOBECNĚNÉ PHI-DIVERGENCE A EM-ALGORITMUS V AKUSTICKÉ EMISI Jan Tláskal a Václav Kůs Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague 11.11.2010 Concepts What the

More information

Randomized Quantization and Optimal Design with a Marginal Constraint

Randomized Quantization and Optimal Design with a Marginal Constraint Randomized Quantization and Optimal Design with a Marginal Constraint Naci Saldi, Tamás Linder, Serdar Yüksel Department of Mathematics and Statistics, Queen s University, Kingston, ON, Canada Email: {nsaldi,linder,yuksel}@mast.queensu.ca

More information