Pattern learning and recognition on statistical manifolds: An information-geometric review


Slide 1: Pattern learning and recognition on statistical manifolds: An information-geometric review
Frank Nielsen, Sony Computer Science Laboratories, Inc.
July 2013, SIMBAD, York, UK. (c) 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

Slide 2: Praise for Computational Information Geometry
- What is Information? The essence of data (datum = "a thing"); make it tangible, e.g., as the parameters of generative models.
- Can we do intrinsic computing? Computing unbiased by any particular data representation: the same results after recoding the data.
- Geometry?! The science of invariance (the mother of Science: compass & ruler, Descartes' analytic/Cartesian coordinates, imaginaries, ...). The open-ended poetic mathematics...

Slide 3: Rationale for Computational Information Geometry
- Information is... never void! It yields lower bounds:
  - Cramér-Rao lower bound and Fisher information (estimation)
  - Bayes error and Chernoff information (classification)
  - Coding and Shannon entropy (communication)
  - Program and Kolmogorov complexity (compression; unfortunately not computable!)
- Geometry: a language (point, line, ball, dimension, orthogonal, projection, geodesic, immersion, etc.) with the power of characterization (e.g., the intersection of two pseudo-segments need not admit a closed-form expression).
- Computing: information computing. Seeking mathematical convenience and mathematical tricks (e.g., kernels). Do you know the space of functions?!? (Infinity and 1-to-1 mappings, language vs. continuum.)

Slide 4: This talk: Information-geometric Pattern Recognition
- Focus on statistical pattern recognition and geometric computing.
- Consider families of probability distributions (parameter spaces) and statistical manifolds.
- Go beyond the traditional Riemannian and Hilbert sphere representations (p ↦ √p).
- Describe dually flat spaces induced by convex functions:
  - Legendre transformation, dual and mixed coordinate systems
  - Dual similarities/divergences
  - Computing-friendly dual affine geodesics

Slide 5: SIMBAD 2013: Information-geometric Pattern Recognition
"By departing from vector-space representations one is confronted with the challenging problem of dealing with (dis)similarities that do not necessarily possess the Euclidean behavior or do not even obey the requirements of a metric. The lack of the Euclidean and/or metric properties undermines the very foundations of traditional pattern recognition theories and algorithms, and poses totally new theoretical/computational questions and challenges."
Thank you to Prof. Edwin Hancock (U. York, UK) and Prof. Marcello Pelillo (U. Venice, IT).

Slide 6: Statistical Pattern Recognition
Model data with distributions (generative models) or stochastic processes:
- parametric (Gaussians, histograms) [model size D],
- semi-parametric (mixtures) [model size kD],
- non-parametric (kernel density estimators [model size n], Dirichlet/Gaussian processes [model size D log n], ...).
Data = Pattern (information) + noise (independent).
Statistical machine learning hot topic: deep learning (restricted Boltzmann machines).

Slide 7: Example I: Information-geometric PR (I)
- Pattern = Gaussian mixture models (a universal class).
- Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
- Invariance: for p_i(x) ∼ N(μ_i, Σ_i) and y = A(x) = Lx + t, we get y_i ∼ N(Lμ_i + t, LΣ_iL^⊤) and D(X_1 : X_2) = D(Y_1 : Y_2) (L: any invertible affine transformation, t: a translation).
Shape Retrieval using Hierarchical Total Bregman Soft Clustering [11], IEEE PAMI.

Slide 8: Example II: Information-geometric PR (II)
- DTI: diffusion ellipsoids, tensor interpolation. Pattern = zero-centered Gaussians.
- Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
- Invariance: D(A^⊤PA : A^⊤QA) = D(P : Q) for A ∈ SL(d) (volume- and orientation-preserving transformations).
(3D rat corpus callosum.) Total Bregman Divergence and its Applications to DTI Analysis [34], IEEE TMI.

Slide 9: Statistical mixtures: Generative models of data sets
- GMM = feature descriptor for information retrieval (IR): classification [21], matching, etc.
- Increase dimension using color image patches.
- Low-frequency information is encoded into a compact statistical model.
- Generative model: a statistical image obtained by GMM sampling.
A mixture Σ_{i=1}^k w_i N(μ_i, Σ_i) is interpreted as a weighted point set in a parameter space: {(w_i, θ_i = (μ_i, Σ_i))}_{i=1}^k.

Slide 10: Statistical invariance
Riemannian structure (M, g) on {p(x; θ) | θ ∈ Θ ⊆ R^D}.
- θ-invariance under non-singular reparameterization: ρ(p(x;θ), p(x;θ')) = ρ(p(x;λ(θ)), p(x;λ(θ'))). E.g., the normal parameterizations (μ, σ) and (μ, σ²) yield the same distance.
- x-invariance under different x-representations, via sufficient statistics (Fisher, 1922): Pr(X = x | T(X) = t, θ) = Pr(X = x | T(X) = t). All information about θ is contained in T: lossless data reduction (exponential families).
- Markov kernel = statistical morphism (Chentsov 1972, [6, 7]). A particular Markov kernel is a deterministic mapping T: X → Y with y = T(x) and p_y = p_x ∘ T⁻¹.
Invariance holds if and only if g is the Fisher information metric.

Slide 11: f-divergences (1960s)
A statistical non-metric distance between two probability measures:
I_f(p : q) = ∫ f(p(x)/q(x)) q(x) dx
where f is a continuous convex function with f(1) = 0, f'(1) = 0, f''(1) = 1.
- Asymmetric (not a metric, except total variation), modulo an affine term.
- Can always be symmetrized using s = f + f*, with f*(x) = x f(1/x).
- Includes many well-known statistical measures: Kullback-Leibler, α-divergences, Hellinger, chi-squared, total variation (TV), etc.
f-divergences are the only statistical divergences that preserve equivalence w.r.t. sufficient statistic mappings: I_f(p : q) ≥ I_f(p_M : q_M), with equality if and only if M = T is sufficient (monotonicity property).

Slide 12: Information geometry: Dually flat spaces, α-geometry
- Statistical invariance is also obtained using (M, g, ∇, ∇*), where ∇ and ∇* are dual affine connections.
- The Riemannian structure (M, g) is the particular self-dual case ∇ = ∇* = ∇^(0), the Levi-Civita connection: (M, g) ≡ (M, g, ∇^(0), ∇^(0)).
- Dually flat spaces (α = ±1) are algorithmically friendly:
  - Statistical mixtures of exponential families
  - Learning & simplifying mixtures (k-MLE)
  - Bregman Voronoi diagrams & dual triangulations
Goal: algorithmics of Gaussians/histograms w.r.t. the Kullback-Leibler divergence.

Slide 13: Exponential Family Mixture Models (EFMMs)
Generalize Gaussian & Rayleigh MMs to many usual distributions:
m(x) = Σ_{i=1}^k w_i p_F(x; λ_i), with w_i > 0 ∀i and Σ_{i=1}^k w_i = 1,
p_F(x; λ) = e^{⟨t(x), θ⟩ − F(θ) + k(x)}
F: the log-Laplace transform (partition/cumulant function):
F(θ) = log ∫_{x∈X} e^{⟨t(x), θ⟩ + k(x)} dx,
θ ∈ Θ = {θ | ∫_{x∈X} e^{⟨t(x), θ⟩ + k(x)} dx < ∞}, the natural parameter space.
d: dimension of the support X. D: order of the family (= dim Θ). Sufficient statistic: t(x): R^d → R^D.

Slide 14: Statistical mixtures: Rayleigh MMs [33, 22]
IntraVascular UltraSound (IVUS) imaging. Rayleigh distribution:
p(x; λ) = (x/λ²) e^{−x²/(2λ²)}, x ∈ R⁺
d = 1 (univariate), D = 1 (order 1),
θ = −1/(2λ²), Θ = (−∞, 0),
F(θ) = −log(−2θ), t(x) = x², k(x) = log x. (A Weibull distribution with shape k = 2.)
Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs) for segmentation and classification tasks.
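
The canonical decomposition above can be checked numerically. Below is a minimal sketch (not from the talk; the function names are ours) verifying that the exponential-family form reproduces the Rayleigh density:

```python
import math

# Rayleigh density p(x; lam) = (x / lam^2) exp(-x^2 / (2 lam^2)), x > 0.
def rayleigh_pdf(x, lam):
    return (x / lam**2) * math.exp(-x**2 / (2 * lam**2))

# Canonical EF decomposition from the slide:
# theta = -1/(2 lam^2), t(x) = x^2, k(x) = log x, F(theta) = -log(-2 theta).
def rayleigh_pdf_ef(x, lam):
    theta = -1.0 / (2.0 * lam**2)
    F = -math.log(-2.0 * theta)                      # log-normalizer F(theta)
    return math.exp(x**2 * theta - F + math.log(x))  # <t(x),theta> - F(theta) + k(x)

assert abs(rayleigh_pdf(1.3, 0.7) - rayleigh_pdf_ef(1.3, 0.7)) < 1e-12
```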

Slide 15: Statistical mixtures: Gaussian MMs [8, 22, 9]
Gaussian mixture models (GMMs) model low-frequency content; a color image is interpreted as a 5D xyRGB point set.
Gaussian density:
p(x; μ, Σ) = (1/((2π)^{d/2} √|Σ|)) e^{−½ D_{Σ⁻¹}(x−μ, x−μ)}
with the squared Mahalanobis distance D_Q(x, y) = (x−y)^⊤ Q (x−y).
x ∈ R^d (multivariate), D = d(d+3)/2 (order),
θ = (Σ⁻¹μ, ½Σ⁻¹) = (θ_v, θ_M), Θ = R^d × S₊₊^d,
F(θ) = ¼ θ_v^⊤ θ_M⁻¹ θ_v − ½ log|θ_M| + (d/2) log π,
t(x) = (x, −x x^⊤), k(x) = 0.

Slide 16: Sampling from a Gaussian Mixture Model
To sample a variate x from a GMM:
- Choose a component l according to the weight distribution w_1, ..., w_k,
- Draw a variate x according to N(μ_l, Σ_l).
Sampling is a doubly stochastic process:
- throw a biased die with k faces to choose the component: l ∼ Multinomial(w_1, ..., w_k) (the multinomial is also an EF: the normalized histogram),
- then draw at random a variate x from the l-th component: x ∼ N(μ_l, Σ_l), via x = μ_l + Cz with the Cholesky decomposition Σ_l = CC^⊤ and z = [z_1 ... z_d]^⊤ a standard normal random vector: z_i = √(−2 log U_1) cos(2π U_2) (Box-Muller).
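
The doubly stochastic procedure translates directly into code. A minimal sketch in plain Python (no library assumed; the helper sample_gmm and its argument layout are our own), using the biased die, the Cholesky factors, and the Box-Muller transform exactly as above:

```python
import math, random

def sample_gmm(weights, mus, Ls):
    """Draw one variate from a GMM; Ls[l] is a Cholesky factor C with Sigma_l = C C^T."""
    # 1) Throw the biased k-face die: l ~ Multinomial(w_1, ..., w_k).
    u, l, acc = random.random(), 0, weights[0]
    while u > acc:
        l += 1
        acc += weights[l]
    # 2) Standard normal vector via Box-Muller: z_i = sqrt(-2 log U1) cos(2 pi U2).
    d = len(mus[l])
    z = [math.sqrt(-2.0 * math.log(1.0 - random.random())) *
         math.cos(2.0 * math.pi * random.random()) for _ in range(d)]
    # 3) x = mu_l + C z.
    return [mus[l][j] + sum(Ls[l][j][m] * z[m] for m in range(d)) for j in range(d)]

# Example: two bivariate components with diagonal covariances.
x = sample_gmm([0.3, 0.7],
               [[0.0, 0.0], [5.0, 5.0]],
               [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 0.5]]])
```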

Slide 17: Relative entropy for exponential families
Distance between features (e.g., GMMs): the Kullback-Leibler divergence (cross-entropy minus entropy):
KL(P : Q) = ∫ p(x) log(p(x)/q(x)) dx ≥ 0
= ∫ p(x) log(1/q(x)) dx − ∫ p(x) log(1/p(x)) dx = H^×(P : Q) − H(P),
with H^×(P : Q) the cross-entropy and H(P) = H^×(P : P) the entropy.
For two members of the same exponential family:
KL(P : Q) = F(θ_Q) − F(θ_P) − ⟨θ_Q − θ_P, ∇F(θ_P)⟩ = B_F(θ_Q : θ_P),
the Bregman divergence B_F defined for a strictly convex and differentiable function (up to affine terms).
The proof that KL(P : Q) = B_F(θ_Q : θ_P) follows from ∇F(θ) = E[t(X)].
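
As a sanity check of KL(P : Q) = B_F(θ_Q : θ_P), the following sketch (ours, assuming the univariate Gaussian log-normalizer F(θ) = −θ₁²/(4θ₂) + ½ log(−π/θ₂)) compares the Bregman form against the classical closed-form Gaussian KL; note the swapped argument order:

```python
import math

def nat(mu, s2):                       # natural parameters theta of N(mu, s2)
    return (mu / s2, -1.0 / (2.0 * s2))

def F(th):                             # log-normalizer of the univariate Gaussian EF
    t1, t2 = th
    return -t1 * t1 / (4.0 * t2) + 0.5 * math.log(-math.pi / t2)

def gradF(th):                         # eta = grad F(theta) = (E[x], E[x^2])
    t1, t2 = th
    return (-t1 / (2.0 * t2), t1 * t1 / (4.0 * t2 * t2) - 1.0 / (2.0 * t2))

def bregman(tp, tq):                   # B_F(tp : tq)
    g = gradF(tq)
    return F(tp) - F(tq) - sum((a - b) * gi for a, b, gi in zip(tp, tq, g))

p, q = (0.0, 1.0), (1.0, 4.0)          # P = N(0, 1), Q = N(1, 4)
kl_breg   = bregman(nat(*q), nat(*p))  # KL(P : Q) = B_F(theta_Q : theta_P)
kl_closed = math.log(math.sqrt(q[1] / p[1])) \
            + (p[1] + (p[0] - q[0])**2) / (2.0 * q[1]) - 0.5
assert abs(kl_breg - kl_closed) < 1e-12
```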

Slide 18: Convex duality: Legendre transformation
For a strictly convex and differentiable function F: X → R:
F*(y) = sup_{x∈X} {⟨y, x⟩ − F(x)}, with l_F(y; x) := ⟨y, x⟩ − F(x).
The maximum is obtained for y = ∇F(x):
∇_x l_F(y; x) = y − ∇F(x) = 0 ⟺ y = ∇F(x),
and is unique by the strict convexity of F (∇²F ≻ 0):
∇²_x l_F(y; x) = −∇²F(x) ≺ 0.
Convex conjugates: (F, X) ↔ (F*, Y), with Y = {∇F(x) | x ∈ X}.

Slide 19: Legendre duality: Geometric interpretation
Consider the epigraph of F as a convex object: convex hull (V-representation) versus intersection of half-spaces (H-representation).
Dual coordinate systems: a point P = (x_P, F(x_P)) on the graph F: x ↦ (x, F(x)) corresponds to the tangent line
H_P: z = (x − x_P) F'(x_P) + F(x_P),
with slope y_P = F'(x_P) and intercept (0, F(x_P) − x_P F'(x_P)) = (0, −F*(y_P)).
The Legendre transform is also called the slope transform.

Slide 20: Legendre duality & canonical divergence
- Convex conjugates have functionally inverse gradients: ∇F* = (∇F)⁻¹. ∇F* may require numerical approximation (not always available in analytic closed form).
- Involution: (F*)* = F, with ∇F* = (∇F)⁻¹.
- The convex conjugate F* is expressed using (∇F)⁻¹:
F*(y) = ⟨x, y⟩ − F(x) with x = ∇F*(y), i.e., F*(y) = ⟨(∇F)⁻¹(y), y⟩ − F((∇F)⁻¹(y)).
- The Fenchel-Young inequality is at the heart of the canonical divergence:
F(x) + F*(y) ≥ ⟨x, y⟩,
A_F(x : y) = A_{F*}(y : x) = F(x) + F*(y) − ⟨x, y⟩ ≥ 0.
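
A small numerical illustration (ours) of the conjugate-via-gradient-inverse construction and of the Fenchel-Young inequality, using the KL generator F(x) = x log x − x from slide 27, whose conjugate works out to F*(y) = e^y:

```python
import math

F     = lambda x: x * math.log(x) - x      # the KL generator (see slide 27)
gradF = lambda x: math.log(x)
invgF = lambda y: math.exp(y)              # (grad F)^{-1} = grad F*

def Fstar(y):                              # F*(y) = <(grad F)^{-1}(y), y> - F((grad F)^{-1}(y))
    x = invgF(y)
    return x * y - F(x)                    # here this simplifies to exp(y)

A = lambda x, y: F(x) + Fstar(y) - x * y   # canonical divergence A_F(x : y) >= 0
print(A(2.0, math.log(2.0)))               # ~0.0: equality iff y = grad F(x)
print(A(2.0, 0.0))                         # 2 log 2 - 1 > 0
```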

Slide 21: Dual Bregman divergences & canonical divergence [26]
KL(P : Q) = E_P[log(p(x)/q(x))] ≥ 0
= B_F(θ_Q : θ_P) = B_{F*}(η_P : η_Q)
= F(θ_Q) + F*(η_P) − ⟨θ_Q, η_P⟩
= A_F(θ_Q : η_P) = A_{F*}(η_P : θ_Q)
with θ_Q the natural parameterization and η_P = E_P[t(x)] = ∇F(θ_P) the moment parameterization.
Shannon cross-entropy and entropy of an exponential family [26], with KL(P : Q) = H^×(P : Q) − H(P):
H^×(P : Q) = F(θ_Q) − ⟨θ_Q, ∇F(θ_P)⟩ − E_P[k(x)]
H(P) = F(θ_P) − ⟨θ_P, ∇F(θ_P)⟩ − E_P[k(x)] = −F*(η_P) − E_P[k(x)]

Slide 22: Bregman divergence: Geometric interpretation (I)
Potential function F, graph plot F: x ↦ (x, F(x)).
D_F(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩
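
The definition translates into a three-line function. A minimal sketch (ours), instantiated with the two generators that recover the squared Euclidean distance and the extended Kullback-Leibler divergence:

```python
import math

def bregman(F, gradF, p, q):
    """B_F(p : q) = F(p) - F(q) - <p - q, grad F(q)>."""
    g = gradF(q)
    return F(p) - F(q) - sum((pi - qi) * gi for pi, qi, gi in zip(p, q, g))

# F(x) = 1/2 ||x||^2 gives half the squared Euclidean distance:
half_sq  = lambda x: 0.5 * sum(v * v for v in x)
identity = lambda x: list(x)
print(bregman(half_sq, identity, [1.0, 2.0], [0.0, 0.0]))    # 2.5 = ||p - q||^2 / 2

# F(x) = sum_i x_i log x_i - x_i gives the (extended) Kullback-Leibler divergence:
negent = lambda x: sum(v * math.log(v) - v for v in x)
logs   = lambda x: [math.log(v) for v in x]
print(bregman(negent, logs, [0.2, 0.8], [0.5, 0.5]))         # KL([.2,.8] : [.5,.5])
```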

Slide 23: Bregman divergence: Geometric interpretation (III)
Bregman divergence and path integrals, with η = ∇F(θ):
B_F(θ₁ : θ₂) = F(θ₁) − F(θ₂) − ⟨θ₁ − θ₂, ∇F(θ₂)⟩   (1)
= ∫_{θ₂}^{θ₁} ⟨∇F(t) − ∇F(θ₂), dt⟩                  (2)
= ∫_{η₁}^{η₂} ⟨∇F*(t) − ∇F*(η₁), dt⟩                 (3)
= B_{F*}(η₂ : η₁)                                    (4)

Slide 24: Bregman divergence: Geometric interpretation (II)
Potential function f, graph plot F: x ↦ (x, f(x)).
B_f(p‖q) = f(p) − f(q) − (p − q) f'(q)
B_f(·‖q) is the vertical distance between the hyperplane H_q tangent to F at the lifted point q̂, and the translated hyperplane at p̂.
(Figure: graph F over X with lifted points p̂, q̂, the tangent hyperplane H_q, and the gap B_f(p‖q).)

Slide 25: Total Bregman divergence (tBD)
By analogy to least squares vs. total least squares: the total Bregman divergence (tBD) [11, 34, 12]
δ_f(x, y) = b_f(x, y) / √(1 + ‖∇f(y)‖²)
Proved statistical robustness of tBD.

Slide 26: Bregman sided centroids [25, 21]
Bregman centroids = unique minimizers of average Bregman divergences (B_F is convex in its right argument):
θ̄^R = argmin_θ (1/n) Σ_{i=1}^n B_F(θ_i : θ)
θ̄^L = argmin_θ (1/n) Σ_{i=1}^n B_F(θ : θ_i)
θ̄^R = (1/n) Σ_{i=1}^n θ_i, the center of mass, independent of F;
θ̄^L = (∇F)⁻¹((1/n) Σ_{i=1}^n ∇F(θ_i)),
a generalized Kolmogorov-Nagumo f-mean.
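
Both sided centroids are one-liners once ∇F and its inverse are available; a minimal sketch (ours, with hypothetical helper names):

```python
import math

def right_centroid(thetas):
    """argmin over theta of (1/n) sum_i B_F(theta_i : theta): center of mass, for any F."""
    n = len(thetas)
    return [sum(col) / n for col in zip(*thetas)]

def left_centroid(thetas, gradF, invgradF):
    """argmin over theta of (1/n) sum_i B_F(theta : theta_i)
    = (grad F)^{-1}((1/n) sum_i grad F(theta_i))."""
    n = len(thetas)
    grads = [gradF(t) for t in thetas]
    return invgradF([sum(col) / n for col in zip(*grads)])

# For F = sum x log x - x (the KL generator), the left-sided centroid is the
# coordinate-wise geometric mean (cf. the table on slide 27):
gradF    = lambda t: [math.log(v) for v in t]
invgradF = lambda e: [math.exp(v) for v in e]
print(left_centroid([[1.0, 8.0], [4.0, 2.0]], gradF, invgradF))  # [2.0, 4.0]
```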

Slide 27: Bregman divergences B_F and f-means
Bijection: quasi-arithmetic means (∇F) ↔ Bregman divergences B_F.

Bregman divergence B_F (entropy/loss F) | F | f = F' | f⁻¹ = (F')⁻¹ | f-mean (generalized mean)
Squared Euclidean distance (half squared loss) | ½x² | x | x | arithmetic mean (1/n) Σ_{j=1}^n x_j
Kullback-Leibler divergence (ext. neg. Shannon entropy) | x log x − x | log x | exp x | geometric mean (Π_{j=1}^n x_j)^{1/n}
Itakura-Saito divergence (Burg entropy) | −log x | −1/x | −1/x | harmonic mean n / Σ_{j=1}^n (1/x_j)

F' is strictly increasing (like a cumulative distribution function).

Slide 28: Bregman sided centroids [25]
The two sided centroids C̄ and C̄* expressed in the two θ/η coordinate systems: 4 equations.
C̄: (θ̄, η̄) with θ̄ = (1/n) Σ_{i=1}^n θ_i and η̄ = ∇F(θ̄);
C̄*: (θ̄*, η̄*) with η̄* = (1/n) Σ_{i=1}^n η_i and θ̄* = ∇F*(η̄*).

Slide 29: Bregman centroids and Jacobian fields
The centroid θ̄ zeroes the left-sided Jacobian vector field:
∇_θ (Σ_t w_t A_F(θ : η_t)) = 0,
i.e., the sum of tangent vectors of the geodesics from the centroid to the points vanishes. Since ∇_θ A_F(θ : η') = η − η', we get
∇_θ (Σ_t w_t A_F(θ : η_t)) = η − η̄, since Σ_t w_t = 1, with η̄ = Σ_t w_t η_t.

Slide 30: Bregman information [25]
Bregman information = the minimum of the loss function:
I_F(P) = (1/n) Σ_{i=1}^n B_F(θ_i : θ̄)
= (1/n) Σ_{i=1}^n [F(θ_i) − F(θ̄) − ⟨θ_i − θ̄, ∇F(θ̄)⟩]
= (1/n) Σ_{i=1}^n F(θ_i) − F(θ̄)   [since (1/n) Σ_{i=1}^n ⟨θ_i − θ̄, ∇F(θ̄)⟩ = 0]
= J_F(θ₁, ..., θ_n),
the Jensen diversity index (e.g., Jensen-Shannon for F(x) = x log x).
- For the squared Euclidean distance, Bregman information = cluster variance.
- For the Kullback-Leibler divergence, Bregman information is related to mutual information.

Slide 31: Bregman k-means clustering [4]
Bregman k-means: find k centers C = {C₁, ..., C_k} that minimize the loss function:
L_F(P : C) = Σ_{P∈P} B_F(P : C), where B_F(P : C) = min_{i∈{1,...,k}} B_F(P : C_i).
This generalizes Lloyd's quadratic error in Vector Quantization (VQ).
L_F(P : C) = I_F(P) − I_F(C), where
I_F(P) is the total Bregman information, I_F(C) the between-cluster Bregman information, and L_F(P : C) the within-cluster Bregman information:
total Bregman information = within-cluster Bregman information + between-cluster Bregman information.

Slide 32: Bregman k-means clustering [4]
I_F(P) = L_F(P : C) + I_F(C): Bregman clustering amounts to finding the partition C* that minimizes the information loss:
L_F* = L_F(P : C*) = min_C (I_F(P) − I_F(C)).
Bregman k-means (see the sketch below):
- Initialize with distinct seeds: C₁ = P₁, ..., C_k = P_k.
- Repeat until convergence:
  - Assign each point P to its closest centroid: C_i = {P ∈ P | B_F(P : C_i) ≤ B_F(P : C_j) ∀ j ≠ i}.
  - Update the cluster centroids by taking their centers of mass: C_i = (1/|C_i|) Σ_{P∈C_i} P.
The loss function monotonically decreases and converges to a local optimum. (Extend to weighted point sets using barycenters.)
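
A minimal sketch of the loop (ours; not the reference implementation of [4]), for points given as tuples of coordinates:

```python
def bregman_kmeans(points, k, F, gradF, iters=100):
    """Lloyd-style Bregman k-means in the spirit of Banerjee et al. [4]."""
    def B(p, c):                          # B_F(p : c)
        g = gradF(c)
        return F(p) - F(c) - sum((pi - ci) * gi for pi, ci, gi in zip(p, c, g))
    centers = [list(p) for p in points[:k]]             # distinct seeds
    for _ in range(iters):
        # Assignment step: each point goes to its closest center (right argument).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: B(p, centers[i]))
            clusters[i].append(p)
        # Update step: centroids are plain centers of mass, whatever F is.
        new = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:                               # converged
            break
        centers = new
    return centers
```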

Slide 33: Bregman k-means++ [1]: Careful seeding (only?!)
(Also called Bregman k-medians when minimizing min_i B_F(p_i : x).)
Extend the D²-initialization of k-means++: the seeding stage alone yields a probabilistically guaranteed global approximation factor.
Bregman k-means++ (sketched below):
- Choose C = {C_l} for l uniformly random in {1, ..., n}.
- While |C| < k: choose P ∈ P with probability B_F(P : C) / Σ_{i=1}^n B_F(P_i : C) = B_F(P : C) / L_F(P : C).
Yields an O(log k) approximation factor (with high probability). The constant in O(·) depends on the ratio of the min/max of ∇²F.
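
A sketch of the seeding stage alone (ours), taking any Bregman divergence B as a callable:

```python
import random

def bregman_kmeanspp_seed(points, k, B):
    """Bregman k-means++ seeding [1]: D^2-style sampling with B_F
    playing the role of the squared Euclidean distance."""
    seeds = [random.choice(points)]                      # first seed uniformly at random
    while len(seeds) < k:
        # B_F(P : C): divergence of each point to its closest current seed.
        d = [min(B(p, s) for s in seeds) for p in points]
        total = sum(d)
        r, acc = random.random() * total, 0.0
        for p, di in zip(points, d):                     # sample proportionally to d
            acc += di
            if acc >= r:
                seeds.append(p)
                break
    return seeds
```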

Slide 34: Anisotropic Voronoi diagram (for MVN MMs) [10, 14]
Learning mixtures: talk of O. Schwander on EM, k-MLE.
From the source color image (a), we build a 5D GMM with k = 32 components, and color each pixel with the mean color of the anisotropic Voronoi cell it belongs to (a weighted squared Mahalanobis distance per center).
log p_F(x; θ_i) = −B_{F*}(t(x) : η_i) + F*(t(x)) + k(x)
Most-likely-component segmentation ≡ Bregman Voronoi diagram (squared Mahalanobis distances for Gaussians).

Slide 35: Voronoi diagrams in dually flat spaces...
Voronoi diagram and its dual Delaunay triangulation (points in general position).

Slide 36: Bregman dual bisectors: Hyperplanes & hypersurfaces [5, 24, 27]
Right-sided bisector: a hyperplane (θ-hyperplane):
H_F(p, q) = {x ∈ X | B_F(x : p) = B_F(x : q)}
H_F: ⟨∇F(p) − ∇F(q), x⟩ + (F(p) − F(q) + ⟨q, ∇F(q)⟩ − ⟨p, ∇F(p)⟩) = 0
Left-sided bisector: a hypersurface (η-hyperplane):
H'_F(p, q) = {x ∈ X | B_F(p : x) = B_F(q : x)}
H'_F: ⟨∇F(x), q − p⟩ + F(p) − F(q) = 0

Slide 37: Visualizing Bregman bisectors
(Figure: Bregman bisectors drawn in the primal/source coordinates θ (natural parameters) and in the dual/gradient coordinates η (expectation parameters), for the logistic loss (Bernoulli dual) and for the Itakura-Saito divergence, showing the asymmetric values D(p, q) ≠ D(q, p).)

Slide 38: Bregman Voronoi diagrams as minimization diagrams [5]
A subclass of affine diagrams which have all non-empty cells.
Minimization diagram of the n functions
D_i(x) = B_F(x : p_i) = F(x) − F(p_i) − ⟨x − p_i, ∇F(p_i)⟩
≡ minimization diagram of the n linear functions H_i(x) = ⟨p_i − x, ∇F(p_i)⟩ − F(p_i) (the common term F(x) cancels).

Slide 39: Bregman dual Delaunay triangulations
(Figures: ordinary Delaunay, exponential, and Hellinger-like Delaunay triangulations.)
- Empty Bregman sphere property; geodesic triangles.
- BVDs extend Euclidean Voronoi diagrams with similar complexity/algorithms.

Slide 40: Non-commutative Bregman orthogonality
3-point property (generalized law of cosines):
B_F(p : r) = B_F(p : q) + B_F(q : r) − ⟨p − q, ∇F(r) − ∇F(q)⟩
(Figure: geodesics Γ_PQ and Γ_QR with divergences D_F(P : R), D_F(P : Q), D_F(Q : R).)
(pq)_θ is Bregman-orthogonal to (qr)_η iff
B_F(p : r) = B_F(p : q) + B_F(q : r)
(equivalent to ⟨θ_p − θ_q, η_r − η_q⟩ = 0).
This extends the Pythagorean theorem. ⊥_F is not commutative, except in the squared Euclidean/Mahalanobis case.

Slide 41: Dually orthogonal Bregman Voronoi & triangulations
The ordinary Voronoi diagram is perpendicular to its Delaunay triangulation.
Dual line segment geodesics:
(pq)_θ = {θ = λθ_p + (1−λ)θ_q | λ ∈ [0, 1]}
(pq)_η = {η = λη_p + (1−λ)η_q | λ ∈ [0, 1]}
Bisectors:
B_θ(p, q): ⟨x, θ_q − θ_p⟩ + F(θ_p) − F(θ_q) = 0
B_η(p, q): ⟨x, η_q − η_p⟩ + F*(η_p) − F*(η_q) = 0
Dual orthogonality:
B_η(p, q) ⊥ (pq)_η, (pq)_θ ⊥ B_θ(p, q)

Slide 42: Dually orthogonal Bregman Voronoi & triangulations
B_η(p, q) ⊥ (pq)_η, (pq)_θ ⊥ B_θ(p, q)

Slide 43: Simplifying mixtures: Kullback-Leibler projection theorem
For an exponential family mixture model p̃ = Σ_{i=1}^k w_i p_F(x; θ_i), the right-sided KL barycenter p* of the components is interpreted as the projection of the mixture model p̃ ∈ P onto the exponential family manifold E_F [32]:
p* = argmin_{p ∈ E_F} KL(p̃ : p)
(Figure: the manifold P of probability distributions containing the exponential family sub-manifold E_F; p̃ is joined to p* by an m-geodesic, with the e-geodesic lying inside E_F.)
Right-sided KL centroid = left-sided Bregman centroid.

Slide 44: A simple proof
argmin_θ KL(m(x) : p_θ(x)):
KL(m : p_θ) = E_m[log m] − E_m[log p_θ] = E_m[log m] − E_m[k(x)] − E_m[⟨t(x), θ⟩ − F(θ)]
≡ argmax_θ E_m[⟨t(x), θ⟩] − F(θ) = argmax_θ ⟨E_m[t(x)], θ⟩ − F(θ)
E_m[t(x)] = Σ_t w_t E_{p_{θ_t}}[t(x)] = Σ_t w_t η_t = η̄
The maximum is attained when ∇F(θ) = η̄, i.e., θ = ∇F*(η̄).

Slide 45: Left-sided or right-sided Kullback-Leibler centroids?
Left/right Bregman centroids = right/left entropic centroids (for KL of exponential families).
Left-sided and right-sided centroids have different (statistical) properties:
- Right-sided entropic centroid: zero-avoiding (covers the support of the pdfs),
- Left-sided entropic centroid: zero-forcing (captures the highest mode).
(Figure: for N₁ = N(−4, 4) and N₂ = N(5, 0.64), the left-sided KL centroid (zero-forcing), the right-sided KL centroid (zero-avoiding), and the symmetrized centroid.)

Slide 46: Hierarchical clustering of GMMs (Burbea-Rao)
Hierarchical clustering of GMMs w.r.t. the Bhattacharyya distance: simplify the number of components of an initial GMM.
(Figure: (a) source, (b) k = 48, (c) k = 16.)

Slide 47: Two symmetrizations of Bregman divergences
Jeffreys-Bregman divergences:
S_F(p; q) = (B_F(p : q) + B_F(q : p))/2 = ½ ⟨p − q, ∇F(p) − ∇F(q)⟩
Jensen-Bregman divergences (diversity index):
J_F(p; q) = (B_F(p : (p+q)/2) + B_F(q : (p+q)/2))/2 = (F(p) + F(q))/2 − F((p+q)/2) = BR_F(p, q)
Skew Jensen divergences [21, 29]:
J_F^(α)(p; q) = αF(p) + (1−α)F(q) − F(αp + (1−α)q) = BR_F^(α)(p; q)
(Jeffreys and Jensen-Shannon symmetrizations of the Kullback-Leibler divergence.)
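
The skew Jensen divergence is straightforward to code; a minimal sketch (ours), recovering the Jensen-Shannon divergence for α = 1/2 with the Shannon negentropy generator:

```python
import math

def skew_jensen(F, p, q, alpha=0.5):
    """J_F^(alpha)(p ; q) = alpha F(p) + (1 - alpha) F(q) - F(alpha p + (1 - alpha) q)."""
    mix = [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]
    return alpha * F(p) + (1.0 - alpha) * F(q) - F(mix)

# With F the Shannon negentropy and alpha = 1/2, this is the Jensen-Shannon divergence:
negent = lambda x: sum(v * math.log(v) for v in x)
print(skew_jensen(negent, [0.2, 0.8], [0.5, 0.5]))   # ~0.0506
```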

Slide 48: Burbea-Rao centroids (α-skewed Jensen centroids)
Minimum average divergence:
OPT: c = argmin_x Σ_{i=1}^n w_i J_F^(α)(x; p_i) = argmin_x L(x)
Equivalent to minimizing
E(c) = (Σ_{i=1}^n w_i α) F(c) − Σ_{i=1}^n w_i F(αc + (1−α)p_i).
E = F + G is the sum of a convex function F and a concave function G: apply the Convex-ConCave Procedure (CCCP). Start from an arbitrary c₀ and iteratively update as
∇F(c_{t+1}) = −∇G(c_t),
with guaranteed convergence to a local minimum.

Slide 49: ConCave Convex Procedure (CCCP)
min_x E(x) = F(x) + G(x), with F convex and G concave:
∇F(c_{t+1}) = −∇G(c_t)

Slide 50: Iterative algorithm for Burbea-Rao centroids
Apply the CCCP scheme:
∇F(c_{t+1}) = Σ_{i=1}^n w_i ∇F(αc_t + (1−α)p_i)
c_{t+1} = (∇F)⁻¹(Σ_{i=1}^n w_i ∇F(αc_t + (1−α)p_i))
This yields arbitrarily fine approximations of the (skew) Burbea-Rao centroids and barycenters (sketch below).
- Unique GLOBAL minimum when the divergence is separable [21].
- Unique GLOBAL minimum for matrix means [23] for the log-det divergence.
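
A sketch of the fixed-point iteration (ours; the initialization and iteration count are arbitrary choices, not prescribed by the slide):

```python
import math

def skew_jensen_centroid(points, weights, gradF, invgradF, alpha=0.5, iters=100):
    """CCCP fixed point of slide 50:
    c_{t+1} = (grad F)^{-1}( sum_i w_i grad F(alpha c_t + (1 - alpha) p_i) )."""
    # Initialize with the weighted arithmetic mean.
    c = [sum(w * p[j] for w, p in zip(weights, points)) for j in range(len(points[0]))]
    for _ in range(iters):
        g = [gradF([alpha * cj + (1.0 - alpha) * pj for cj, pj in zip(c, p)])
             for p in points]
        avg = [sum(w * gi[j] for w, gi in zip(weights, g)) for j in range(len(c))]
        c = invgradF(avg)
    return c

# Example with F(x) = sum x log x (Shannon negentropy): grad F = 1 + log, inverse = exp(y - 1).
gradF    = lambda x: [1.0 + math.log(v) for v in x]
invgradF = lambda y: [math.exp(v - 1.0) for v in y]
print(skew_jensen_centroid([[0.2, 0.8], [0.6, 0.4]], [0.5, 0.5], gradF, invgradF))
```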

Slide 51: Information-geometric computing on statistical manifolds
- Cramér-Rao Lower Bound and Information Geometry [13] (overview article)
- Chernoff information on the statistical exponential family manifold [19]
- Hypothesis testing and Bregman Voronoi diagrams [18]
- Jeffreys centroids [20]
- Learning mixtures with k-MLE [17]
- Closed-form divergences for statistical mixtures [16]

Slide 52: Summary
Information-geometric pattern recognition:
- Statistical manifold (M, g): Rao's distance and Fisher-Rao curved Riemannian geometry.
- Statistical manifold (M, g, ∇, ∇*): dually flat spaces, Bregman divergences; geodesics are straight lines in either the θ- or the η-parameter space. ±1-geometry, a special case of α-geometry.
- Clustering in dually flat spaces.
- Software libraries: jMEF [8] (Java), pyMEF [31] (Python).
... but also many other geometries to explore: Hilbertian, Finsler [3], Kähler, Wasserstein, contact, symplectic, etc. (It is easy to require non-Euclidean geometry, but then the space is wide open!)

Slide 53: Edited book (MIG) and proceedings (GSI)

Slide 54: Thank you.

Slide 55: Exponential families & statistical distances
- Universal density estimators [2] generalizing Gaussians/histograms (a single EF density can approximate any smooth density).
- Explicit formulas for the Shannon entropy, cross-entropy, and Kullback-Leibler divergence [26]; Rényi/Tsallis entropies and divergences [28]; Sharma-Mittal entropy and divergence [30], a 2-parameter family extending the extensive Rényi (for β → 1) and non-extensive Tsallis (for β → α) entropies:
H_{α,β}(p) = (1/(1−β)) ((∫ p(x)^α dx)^{(1−β)/(1−α)} − 1), with α > 0, α ≠ 1, β ≠ 1.
- Skew Jensen and Burbea-Rao divergences [21]; Chernoff information and divergence [15].
- Mixtures: total least squares, Jensen-Rényi, and Cauchy-Schwarz divergences [16].

Slide 56: Statistical invariance: Markov kernel
Probability family: p(x; θ). Let (X, σ) and (X', σ') be two measurable spaces, where σ is a σ-algebra on X (non-empty, closed under complementation and countable unions).
A Markov kernel = a transition probability kernel K: X × σ' → [0, 1] such that:
- for all E' ∈ σ', K(·, E') is a measurable map,
- for all x ∈ X, K(x, ·) is a probability measure on (X', σ').
A probability measure p on (X, σ) induces the probability measure Kp, with
Kp(E') = ∫_X K(x, E') p(dx), ∀E' ∈ σ'.

Slide 57: Space of Bregman spheres and Bregman balls [5]
Dual Bregman balls (bounding Bregman spheres):
Ball_F^r(c, r) = {x ∈ X | B_F(x : c) ≤ r}
Ball_F^l(c, r) = {x ∈ X | B_F(c : x) ≤ r}
Legendre duality: Ball_F^l(c, r) = (∇F)⁻¹(Ball_{F*}^r(∇F(c), r)).
(Illustration for the Itakura-Saito divergence, F(x) = −log x.)

Slide 58: Space of Bregman spheres: Lifting map [5]
F: x ↦ x̂ = (x, F(x)), a hypersurface in R^{d+1}; H_p: the tangent hyperplane at p̂, z = H_p(x) = ⟨x − p, ∇F(p)⟩ + F(p).
- A Bregman sphere σ lifts to σ̂ with supporting hyperplane H_σ: z = ⟨x − c, ∇F(c)⟩ + F(c) + r (parallel to H_c and shifted vertically by r); σ̂ = F ∩ H_σ.
- Conversely, the intersection of any hyperplane H with F projects onto X as a Bregman sphere:
H: z = ⟨x, a⟩ + b ⟹ σ: Ball_F(c = (∇F)⁻¹(a), r = ⟨a, c⟩ − F(c) + b).

Slide 59: Bibliographic references I
[1] Marcel R. Ackermann and Johannes Blömer. Bregman clustering for separable instances. In Scandinavian Workshop on Algorithm Theory (SWAT).
[2] Yasemin Altun, Alexander J. Smola, and Thomas Hofmann. Exponential families for conditional random fields. In Uncertainty in Artificial Intelligence (UAI), pages 2-9.
[3] Marc Arnaudon and Frank Nielsen. Medians and means in Finsler geometry. LMS Journal of Computation and Mathematics, 15.
[4] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6.
[5] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2).
[6] Nikolai Nikolaevich Chentsov. Statistical Decision Rules and Optimal Inferences. Translations of Mathematical Monographs, volume 53. Published in Russian in 1972.

Slide 60: Bibliographic references II
[7] José Manuel Corcuera and Federica Giummolé. A characterization of monotone and regular divergences. Annals of the Institute of Statistical Mathematics, 50(3).
[8] Vincent Garcia and Frank Nielsen. Simplification and hierarchical representations of mixtures of exponential families. Signal Processing (Elsevier), 90(12).
[9] Vincent Garcia, Frank Nielsen, and Richard Nock. Levels of details for Gaussian mixture models. In Asian Conference on Computer Vision (ACCV), volume 2.
[10] François Labelle and Jonathan Richard Shewchuk. Anisotropic Voronoi diagrams and guaranteed-quality anisotropic mesh generation. In Proceedings of the Nineteenth Annual Symposium on Computational Geometry (SCG '03), New York, NY, USA. ACM.
[11] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12).
[12] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to shape retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

Slide 61: Bibliographic references III
[13] Frank Nielsen. Cramér-Rao lower bound and information geometry. In Rajendra Bhatia, C. S. Rajan, and Ajit Iqbal Singh, editors, Connected at Infinity II: A selection of mathematics by Indians. Hindustan Book Agency (Texts and Readings in Mathematics, TRIM). arXiv.
[14] Frank Nielsen. Visual Computing: Geometry, Graphics, and Vision. Charles River Media / Thomson Delmar Learning.
[15] Frank Nielsen. Chernoff information of exponential families. arXiv.
[16] Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In International Conference on Pattern Recognition (ICPR).
[17] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE. Preliminary technical report on arXiv.

Slide 62: Bibliographic references IV
[18] Frank Nielsen. Hypothesis testing, information divergence and computational geometry. In F. Nielsen and F. Barbaresco, editors, Geometric Science of Information (GSI), volume 8085 of LNCS. Springer.
[19] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3).
[20] Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters, PP(99).
[21] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8), August 2011.
[22] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv:0911.4863.

Slide 63: Bibliographic references V
[23] Frank Nielsen, Meizhu Liu, Xiaojing Ye, and Baba C. Vemuri. Jensen divergence based SPD matrix means and applications. In International Conference on Pattern Recognition (ICPR).
[24] Frank Nielsen and Richard Nock. The dual Voronoi diagrams with respect to representational Bregman divergences. In International Symposium on Voronoi Diagrams (ISVD), pages 71-78.
[25] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6), June 2009.
[26] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In International Conference on Image Processing (ICIP).
[27] Frank Nielsen and Richard Nock. Hyperbolic Voronoi diagrams made easy. In International Conference on Computational Science and its Applications (ICCSA), volume 1, pages 74-80, Los Alamitos, CA, USA. IEEE Computer Society.
[28] Frank Nielsen and Richard Nock. On Rényi and Tsallis entropies and divergences for exponential families. arXiv, 2011.

Slide 64: Bibliographic references VI
[29] Frank Nielsen and Richard Nock. Skew Jensen-Bregman Voronoi diagrams. Transactions on Computational Science, 14.
[30] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45(3).
[31] Olivier Schwander and Frank Nielsen. PyMEF: A framework for exponential families in Python. In IEEE/SP Workshop on Statistical Signal Processing (SSP).
[32] Olivier Schwander and Frank Nielsen. Learning mixtures by simplifying kernel density estimators. In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry.
[33] Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Transactions on Biomedical Engineering, 58(5).
[34] Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, 2011.


More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

arxiv: v1 [cs.lg] 1 May 2010

arxiv: v1 [cs.lg] 1 May 2010 A Geometric View of Conjugate Priors Arvind Agarwal and Hal Daumé III School of Computing, University of Utah, Salt Lake City, Utah, 84112 USA {arvind,hal}@cs.utah.edu arxiv:1005.0047v1 [cs.lg] 1 May 2010

More information

Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

Statistical physics models belonging to the generalised exponential family

Statistical physics models belonging to the generalised exponential family Statistical physics models belonging to the generalised exponential family Jan Naudts Universiteit Antwerpen 1. The generalized exponential family 6. The porous media equation 2. Theorem 7. The microcanonical

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling

Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling Paul Marriott 1, Radka Sabolova 2, Germain Van Bever 2, and Frank Critchley 2 1 University of Waterloo, Waterloo, Ontario,

More information

Notation. General. Notation Description See. Sets, Functions, and Spaces. a b & a b The minimum and the maximum of a and b

Notation. General. Notation Description See. Sets, Functions, and Spaces. a b & a b The minimum and the maximum of a and b Notation General Notation Description See a b & a b The minimum and the maximum of a and b a + & a f S u The non-negative part, a 0, and non-positive part, (a 0) of a R The restriction of the function

More information

William P. Thurston. The Geometry and Topology of Three-Manifolds

William P. Thurston. The Geometry and Topology of Three-Manifolds William P. Thurston The Geometry and Topology of Three-Manifolds Electronic version 1.1 - March 00 http://www.msri.org/publications/books/gt3m/ This is an electronic edition of the 1980 notes distributed

More information

Robustness and duality of maximum entropy and exponential family distributions

Robustness and duality of maximum entropy and exponential family distributions Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat

More information

Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4 Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.

More information

Polyhedral Computation. Linear Classifiers & the SVM

Polyhedral Computation. Linear Classifiers & the SVM Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

2. A die is rolled 3 times, the probability of getting a number larger than the previous number each time is

2. A die is rolled 3 times, the probability of getting a number larger than the previous number each time is . If P(A) = x, P = 2x, P(A B) = 2, P ( A B) = 2 3, then the value of x is (A) 5 8 5 36 6 36 36 2. A die is rolled 3 times, the probability of getting a number larger than the previous number each time

More information

NONPARAMETRIC BAYESIAN INFERENCE ON PLANAR SHAPES

NONPARAMETRIC BAYESIAN INFERENCE ON PLANAR SHAPES NONPARAMETRIC BAYESIAN INFERENCE ON PLANAR SHAPES Author: Abhishek Bhattacharya Coauthor: David Dunson Department of Statistical Science, Duke University 7 th Workshop on Bayesian Nonparametrics Collegio

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

Patch matching with polynomial exponential families and projective divergences

Patch matching with polynomial exponential families and projective divergences Patch matching with polynomial exponential families and projective divergences Frank Nielsen 1 and Richard Nock 2 1 École Polytechnique and Sony Computer Science Laboratories Inc Frank.Nielsen@acm.org

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information