Information geometry of mirror descent


1 Information geometry of mirror descent
Geometric Science of Information
Anthea Monod
Department of Statistical Science, Duke University
Information Initiative at Duke
Joint work with G. Raskutti (UW Madison) and S. Mukherjee (Duke)
29 Oct 2015

2 Optimization of large-scale problems
Optimization of a function f(θ) where θ ∈ R^p: a central problem in modern applications, e.g. machine learning.
O(√p): convergence rate of standard subgradient descent.
Mirror descent [A. Nemirovski; A. Beck & M. Teboulle, 2003]:
O(log p): convergence rate of mirror descent.
A widely used tool in optimization and machine learning.

3 Differential geometry in statistics
(1) Cramér-Rao lower bound (Rao 1945): a lower bound on the variance of an estimator is given as a function of curvature. Sometimes called the Cramér-Rao-Fréchet-Darmois lower bound.
(2) Invariant (non-informative) priors (Jeffreys 1946): an uninformative prior distribution on a parameter space is based on a differential form.
(3) Information geometry (Amari 1985): the differential geometry of probability distributions.

4 Stochastic gradient descent
Given a convex differentiable cost function f : Θ → R, generate a sequence of parameters {θ_t}_{t≥1}, each incurring a loss f(θ_t), that minimizes the regret at time T:
    Σ_{t=1}^T f(θ_t).
One solution is the gradient step
    θ_{t+1} = θ_t − α_t ∇f(θ_t),
where (α_t)_{t≥0} denotes a sequence of step sizes.
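To make the update concrete, here is a minimal sketch of this iteration in Python; the quadratic cost and the 1/√t step-size schedule are illustrative choices, not from the slides.

    import numpy as np

    def gradient_descent(grad_f, theta0, n_steps=100, alpha0=0.5):
        """Iterate theta_{t+1} = theta_t - alpha_t * grad f(theta_t)."""
        theta = np.asarray(theta0, dtype=float)
        for t in range(n_steps):
            alpha_t = alpha0 / np.sqrt(t + 1)  # a common diminishing step-size schedule
            theta = theta - alpha_t * grad_f(theta)
        return theta

    # Example: f(theta) = 0.5 * ||theta - c||_2^2 has gradient theta - c and minimizer c.
    c = np.array([1.0, -2.0, 3.0])
    print(gradient_descent(lambda th: th - c, np.zeros(3)))  # approaches c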

5 Natural gradient
For certain cost functions (log-likelihoods of exponential family models), the set of parameters Θ is supported on a p-dimensional Riemannian manifold (M, H).
Typically the metric tensor H = (h_{jk}) is determined by the Fisher information matrix
    (I(θ))_{ij} = E_{Data}[ (∂f(x; θ)/∂θ_i)(∂f(x; θ)/∂θ_j) ],   i, j = 1, ..., p.

6 Natural gradient
Given a cost function f : M → R on the Riemannian manifold, the natural gradient descent step is
    θ_{t+1} = θ_t − α_t H^{−1}(θ_t) ∇f(θ_t),
where H^{−1} is the inverse of the Riemannian metric.
The natural gradient algorithm steps in the direction of steepest descent along the Riemannian manifold (M, H). It requires a matrix inversion.
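As a sketch of one natural-gradient step (assuming a user-supplied metric H(θ), e.g. the Fisher information, and gradient ∇f): solving a linear system avoids forming H^{−1} explicitly, but the O(p³) cost of the solve is exactly the matrix-inversion burden noted above.

    import numpy as np

    def natural_gradient_step(theta, grad_f, metric_H, alpha):
        """One step of theta_{t+1} = theta_t - alpha_t * H(theta_t)^{-1} grad f(theta_t)."""
        H = metric_H(theta)                            # p x p metric tensor at theta
        direction = np.linalg.solve(H, grad_f(theta))  # solve H d = grad f instead of inverting H
        return theta - alpha * direction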

7 Mirror descent
Gradient descent can be written as
    θ_{t+1} = argmin_{θ∈Θ} { ⟨θ, ∇f(θ_t)⟩ + (1/(2α_t)) ‖θ − θ_t‖_2^2 }.
For a (strictly) convex proximity function Ψ : R^p × R^p → R₊, mirror descent is
    θ_{t+1} = argmin_{θ∈Θ} { ⟨θ, ∇f(θ_t)⟩ + (1/α_t) Ψ(θ, θ_t) }.

8 Bregman divergence
Let G : Θ → R be a strictly convex, twice-differentiable function. The Bregman divergence is
    B_G(θ, θ′) = G(θ) − G(θ′) − ⟨∇G(θ′), θ − θ′⟩.
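A direct transcription of this definition (a sketch; G and ∇G are supplied as callables):

    import numpy as np

    def bregman_divergence(G, grad_G, theta, theta_p):
        """B_G(theta, theta') = G(theta) - G(theta') - <grad G(theta'), theta - theta'>."""
        theta = np.asarray(theta, dtype=float)
        theta_p = np.asarray(theta_p, dtype=float)
        return G(theta) - G(theta_p) - grad_G(theta_p) @ (theta - theta_p)

    # Sanity check: G(theta) = 0.5 ||theta||_2^2 recovers half the squared Euclidean distance.
    G = lambda th: 0.5 * th @ th
    grad_G = lambda th: th
    assert np.isclose(bregman_divergence(G, grad_G, np.array([1.0, 2.0]), np.zeros(2)), 2.5)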

9 Bregman divergences for exponential families

    Family                     G(θ)               B_G(θ, θ′)
    N(θ, I_{p×p})              (1/2)‖θ‖_2^2       (1/2)‖θ − θ′‖_2^2
    Poi(exp(θ))                exp(θ)             exp(θ) − exp(θ′) − ⟨exp(θ′), θ − θ′⟩
    Be(exp(θ)/(1+exp(θ)))      log(1 + exp(θ))    log((1+exp(θ))/(1+exp(θ′))) − ⟨exp(θ′)/(1+exp(θ′)), θ − θ′⟩

10 Mirror descent
Mirror descent using the Bregman divergence as the proximity function:
    θ_{t+1} = argmin_θ { ⟨θ, ∇f(θ_t)⟩ + (1/α_t) B_G(θ, θ_t) }.
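Setting the gradient of this objective to zero gives the mirror-map form ∇G(θ_{t+1}) = ∇G(θ_t) − α_t ∇f(θ_t), i.e. θ_{t+1} = (∇G)^{−1}(∇G(θ_t) − α_t ∇f(θ_t)). A sketch for the Poisson-type choice G(θ) = Σ_i exp(θ_i), where ∇G = exp and (∇G)^{−1} = log act coordinate-wise (my choice of example, not from the slides):

    import numpy as np

    def mirror_step_poisson(theta, grad_f_theta, alpha):
        """Mirror descent step for G(theta) = sum(exp(theta)):
        theta_{t+1} = log(exp(theta_t) - alpha_t * grad f(theta_t))."""
        mu_next = np.exp(theta) - alpha * grad_f_theta  # the step is taken in dual coordinates
        if np.any(mu_next <= 0):
            raise ValueError("step size too large: left the dual domain (0, inf)")
        return np.log(mu_next)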

11 Convex duals
The convex conjugate of a function G is defined to be
    H(µ) := sup_{θ∈Θ} { ⟨θ, µ⟩ − G(θ) }.
Let µ = g(θ) ∈ Φ be the extremal point of the dual, where g = ∇G. The dual Bregman divergence B_H : Φ × Φ → R₊ is
    B_H(µ, µ′) = H(µ) − H(µ′) − ⟨∇H(µ′), µ − µ′⟩.
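For example, with G(θ) = exp(θ) the supremum is attained where µ = exp(θ), i.e. θ = log µ, so that
    H(µ) = ⟨log µ, µ⟩ − exp(log µ) = µ log µ − µ,
which is the second row of the table on the next slide.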

12 Dual Bregman divergences for exponential families

    G(θ)                H(µ)                          B_H(µ, µ′)
    (1/2)‖θ‖_2^2        (1/2)‖µ‖_2^2                  (1/2)‖µ − µ′‖_2^2
    exp(θ)              ⟨µ, log µ⟩ − µ                ⟨µ, log(µ/µ′)⟩ − µ + µ′
    log(1 + exp(θ))     µ log µ + (1−µ) log(1−µ)      µ log(µ/µ′) + (1−µ) log((1−µ)/(1−µ′))

13 Manifolds in primal and dual co-ordinates
B_G(·, ·) induces a Riemannian manifold (Θ, ∇²G) in the primal co-ordinates.
Let Φ be the image of Θ under the continuous map g = ∇G.
B_H : Φ × Φ → R₊ induces the same Riemannian manifold, (Φ, ∇²H), in the dual co-ordinates Φ.

14 Equivalence
Theorem (Raskutti, Mukherjee). The mirror descent step with Bregman divergence defined by G, applied to a function f in the space Θ, is equivalent to the natural gradient step along the Riemannian manifold (Φ, ∇²H) in dual co-ordinates.
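A quick numerical illustration of the theorem in the Poisson-type setting above (my own sketch: G(θ) = Σ_i exp(θ_i), so µ = exp(θ), ∇²H(µ) = diag(1/µ), and ∇_µ f = diag(1/µ) ∇_θ f by the chain rule):

    import numpy as np

    theta = np.array([0.1, -0.3, 0.5, 0.0])
    grad_f_theta = np.array([0.2, -0.1, 0.3, 0.05])  # gradient of f in primal coordinates
    alpha = 0.1

    # Mirror descent step in primal coordinates (slide 10):
    theta_md = np.log(np.exp(theta) - alpha * grad_f_theta)

    # Natural gradient step in dual coordinates (slide 6 with metric Hess H):
    mu = np.exp(theta)
    grad_f_mu = grad_f_theta / mu            # chain rule: d(theta)/d(mu) = diag(1/mu)
    mu_ng = mu - alpha * mu * grad_f_mu      # [Hess H(mu)]^{-1} = diag(mu)

    assert np.allclose(np.exp(theta_md), mu_ng)  # the two steps coincide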

15 Consequences
Exponential family with density p(y | θ) = h(y) exp(⟨θ, y⟩ − G(θ)).
Consider the following mirror descent step given the observation y_t:
    θ_{t+1} = argmin_θ { ⟨θ, ∇_θ B_G(θ, h(y_t))|_{θ=θ_t}⟩ + (1/α_t) B_G(θ, θ_t) }.
In dual coordinates one would minimize the log loss
    f_t(µ; y_t) = −log p(y_t | µ) = B_H(y_t, µ).
The natural gradient step is
    µ_{t+1} = µ_t − α_t [∇²H(µ_t)]^{−1} ∇_µ B_H(y_t, µ_t)  ⟹  µ_{t+1} = µ_t − α_t (µ_t − y_t):
the curvature of the loss B_H(y_t, µ_t) matches the metric tensor ∇²H(µ).
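In code, the dual-coordinate step µ_{t+1} = µ_t − α_t (µ_t − y_t) is a one-line running update toward each new observation (a sketch; the data stream and step sizes are placeholders):

    def dual_updates(ys, mu0, alphas):
        """Run mu_{t+1} = mu_t - alpha_t * (mu_t - y_t) over a stream of observations."""
        mu = mu0
        for y, alpha in zip(ys, alphas):
            mu = mu - alpha * (mu - y)
        return mu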

16 Statistical efficiency
Given independent samples Y^T = (y_1, ..., y_T), a sequence of unbiased estimators µ̂_T is Fisher efficient if
    lim_{T→∞} T · E_{Y^T}[(µ̂_T − µ)(µ̂_T − µ)^⊤] = [∇²H]^{−1},
where [∇²H]^{−1} is the inverse of the Fisher information matrix.
Theorem (Raskutti, Mukherjee). The mirror descent step applied to the log loss f_t(µ; y_t) above, with step sizes α_t = 1/t, asymptotically achieves the Cramér-Rao lower bound.
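With α_t = 1/t the recursion µ_{t+1} = µ_t − α_t (µ_t − y_t) unrolls to the running sample mean µ_T = (1/T) Σ_{t=1}^T y_t, the classical efficient estimator of the mean parameter; a quick check of this unrolling (my own illustration):

    import numpy as np

    ys = np.array([0.7, 1.3, 0.4, 1.6, 1.0])
    mu = ys[0]                            # mu_1 = y_1
    for t, y in enumerate(ys[1:], start=2):
        mu = mu - (1.0 / t) * (mu - y)    # step size alpha_t = 1/t
    assert np.isclose(mu, ys.mean())      # mu_T is exactly the sample mean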

17 Challenges
(1) Information geometry on mixtures of manifolds.
(2) Proximity functions for functions over the Grassmannian.
(3) EM algorithms for mixtures.

18 Acknowledgements
Funding: Center for Systems Biology at Duke; NSF (DMS and CCF); DARPA; AFOSR; NIH.
