Information geometry of mirror descent


1 Information geometry of mirror descent
Geometric Science of Information
Anthea Monod
Department of Statistical Science, Duke University
Information Initiative at Duke
Joint work with G. Raskutti (UW Madison) and S. Mukherjee (Duke)
29 Oct 2015

2 Optimization of large-scale problems
Optimization of a function f(θ) where θ ∈ R^p: a central problem in modern applications, e.g. machine learning.
O(√p): convergence rate of standard subgradient descent.
Mirror descent [A. Nemirovski; A. Beck & M. Teboulle, 2003]:
O(log p): convergence rate of mirror descent.
A widely used tool in optimization and machine learning.

3 Differential geometry in statistics
(1) Cramér-Rao lower bound (Rao 1945): a lower bound on the variance of an estimator is given as a function of curvature. Sometimes called the Cramér-Rao-Fréchet-Darmois lower bound.
(2) Invariant (non-informative) priors (Jeffreys 1946): an uninformative prior distribution on a parameter space is based on a differential form.
(3) Information geometry (Amari 1985): the differential geometry of probability distributions.

4 Stochastic gradient descent
Given a convex differentiable cost function f : Θ → R, generate a sequence of parameters {θ_t}_{t≥1}, each incurring a loss f(θ_t), that minimizes the regret at time T:
    Σ_{t=1}^T f(θ_t).
One solution is the gradient step
    θ_{t+1} = θ_t − α_t ∇f(θ_t),
where (α_t)_{t≥0} denotes a sequence of step sizes.
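To make the update concrete, here is a minimal sketch of this iteration in Python; the quadratic cost and the 1/√t step-size schedule are illustrative choices, not from the slides.

    import numpy as np

    def gradient_descent(grad_f, theta0, n_steps=100, alpha0=0.5):
        """Iterate theta_{t+1} = theta_t - alpha_t * grad f(theta_t)."""
        theta = np.asarray(theta0, dtype=float)
        for t in range(n_steps):
            alpha_t = alpha0 / np.sqrt(t + 1)  # a common diminishing step-size schedule
            theta = theta - alpha_t * grad_f(theta)
        return theta

    # Example: f(theta) = 0.5 * ||theta - c||_2^2 has gradient theta - c and minimizer c.
    c = np.array([1.0, -2.0, 3.0])
    print(gradient_descent(lambda th: th - c, np.zeros(3)))  # approaches c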

5 Natural gradient
For certain cost functions (log-likelihoods of exponential family models), the set of parameters Θ is supported on a p-dimensional Riemannian manifold (M, H).
Typically the metric tensor H = (h_{jk}) is determined by the Fisher information matrix
    (I(θ))_{ij} = E_{Data}[ (∂f(x; θ)/∂θ_i)(∂f(x; θ)/∂θ_j) ],   i, j = 1, ..., p.

6 Natural gradient
Given a cost function f : M → R on the Riemannian manifold, the natural gradient descent step is
    θ_{t+1} = θ_t − α_t H^{−1}(θ_t) ∇f(θ_t),
where H^{−1} is the inverse of the Riemannian metric.
The natural gradient algorithm steps in the direction of steepest descent along the Riemannian manifold (M, H). It requires a matrix inversion.
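As a sketch of one natural-gradient step (assuming a user-supplied metric H(θ), e.g. the Fisher information, and gradient ∇f): solving a linear system avoids forming H^{−1} explicitly, but the O(p³) cost of the solve is exactly the matrix-inversion burden noted above.

    import numpy as np

    def natural_gradient_step(theta, grad_f, metric_H, alpha):
        """One step of theta_{t+1} = theta_t - alpha_t * H(theta_t)^{-1} grad f(theta_t)."""
        H = metric_H(theta)                            # p x p metric tensor at theta
        direction = np.linalg.solve(H, grad_f(theta))  # solve H d = grad f instead of inverting H
        return theta - alpha * direction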

7 Mirror descent
Gradient descent can be written as
    θ_{t+1} = argmin_{θ∈Θ} { ⟨θ, ∇f(θ_t)⟩ + (1/(2α_t)) ‖θ − θ_t‖_2^2 }.
For a (strictly) convex proximity function Ψ : R^p × R^p → R₊, mirror descent is
    θ_{t+1} = argmin_{θ∈Θ} { ⟨θ, ∇f(θ_t)⟩ + (1/α_t) Ψ(θ, θ_t) }.

8 Bregman divergence
Let G : Θ → R be a strictly convex, twice-differentiable function. The Bregman divergence is
    B_G(θ, θ′) = G(θ) − G(θ′) − ⟨∇G(θ′), θ − θ′⟩.
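A direct transcription of this definition (a sketch; G and ∇G are supplied as callables):

    import numpy as np

    def bregman_divergence(G, grad_G, theta, theta_p):
        """B_G(theta, theta') = G(theta) - G(theta') - <grad G(theta'), theta - theta'>."""
        theta = np.asarray(theta, dtype=float)
        theta_p = np.asarray(theta_p, dtype=float)
        return G(theta) - G(theta_p) - grad_G(theta_p) @ (theta - theta_p)

    # Sanity check: G(theta) = 0.5 ||theta||_2^2 recovers half the squared Euclidean distance.
    G = lambda th: 0.5 * th @ th
    grad_G = lambda th: th
    assert np.isclose(bregman_divergence(G, grad_G, np.array([1.0, 2.0]), np.zeros(2)), 2.5)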

9 Bregman divergences for exponential families

    Family                     G(θ)               B_G(θ, θ′)
    N(θ, I_{p×p})              (1/2)‖θ‖_2^2       (1/2)‖θ − θ′‖_2^2
    Poi(exp(θ))                exp(θ)             exp(θ) − exp(θ′) − ⟨exp(θ′), θ − θ′⟩
    Be(exp(θ)/(1+exp(θ)))      log(1 + exp(θ))    log((1+exp(θ))/(1+exp(θ′))) − ⟨exp(θ′)/(1+exp(θ′)), θ − θ′⟩

10 Mirror descent
Mirror descent using the Bregman divergence as the proximity function:
    θ_{t+1} = argmin_θ { ⟨θ, ∇f(θ_t)⟩ + (1/α_t) B_G(θ, θ_t) }.
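Setting the gradient of this objective to zero gives the mirror-map form ∇G(θ_{t+1}) = ∇G(θ_t) − α_t ∇f(θ_t), i.e. θ_{t+1} = (∇G)^{−1}(∇G(θ_t) − α_t ∇f(θ_t)). A sketch for the Poisson-type choice G(θ) = Σ_i exp(θ_i), where ∇G = exp and (∇G)^{−1} = log act coordinate-wise (my choice of example, not from the slides):

    import numpy as np

    def mirror_step_poisson(theta, grad_f_theta, alpha):
        """Mirror descent step for G(theta) = sum(exp(theta)):
        theta_{t+1} = log(exp(theta_t) - alpha_t * grad f(theta_t))."""
        mu_next = np.exp(theta) - alpha * grad_f_theta  # the step is taken in dual coordinates
        if np.any(mu_next <= 0):
            raise ValueError("step size too large: left the dual domain (0, inf)")
        return np.log(mu_next)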

11 Convex duals
The convex conjugate of a function G is defined to be
    H(µ) := sup_{θ∈Θ} { ⟨θ, µ⟩ − G(θ) }.
Let µ = g(θ) ∈ Φ be the extremal point of the dual, where g = ∇G. The dual Bregman divergence B_H : Φ × Φ → R₊ is
    B_H(µ, µ′) = H(µ) − H(µ′) − ⟨∇H(µ′), µ − µ′⟩.
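For example, with G(θ) = exp(θ) the supremum is attained where µ = exp(θ), i.e. θ = log µ, so that
    H(µ) = ⟨log µ, µ⟩ − exp(log µ) = µ log µ − µ,
which is the second row of the table on the next slide.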

12 Dual Bregman divergences for exponential families

    G(θ)                H(µ)                          B_H(µ, µ′)
    (1/2)‖θ‖_2^2        (1/2)‖µ‖_2^2                  (1/2)‖µ − µ′‖_2^2
    exp(θ)              ⟨µ, log µ⟩ − µ                ⟨µ, log(µ/µ′)⟩ − µ + µ′
    log(1 + exp(θ))     µ log µ + (1−µ) log(1−µ)      µ log(µ/µ′) + (1−µ) log((1−µ)/(1−µ′))

13 Manifolds in primal and dual co-ordinates
B_G(·, ·) induces a Riemannian manifold (Θ, ∇²G) in the primal co-ordinates.
Let Φ be the image of Θ under the continuous map g = ∇G.
B_H : Φ × Φ → R₊ induces the same Riemannian manifold, (Φ, ∇²H), in the dual co-ordinates Φ.

14 Equivalence
Theorem (Raskutti, Mukherjee). The mirror descent step with Bregman divergence defined by G, applied to a function f in the space Θ, is equivalent to the natural gradient step along the Riemannian manifold (Φ, ∇²H) in dual co-ordinates.
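A quick numerical illustration of the theorem in the Poisson-type setting above (my own sketch: G(θ) = Σ_i exp(θ_i), so µ = exp(θ), ∇²H(µ) = diag(1/µ), and ∇_µ f = diag(1/µ) ∇_θ f by the chain rule):

    import numpy as np

    theta = np.array([0.1, -0.3, 0.5, 0.0])
    grad_f_theta = np.array([0.2, -0.1, 0.3, 0.05])  # gradient of f in primal coordinates
    alpha = 0.1

    # Mirror descent step in primal coordinates (slide 10):
    theta_md = np.log(np.exp(theta) - alpha * grad_f_theta)

    # Natural gradient step in dual coordinates (slide 6 with metric Hess H):
    mu = np.exp(theta)
    grad_f_mu = grad_f_theta / mu            # chain rule: d(theta)/d(mu) = diag(1/mu)
    mu_ng = mu - alpha * mu * grad_f_mu      # [Hess H(mu)]^{-1} = diag(mu)

    assert np.allclose(np.exp(theta_md), mu_ng)  # the two steps coincide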

15 Consequences
Exponential family with density p(y | θ) = h(y) exp(⟨θ, y⟩ − G(θ)).
Consider the following mirror descent step given the observation y_t:
    θ_{t+1} = argmin_θ { ⟨θ, ∇_θ B_G(θ, h(y_t))|_{θ=θ_t}⟩ + (1/α_t) B_G(θ, θ_t) }.
In dual coordinates one would minimize the log loss
    f_t(µ; y_t) = −log p(y_t | µ) = B_H(y_t, µ).
The natural gradient step is
    µ_{t+1} = µ_t − α_t [∇²H(µ_t)]^{−1} ∇_µ B_H(y_t, µ_t)  ⟹  µ_{t+1} = µ_t − α_t (µ_t − y_t):
the curvature of the loss B_H(y_t, µ_t) matches the metric tensor ∇²H(µ).
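In code, the dual-coordinate step µ_{t+1} = µ_t − α_t (µ_t − y_t) is a one-line running update toward each new observation (a sketch; the data stream and step sizes are placeholders):

    def dual_updates(ys, mu0, alphas):
        """Run mu_{t+1} = mu_t - alpha_t * (mu_t - y_t) over a stream of observations."""
        mu = mu0
        for y, alpha in zip(ys, alphas):
            mu = mu - alpha * (mu - y)
        return mu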

16 Statistical efficiency
Given independent samples Y^T = (y_1, ..., y_T), a sequence of unbiased estimators µ̂_T is Fisher efficient if
    lim_{T→∞} T · E_{Y^T}[(µ̂_T − µ)(µ̂_T − µ)^⊤] = [∇²H]^{−1},
where [∇²H]^{−1} is the inverse of the Fisher information matrix.
Theorem (Raskutti, Mukherjee). The mirror descent step applied to the log loss f_t(µ; y_t) above, with step sizes α_t = 1/t, asymptotically achieves the Cramér-Rao lower bound.
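With α_t = 1/t the recursion µ_{t+1} = µ_t − α_t (µ_t − y_t) unrolls to the running sample mean µ_T = (1/T) Σ_{t=1}^T y_t, the classical efficient estimator of the mean parameter; a quick check of this unrolling (my own illustration):

    import numpy as np

    ys = np.array([0.7, 1.3, 0.4, 1.6, 1.0])
    mu = ys[0]                            # mu_1 = y_1
    for t, y in enumerate(ys[1:], start=2):
        mu = mu - (1.0 / t) * (mu - y)    # step size alpha_t = 1/t
    assert np.isclose(mu, ys.mean())      # mu_T is exactly the sample mean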

17 Challenges
(1) Information geometry on mixtures of manifolds.
(2) Proximity functions for functions over the Grassmannian.
(3) EM algorithms for mixtures.

18 Acknowledgements
Funding: Center for Systems Biology at Duke; NSF (DMS and CCF); DARPA; AFOSR; NIH.
