Information theoretic perspectives on learning algorithms

Varun Jog, University of Wisconsin - Madison, Departments of ECE and Mathematics
Shannon Channel Hangout! May 8, 2018
Joint work with Adrian Tovar-Lopez (Math), Ankit Pensia (CS), and Po-Ling Loh (Stats)

Curve fitting
Figure: Given $n$ points in $\mathbb{R}^2$, fit a curve
Forward problem: from dataset to curve

Finding the right fit
Figure: left is a good fit, right is overfit
The overfit curve is too wiggly and not stable

Guessing points from curve
Figure: Given the curve, find the $n$ points
Backward problem: from curve to dataset
The backward problem is easier for the overfitted curve!
The overfitted curve contains more information about the dataset

This talk
Explore the connection between information and overfitting (Xu & Raginsky, 2017)
Analyze generalization error for a large and general class of learning algorithms (Pensia, Jog, Loh, 2018)
Measure information via optimal transport theory (Tovar-Lopez, Jog, 2018)
Speculations, open problems, etc.

Learning algorithm as a channel
S -> Algorithm -> W
Input: dataset $S$ of $n$ i.i.d. samples $(X_1, X_2, \ldots, X_n) \sim \mu^n$
Output: hypothesis $W$
Designing the algorithm is equivalent to designing the channel $P_{W|S}$. Very different from channel coding!

Goal of $P_{W|S}$
Loss function: $l : \mathcal{W} \times \mathcal{X} \to \mathbb{R}$
Best choice is $w^* = \operatorname{argmin}_{w \in \mathcal{W}} \mathbb{E}_{X \sim \mu}[l(w, X)]$
Can't always get what we want...
Minimize the empirical loss instead: $l_n(w, S) = \frac{1}{n} \sum_{i=1}^{n} l(w, X_i)$

Generalization error
Expected loss (test error): $\mathbb{E}[l(W, X)]$ with $W \sim P_W$ induced by $P_{W|S} P_S$ and a fresh $X \sim \mu$
Expected empirical loss (train error): $\mathbb{E}_{P_{WS}}[l_n(W, S)]$
The loss has two parts:
Expected loss = (Expected loss - Expected empirical loss) + Expected empirical loss = (test error - train error) + train error
Generalization error = test error - train error:
$\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}_{P_S \otimes P_W}[l_n(W, S)] - \mathbb{E}_{P_{WS}}[l_n(W, S)]$
Ideally, we want both to be small. Often, the two are analyzed separately.
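
To make the definition concrete, here is a minimal numerical sketch (mine, not from the talk) that estimates train error, test error, and their gap for a toy learner; the Gaussian data model, squared loss, and shrunken-mean estimator are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(S, lam=0.1):
    """Toy learning algorithm P_{W|S}: shrunken sample mean of the dataset S."""
    return S.mean() / (1.0 + lam)

def loss(w, x):
    """Loss l(w, x): squared error between hypothesis w and sample x."""
    return (w - x) ** 2

n, trials = 20, 5000
train_err, test_err = 0.0, 0.0
for _ in range(trials):
    S = rng.normal(loc=1.0, scale=1.0, size=n)       # S ~ mu^n with mu = N(1, 1)
    W = train(S)
    train_err += loss(W, S).mean()                   # empirical loss l_n(W, S)
    test_err += loss(W, rng.normal(1.0, 1.0, size=10_000)).mean()  # fresh X ~ mu
train_err /= trials
test_err /= trials
print(f"train {train_err:.3f}  test {test_err:.3f}  gen gap {test_err - train_err:.3f}")
```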

Basics of mutual information
Mutual information $I(X; Y)$ precisely quantifies the information between $(X, Y) \sim P_{XY}$:
$I(X; Y) = \mathrm{KL}(P_{XY} \,\|\, P_X \otimes P_Y)$
It satisfies two nice properties.
Data processing inequality: if $X \to Y \to Z$ is a Markov chain, then $I(X; Y) \ge I(X; Z)$
Chain rule: $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$

Bounding generalization error using $I(W; S)$
Theorem (Xu & Raginsky, 2017). Assume that $l(w, X)$ is $R$-subgaussian for every $w \in \mathcal{W}$. Then the following bound holds:
$|\mathrm{gen}(\mu, P_{W|S})| \le \sqrt{\frac{2R^2}{n} I(S; W)}. \qquad (1)$
This gives data-dependent bounds on the generalization error.
If $I(W; S) \le \epsilon$, call $P_{W|S}$ $(\epsilon, \mu)$-stable.
This notion of stability is different from traditional notions.

Proof sketch
Lemma (key lemma in Xu & Raginsky, 2017). If $f(X, Y)$ is $\sigma$-subgaussian under $P_X \otimes P_Y$, then
$|\mathbb{E} f(X, Y) - \mathbb{E} f(\bar{X}, \bar{Y})| \le \sqrt{2\sigma^2 I(X; Y)},$
where $(X, Y) \sim P_{XY}$ and $(\bar{X}, \bar{Y}) \sim P_X \otimes P_Y$.
Recall $I(X; Y) = \mathrm{KL}(P_{XY} \,\|\, P_X \otimes P_Y)$.
The lemma follows directly from the variational (Donsker-Varadhan) characterization of KL divergence:
$\mathrm{KL}(\mu \| \nu) = \sup_F \left( \int F \, d\mu - \log \int e^F \, d\nu \right)$
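
For completeness, here is the standard way to fill in this step (my reconstruction, not spelled out on the slide): apply the variational characterization to $F = \lambda f$ and optimize over $\lambda$.

```latex
% Take F = \lambda f, \lambda \in \mathbb{R}, in the variational formula with
% \mu = P_{XY} and \nu = P_X \otimes P_Y:
\[
I(X;Y) \;\ge\; \lambda\,\mathbb{E}\,f(X,Y) \;-\; \log \mathbb{E}\,e^{\lambda f(\bar X,\bar Y)}
        \;\ge\; \lambda\bigl(\mathbb{E}\,f(X,Y) - \mathbb{E}\,f(\bar X,\bar Y)\bigr) \;-\; \frac{\lambda^2\sigma^2}{2},
\]
% where the second step uses \sigma-subgaussianity of f under P_X \otimes P_Y.
% Maximizing the right-hand side over \lambda (optimum \lambda = \Delta/\sigma^2, value \Delta^2/2\sigma^2) gives
\[
\bigl|\mathbb{E}\,f(X,Y) - \mathbb{E}\,f(\bar X,\bar Y)\bigr| \;\le\; \sqrt{2\sigma^2\, I(X;Y)}.
\]
```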

How to use it: key insight
Figure: update $W_t$ using some update rule to generate $W_{t+1}$
Many learning algorithms are iterative
They generate $W_0, W_1, W_2, \ldots, W_T$ and output $W = f(W_0, \ldots, W_T)$; for example, $W = W_T$ or $W = \frac{1}{T} \sum_i W_i$
Bound $I(W; S)$ by controlling the information accumulated at each iteration

Noisy, iterative algorithms
For $t \ge 1$, sample $Z_t$ from $S$ and compute a direction $F(W_{t-1}, Z_t) \in \mathbb{R}^d$
Move in that direction after scaling by a stepsize $\eta_t$
Perturb the result by isotropic Gaussian noise $\xi_t \sim N(0, \sigma_t^2 I_d)$
Overall update equation: $W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t, \quad t \ge 1$
Run for $T$ steps and output $W = f(W_0, \ldots, W_T)$
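
A minimal sketch of one algorithm in this class, assuming $F$ is a clipped mini-batch gradient of a squared loss; the data model, loss, clipping constant, and schedules are my illustrative choices, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def F(w, z, L=1.0):
    """Direction F(w, z): gradient of the batch-average squared loss, clipped so ||F|| <= L."""
    g = 2.0 * (w - z).mean(axis=0)     # gradient of (1/|Z|) * sum_i ||w - z_i||^2
    norm = np.linalg.norm(g)
    return g if norm <= L else g * (L / norm)

d, n, T = 5, 100, 200
S = rng.normal(size=(n, d))            # dataset S ~ mu^n
W = np.zeros(d)                        # W_0
eta = lambda t: 0.5 / t                # stepsize schedule eta_t
sigma = lambda t: np.sqrt(eta(t))      # noise level sigma_t (SGLD-style choice)

for t in range(1, T + 1):
    Z = S[rng.choice(n, size=10, replace=False)]   # sample Z_t from S
    xi = sigma(t) * rng.normal(size=d)             # xi_t ~ N(0, sigma_t^2 I_d)
    W = W - eta(t) * F(W, Z) + xi                  # W_t = W_{t-1} - eta_t F(W_{t-1}, Z_t) + xi_t
print("final W:", np.round(W, 3))
```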

Main assumptions
Update equation: $W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t, \quad t \ge 1$
Assumption 1: $l(w, Z)$ is $R$-subgaussian
Assumption 2: bounded updates, i.e., $\sup_{w, z} \|F(w, z)\| \le L$
Assumption 3: sampling does not look at the $W_t$'s, i.e., $P(Z_{t+1} \mid Z^{(t)}, W^{(t)}, S) = P(Z_{t+1} \mid Z^{(t)}, S)$

Graphical model
Figure: graphical model illustrating the Markov properties among the random variables in the algorithm: the dataset $S$ feeds the samples $Z_1, \ldots, Z_T$, and each $W_t$ is generated from $(W_{t-1}, Z_t)$ together with fresh noise

Main result
Theorem (Pensia, Jog, Loh, 2018). The mutual information satisfies the bound
$I(S; W) \le \sum_{t=1}^{T} \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right).$
The bound depends on $T$: the longer you optimize, the higher the risk of overfitting.
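
A small helper (my own, for illustration) that evaluates this bound and plugs it into (1); the schedules in the example call are arbitrary.

```python
import math

def mi_bound(etas, sigmas, d, L):
    """Sum_{t=1}^T (d/2) * log(1 + eta_t^2 L^2 / (d sigma_t^2)) -- upper bound on I(S; W)."""
    return sum(0.5 * d * math.log(1.0 + (eta**2 * L**2) / (d * sigma**2))
               for eta, sigma in zip(etas, sigmas))

def gen_bound(etas, sigmas, d, L, R, n):
    """Plug the MI bound into |gen| <= sqrt(2 R^2 I(S;W) / n)."""
    return math.sqrt(2.0 * R**2 * mi_bound(etas, sigmas, d, L) / n)

T, d, L, R, n = 1000, 10, 1.0, 1.0, 10_000
etas   = [0.5 / t for t in range(1, T + 1)]
sigmas = [math.sqrt(e) for e in etas]          # sigma_t^2 = eta_t
print(f"I(S;W) <= {mi_bound(etas, sigmas, d, L):.3f}")
print(f"|gen|  <= {gen_bound(etas, sigmas, d, L, R, n):.4f}")
```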

Implications for $\mathrm{gen}(\mu, P_{W|S})$
Corollary (bound in expectation). The generalization error of our class of iterative algorithms is bounded by
$|\mathrm{gen}(\mu, P_{W|S})| \le \sqrt{\frac{R^2}{n} \sum_{t=1}^{T} \frac{\eta_t^2 L^2}{\sigma_t^2}}.$
Corollary (high-probability bound). Let $\epsilon = \sum_{t=1}^{T} \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right)$. For any $\alpha > 0$ and $0 < \beta \le 1$, if
$n > \frac{8R^2}{\alpha^2} \left( \frac{\epsilon}{\beta} + \log\frac{2}{\beta} \right),$
then
$P_{S,W}\left( |L_\mu(W) - L_S(W)| > \alpha \right) \le \beta, \qquad (2)$
where the probability is with respect to $S \sim \mu^n$ and $W$.
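 
A one-liner (mine) that evaluates the sample-size requirement of the high-probability corollary for given accuracy $\alpha$, confidence $\beta$, and information budget $\epsilon$; the example values are arbitrary.

```python
import math

def n_required(R, alpha, beta, eps):
    """Smallest n satisfying n > (8 R^2 / alpha^2) * (eps / beta + log(2 / beta))."""
    return math.floor((8.0 * R**2 / alpha**2) * (eps / beta + math.log(2.0 / beta))) + 1

# Example: R = 1, accuracy alpha = 0.1, confidence beta = 0.05, MI budget eps = 1 nat.
print(n_required(R=1.0, alpha=0.1, beta=0.05, eps=1.0))
```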

Applications: SGLD
SGLD iterates are $W_t = W_{t-1} - \eta_t \nabla_w l(W_{t-1}, Z_t) + \xi_t$, with $\xi_t \sim N(0, \sigma_t^2 I_d)$
Common experimental practices for SGLD [Welling & Teh, 2011]:
1. the noise variance is $\sigma_t^2 = \eta_t$,
2. the algorithm is run for $K$ epochs, i.e., $T = nK$,
3. for a constant $c > 0$, the stepsizes are $\eta_t = c/t$.
Expectation bound: using $\sum_{t=1}^{T} \frac{1}{t} \le \log(T) + 1$,
$|\mathrm{gen}(\mu, P_{W|S})| \le \frac{RL}{\sqrt{n}} \sqrt{\sum_{t=1}^{T} \eta_t} \le \frac{RL}{\sqrt{n}} \sqrt{c \log T + c}$
The best known bounds, by Mou et al. (2017), are $O(1/n)$, but our bounds apply more generally.
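
A quick numerical check (mine, not from the talk) that the closed form above dominates the sum form under the stated schedule; the constants are arbitrary, and the mi_bound / gen_bound helpers mentioned in the trailing comment are the hypothetical ones sketched earlier.

```python
import math

R, L, c = 1.0, 1.0, 0.1
n, K = 5_000, 2                      # dataset size and number of epochs
T = n * K                            # T = nK iterations
etas = [c / t for t in range(1, T + 1)]                    # eta_t = c / t

sum_form    = (R * L / math.sqrt(n)) * math.sqrt(sum(etas))                  # RL/sqrt(n) * sqrt(sum eta_t)
closed_form = (R * L / math.sqrt(n)) * math.sqrt(c * math.log(T) + c)        # RL/sqrt(n) * sqrt(c log T + c)
print(f"sum form {sum_form:.5f} <= closed form {closed_form:.5f}")
# The fully general bound sqrt(2 R^2 I(S;W) / n) is slightly tighter still, since
# log(1 + x) <= x; it can be evaluated with gen_bound(etas, [sqrt(e) for e in etas], d, L, R, n).
```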

Application: perturbed SGD
Noisy versions of SGD have been proposed to escape saddle points: Ge et al. (2015), Jin et al. (2017)
Similar to SGLD, but with a different noise distribution:
$W_t = W_{t-1} - \eta \left( \nabla_w l(W_{t-1}, Z_t) + \xi_t \right)$, where $\xi_t \sim \mathrm{Unif}(B_d)$ (the unit ball in $\mathbb{R}^d$)
Our bound: $I(W; S) \le T d \log(1 + L)$
Bounds in expectation and with high probability follow directly from this bound
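
A sketch of the perturbed update, assuming a squared loss (my choice); the only nonobvious ingredient is the standard trick for sampling uniformly from the unit ball (Gaussian direction times a $U^{1/d}$ radius).

```python
import numpy as np

rng = np.random.default_rng(2)

def unif_ball(d):
    """Sample xi ~ Unif(B_d): uniform direction times radius U^(1/d)."""
    g = rng.normal(size=d)
    return (rng.uniform() ** (1.0 / d)) * g / np.linalg.norm(g)

def grad(w, z):
    """Assumed loss l(w, z) = ||w - z||^2 / 2, so grad_w l = w - z."""
    return w - z

d, n, T, eta = 5, 200, 100, 0.05
S = rng.normal(size=(n, d))
W = np.zeros(d)
for t in range(1, T + 1):
    Z = S[rng.integers(n)]                        # sample Z_t from S
    W = W - eta * (grad(W, Z) + unif_ball(d))     # W_t = W_{t-1} - eta (grad l + xi_t), xi_t ~ Unif(B_d)
print("final W:", np.round(W, 3))
```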

Application: noisy momentum
A modified version of stochastic gradient Hamiltonian Monte Carlo, Chen et al. (2014):
$V_t = \gamma_t V_{t-1} - \eta_t \nabla_w l(W_{t-1}, Z_t) + \xi_t$
$W_t = W_{t-1} + \gamma_t V_{t-1} - \eta_t \nabla_w l(W_{t-1}, Z_t) + \xi_t = W_{t-1} + V_t$
The difference is the addition of noise to the velocity term $V_t$
Treating $(V_t, W_t)$ as a single parameter gives
$I(S; W) \le \sum_{t=1}^{T} \frac{2d}{2} \log\left(1 + \frac{2\eta_t^2 L^2}{2d \sigma_t^2}\right)$
The same bound also holds for a noisy version of Nesterov's accelerated gradient descent method (1983)

Proof sketch
Lots of Markov chains!
$I(W; S) \le I(W_0^T; Z_1^T)$ because $S \to Z_1^T \to W_0^T \to W$ (data processing inequality)
The iterative structure gives the dependence order $W_0 \to Z_1 \to W_1 \to Z_2 \to W_2 \to Z_3 \to W_3 \to \cdots \to W_T$
Using Markovity together with the chain rule,
$I(Z_1^T; W_0^T) = \sum_{t=1}^{T} I(Z_t; W_t \mid W_{t-1})$
Bottom line: bound the one-step information between $W_t$ and $Z_t$

Proof sketch
Recall $W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t$
Using the entropy form of mutual information,
$I(W_t; Z_t \mid W_{t-1}) = h(W_t \mid W_{t-1}) - h(W_t \mid W_{t-1}, Z_t)$,
where $h(W_t \mid W_{t-1}, Z_t) = h(\xi_t)$ and the conditional variance of $W_t$ given $w_{t-1}$ is controlled by $\eta_t^2 L^2 + \sigma_t^2$ (the dimension bookkeeping is spelled out below)
The Gaussian distribution maximizes entropy for a fixed variance, giving
$I(W_t; Z_t \mid W_{t-1}) \le \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right)$
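
One way to make the dimension bookkeeping explicit (my reconstruction of the standard argument, under the stated assumptions $\|F\| \le L$ and $\xi_t \sim N(0, \sigma_t^2 I_d)$):

```latex
% Condition on W_{t-1} = w. Since W_t = w - \eta_t F(w, Z_t) + \xi_t with \|F\| \le L
% and \xi_t \sim N(0, \sigma_t^2 I_d) independent of Z_t,
\[
\mathbb{E}\bigl[\|W_t - \mathbb{E}[W_t \mid w]\|^2 \,\big|\, w\bigr] \;\le\; \eta_t^2 L^2 + d\,\sigma_t^2 .
\]
% Among R^d-valued vectors with second moment at most s, the isotropic Gaussian maximizes
% differential entropy: h \le (d/2) \log(2\pi e\, s/d). Therefore
\[
h(W_t \mid W_{t-1} = w) \le \frac{d}{2}\log\!\Bigl(2\pi e\,\frac{\eta_t^2 L^2 + d\sigma_t^2}{d}\Bigr),
\qquad
h(W_t \mid W_{t-1}, Z_t) = h(\xi_t) = \frac{d}{2}\log\bigl(2\pi e\,\sigma_t^2\bigr),
\]
% and subtracting the two entropies gives the one-step bound
\[
I(W_t; Z_t \mid W_{t-1}) \;\le\; \frac{d}{2}\log\!\Bigl(1 + \frac{\eta_t^2 L^2}{d\,\sigma_t^2}\Bigr).
\]
```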

What's wrong with mutual information
Mutual information is great, but...
If $\mu$ is not absolutely continuous w.r.t. $\nu$, then $\mathrm{KL}(\mu \| \nu) = +\infty$
In many cases the mutual information $I(W; S)$ shoots to infinity
So the bounds cannot be used for plain stochastic gradient descent (SGD) :(
Noisy algorithms are essential for using mutual-information-based bounds

Wasserstein metric
The Wasserstein distance is given by
$W_p(\mu, \nu) = \left( \inf_{P_{XY} \in \Pi(\mu, \nu)} \mathbb{E} \|X - Y\|^p \right)^{1/p}$
where $\Pi(\mu, \nu)$ is the set of couplings whose marginals are $\mu$ and $\nu$

$W_p$ for $p = 1$ and $2$
$W_1$ is also called the Earth Mover's distance or the Kantorovich-Rubinstein distance:
$W_1(\mu, \nu) = \sup \left\{ \int f \, (d\mu - d\nu) : f \text{ continuous and 1-Lipschitz} \right\}$
There is lots of fascinating theory for $W_2$ (see Topics in Optimal Transportation by Cedric Villani)
The optimal coupling in $\Pi(\mu, \nu)$ is induced by a map $T$ such that $T_\# \mu = \nu$
For $\mu$ and $\nu$ on $\mathbb{R}$,
$W_2^2(\mu, \nu) = \int |F^{-1}(x) - G^{-1}(x)|^2 \, dx$
where $F$ and $G$ are the cdfs of $\mu$ and $\nu$
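
The quantile formula makes $W_2$ easy to estimate from equal-sized samples: sort both samples and match order statistics. A minimal sketch of mine (for $W_1$ between 1-D samples, scipy.stats.wasserstein_distance offers a similar empirical estimate):

```python
import numpy as np

def w2_1d(x, y):
    """Empirical W_2 between two equal-sized 1-D samples via the quantile coupling:
    W_2^2 = (1/n) * sum_i |x_(i) - y_(i)|^2, with x_(i), y_(i) the sorted samples."""
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert x.shape == y.shape
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=100_000)
b = rng.normal(2.0, 1.0, size=100_000)
# For two Gaussians with equal variance, W_2 equals the distance between the means (here 2).
print(f"estimated W_2 = {w2_1d(a, b):.3f}")
```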

Wasserstein bounds on $\mathrm{gen}(\mu, P_{W|S})$
Assumption: $l(w, x)$ is Lipschitz in $x$ for each fixed $w$, i.e., $|l(w, x_1) - l(w, x_2)| \le L \|x_1 - x_2\|_p$
Theorem (Tovar-Lopez & Jog, 2018). If $l(w, \cdot)$ is $L$-Lipschitz in $\|\cdot\|_p$, the generalization error satisfies
$\mathrm{gen}(\mu, P_{W|S}) \le \frac{L}{n^{1/p}} \left( \int_{\mathcal{W}} W_p^p(P_S, P_{S|W=w}) \, dP_W(w) \right)^{1/p}$
This measures the average separation of $P_{S|W}$ from $P_S$ (it looks like a $p$-th moment in the space of distributions)

Wasserstein and KL
Definition. We say $\mu$ satisfies a $T_p(c)$ transportation inequality with constant $c > 0$ if for all $\nu$,
$W_p(\mu, \nu) \le \sqrt{2c \, \mathrm{KL}(\nu \| \mu)}$
Example: the standard normal satisfies the $T_2(1)$ inequality
Transport inequalities are used to prove concentration phenomena
For $p \in [1, 2]$ this inequality tensorizes: $\mu^n$ satisfies the inequality $T_p(c \, n^{2/p - 1})$
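
To make the exponent concrete (my own remark, specializing the slide's formula $n^{2/p-1}$):

```latex
\[
p = 2:\;\; \mu \in T_2(c) \;\Longrightarrow\; \mu^n \in T_2(c),
\qquad
p = 1:\;\; \mu \in T_1(c) \;\Longrightarrow\; \mu^n \in T_1(c\,n),
\]
% which is why the T_2 comparison on the next slide has no extra factor of n inside
% the Wasserstein-KL inequality, while the T_1 version picks one up.
```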

Comparison to $I(W; S)$
In general, the two bounds are not comparable.
If $\mu$ satisfies a $T_2(c)$ transportation inequality, we can compare directly:
Theorem (Tovar-Lopez & Jog, 2018). For $p = 2$, $W_2(P_S, P_{S|W}) \le \sqrt{2c \, \mathrm{KL}(P_{S|W} \| P_S)}$, and so
$\frac{L}{n^{1/2}} \left( \int_{\mathcal{W}} W_2^2(P_S, P_{S|W=w}) \, dP_W(w) \right)^{1/2} \le L \sqrt{\frac{2c}{n} I(S; W)}$
In particular, for Gaussian data the Wasserstein bound is strictly stronger

Comparison to $I(W; S)$
If $\mu$ satisfies a $T_1(c)$ transportation inequality:
Theorem (Tovar-Lopez & Jog, 2018). For $p = 1$, $W_1(P_S, P_{S|W}) \le \sqrt{2cn \, \mathrm{KL}(P_{S|W} \| P_S)}$, and so
$\frac{L}{n} \int_{\mathcal{W}} W_1(P_S, P_{S|W=w}) \, dP_W(w) \le L \sqrt{\frac{2c}{n} I(S; W)}$

Coupling-based bound on $\mathrm{gen}(\mu, P_{W|S})$
Recall the generalization error expression:
$\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}\, l_n(\bar{S}, \bar{W}) - \mathbb{E}\, l_n(S, W)$,
where $(\bar{S}, \bar{W}) \sim P_S \otimes P_W$ and $(S, W) \sim P_{WS}$
Key insight: any coupling of $(\bar{S}, \bar{W}, S, W)$ that has the correct marginals on $(S, W)$ and $(\bar{S}, \bar{W})$ leads to the same expected value above

Proof sketch
We have
$\mathrm{gen}(\mu, P_{W|S}) = \int l_n(\bar{s}, \bar{w}) \, dP_{\bar{S}\bar{W}} - \int l_n(s, w) \, dP_{SW} = \mathbb{E}_{SW\bar{S}\bar{W}} \left[ l_n(\bar{S}, \bar{W}) - l_n(S, W) \right]$
Pick $\bar{W} = W$ and use the Lipschitz property in $x$
Pick the optimal joint distribution $P_{S, \bar{S} \mid W}$ to minimize the bound

Speculations: forward and backward channels
Stability: how much does $W$ change when $S$ changes a little?
This is a property of the forward channel $P_{W|S}$
Generalization: how much does $S$ change when $W$ changes a little?
This is a property of the backward channel $P_{S|W}$
Idea: pre-process the data to deliberately make the backward channel noisy (data augmentation, smoothing, etc.)

Speculations: relation to rate-distortion theory
Rate-distortion theory is the branch of information theory dealing with lossy data compression:
$\min_{P_{Y|X}} \mathbb{E}\, d(X, Y)$ subject to $I(X; Y) \le R$
Here: minimize the distortion given by $l_n(W, S)$ subject to the mutual information constraint $I(W; S) \le \epsilon$
Existing theory applies to $d(x^n, y^n) = \sum_i d(x_i, y_i)$; however, we have $l(w, x^n) := \sum_i l(w, x_i)$
Essentially the same problem, but the connections are still unclear

Open problems
Evaluating the Wasserstein bounds in specific cases, in particular for SGD
Information-theoretic lower bounds on generalization error?
The Wasserstein bounds rely on a new notion of information: $I_W(X, Y) = W(P_X \otimes P_Y, P_{XY})$
Does it satisfy a chain rule? A data processing inequality?

Thank you!