Information theoretic perspectives on learning algorithms
Varun Jog, University of Wisconsin - Madison, Departments of ECE and Mathematics
Shannon Channel Hangout, May 8, 2018
Joint work with Adrian Tovar-Lopez (Math), Ankit Pensia (CS), and Po-Ling Loh (Stats)
Curve fitting
- Figure: Given $N$ points in $\mathbb{R}^2$, fit a curve
- Forward problem: from dataset to curve
Finding the right fit
- Figure: the left curve is a good fit, the right curve is overfit
- The overfit curve is too wiggly
- And not stable
Guessing points from curve
- Figure: Given the curve, find the $N$ points
- Backward problem: from curve to dataset
- The backward problem is easier for the overfitted curve!
- The overfitted curve contains more information about the dataset
This talk
- Explore the connection between information and overfitting (Xu & Raginsky, 2017)
- Analyze the generalization error of a large and general class of learning algorithms (Pensia, J., Loh, 2018)
- Measure information via optimal transport theory (Tovar-Lopez, J., 2018)
- Speculations, open problems, etc.
Learning algorithm as a channel
- $S \to$ Algorithm $\to W$
- Input: dataset $S$ of $n$ i.i.d. samples $(X_1, X_2, \ldots, X_n) \sim \mu^n$
- Output: $W$
- Designing the algorithm is equivalent to designing $P_{W|S}$. Very different from channel coding!
Goal of $P_{W|S}$
- Loss function: $\ell : \mathcal{W} \times \mathcal{X} \to \mathbb{R}$
- The best choice is $w^* = \arg\min_{w \in \mathcal{W}} \mathbb{E}_{X \sim \mu}[\ell(w, X)]$
- Can't always get what we want...
- Minimize the empirical loss instead: $\ell_n(w, S) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, X_i)$
Generalization error
- Expected loss (test error) $= \mathbb{E}[\ell(W, X)]$, where $W \sim P_W$ (the marginal induced by $P_{W|S} P_S$) and $X \sim \mu$ is independent of $W$
- Expected empirical loss (train error) $= \mathbb{E}_{P_{WS}}[\ell_n(W, S)]$
- The loss has two parts:
  Expected loss = (Expected loss - Expected empirical loss) + Expected empirical loss = (test error - train error) + train error
- Generalization error = test error - train error:
  $\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}_{P_S \otimes P_W}[\ell_n(W, S)] - \mathbb{E}_{P_{WS}}[\ell_n(W, S)]$
- Ideally, we want both to be small. Often, the two are analyzed separately.
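To make the definitions concrete, here is a minimal Monte Carlo sketch (an illustration, not from the talk) that estimates train error, test error, and their gap for an assumed toy setup: Gaussian data, squared loss, and the empirical-mean learner.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, x):
    # Squared loss l(w, x) = (w - x)^2
    return (w - x) ** 2

def learner(S):
    # A deterministic "channel" P_{W|S}: output the empirical mean of the dataset
    return S.mean()

n, trials = 20, 5000
train_err, test_err = 0.0, 0.0
for _ in range(trials):
    S = rng.normal(0.0, 1.0, size=n)        # dataset S ~ mu^n with mu = N(0, 1)
    W = learner(S)
    train_err += loss(W, S).mean()           # empirical loss l_n(W, S)
    X_fresh = rng.normal(0.0, 1.0, size=n)   # fresh samples from mu, independent of W
    test_err += loss(W, X_fresh).mean()      # estimates E[l(W, X)] for X ~ mu

train_err /= trials
test_err /= trials
print(f"train ~ {train_err:.3f}, test ~ {test_err:.3f}, "
      f"gen gap ~ {test_err - train_err:.3f}")   # roughly 2/n for this toy learner
```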
Basics of mutual information
- Mutual information $I(X;Y)$ precisely quantifies the information between $(X, Y) \sim P_{XY}$: $I(X;Y) = \mathrm{KL}(P_{XY} \| P_X \otimes P_Y)$
- It satisfies two nice properties:
- Data processing inequality: if $X \to Y \to Z$ forms a Markov chain, then $I(X;Y) \ge I(X;Z)$
- Chain rule: $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$
Bounding generalization error using $I(W;S)$
Theorem (Xu & Raginsky, 2017). Assume that $\ell(w, X)$ is $R$-subgaussian for every $w \in \mathcal{W}$. Then the following bound holds:
  $|\mathrm{gen}(\mu, P_{W|S})| \le \sqrt{\frac{2R^2}{n} I(S; W)}$.   (1)
- Data-dependent bounds on generalization error
- If $I(W;S) \le \epsilon$, call $P_{W|S}$ $(\epsilon, \mu)$-stable
- This notion of stability differs from traditional notions
Proof sketch
Lemma (key lemma in Xu & Raginsky, 2017). If $f(X, Y)$ is $\sigma$-subgaussian under $P_X \otimes P_Y$, then
  $|\mathbb{E} f(X, Y) - \mathbb{E} f(\bar{X}, \bar{Y})| \le \sqrt{2\sigma^2 I(X; Y)}$,
where $(X, Y) \sim P_{XY}$ and $(\bar{X}, \bar{Y}) \sim P_X \otimes P_Y$.
- Recall $I(X;Y) = \mathrm{KL}(P_{XY} \| P_X \otimes P_Y)$
- Follows directly from the variational characterization of KL divergence:
  $\mathrm{KL}(\mu \| \nu) = \sup_F \left( \int F \, d\mu - \log \int e^F \, d\nu \right)$
How to use it: key insight
- Figure: update $W_t$ using some update rule to generate $W_{t+1}$
- Many learning algorithms are iterative
- Generate $W_0, W_1, W_2, \ldots, W_T$, and output $W = f(W_0, \ldots, W_T)$. For example, $W = W_T$ or $W = \frac{1}{T} \sum_i W_i$
- Bound $I(W; S)$ by controlling the information gained at each iteration
Noisy, iterative algorithms
- For $t \ge 1$, sample $Z_t$ from $S$ and compute a direction $F(W_{t-1}, Z_t) \in \mathbb{R}^d$
- Move in that direction after scaling by a stepsize $\eta_t$
- Perturb the result by isotropic Gaussian noise $\xi_t \sim N(0, \sigma_t^2 I_d)$
- Overall update equation:
  $W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t, \quad t \ge 1$
- Run for $T$ steps, output $W = f(W_0, \ldots, W_T)$
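A minimal sketch of this family of updates (an illustration, not code from the talk): the direction oracle, the loss, and the uniform mini-batch sampling rule below are assumptions chosen only so the loop is runnable.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_noisy_iterative(S, F, etas, sigmas, w0, rng):
    """Generic noisy iterative algorithm:
        W_t = W_{t-1} - eta_t * F(W_{t-1}, Z_t) + xi_t,  xi_t ~ N(0, sigma_t^2 I_d)."""
    d = w0.shape[0]
    W = [w0]
    for eta, sigma in zip(etas, sigmas):
        Z = S[rng.integers(len(S))]              # sample Z_t from S, ignoring the W_t's
        xi = rng.normal(0.0, sigma, size=d)       # isotropic Gaussian perturbation
        W.append(W[-1] - eta * F(W[-1], Z) + xi)
    return W                                      # output W = f(W_0, ..., W_T), e.g. W[-1]

# Example direction: gradient of the illustrative loss l(w, z) = ||w - z||^2 / 2,
# clipped so that sup ||F(w, z)|| <= L (bounded updates).
L = 1.0
def F(w, z):
    g = w - z
    norm = np.linalg.norm(g)
    return g if norm <= L else (L / norm) * g

S = rng.normal(0.0, 1.0, size=(50, 2))            # dataset of n = 50 points in R^2
T = 100
iterates = run_noisy_iterative(S, F, etas=[0.1] * T, sigmas=[0.05] * T,
                               w0=np.zeros(2), rng=rng)
print("final iterate:", iterates[-1])
```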
Main assumptions
- Update equation: $W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t, \quad t \ge 1$
- Assumption 1: $\ell(w, Z)$ is $R$-subgaussian
- Assumption 2: bounded updates, i.e., $\sup_{w,z} \|F(w, z)\| \le L$
- Assumption 3: sampling is done without looking at the $W_t$'s, i.e., $P(Z_{t+1} \mid Z^{(t)}, W^{(t)}, S) = P(Z_{t+1} \mid Z^{(t)}, S)$
Graphical model
- Figure: graphical model illustrating the Markov properties among the dataset $S$, the samples $Z_1, \ldots, Z_T$, and the iterates $W_0, W_1, \ldots, W_T$
Main result
Theorem (Pensia, J., Loh, 2018). The mutual information satisfies the bound
  $I(S; W) \le \sum_{t=1}^{T} \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right)$.
- Depends on $T$: the longer you optimize, the higher the risk of overfitting
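A small numerical sketch of the theorem's right-hand side (the dimension, Lipschitz constant, and schedules below are assumed values for illustration, not from the talk):

```python
import numpy as np

def mi_bound(d, L, etas, sigmas):
    """Right-hand side of the main result:
       sum_t (d/2) * log(1 + eta_t^2 L^2 / (d sigma_t^2))."""
    etas, sigmas = np.asarray(etas), np.asarray(sigmas)
    return np.sum(0.5 * d * np.log1p(etas**2 * L**2 / (d * sigmas**2)))

# Assumed setting: d = 10, L = 1, constant stepsize and noise, T = 1000 steps.
d, L, T = 10, 1.0, 1000
etas = np.full(T, 0.01)
sigmas = np.full(T, 0.1)
print("I(S;W) bound (nats):", mi_bound(d, L, etas, sigmas))  # grows linearly in T here
```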
Implications for $\mathrm{gen}(\mu, P_{W|S})$
Corollary (bound in expectation). The generalization error of our class of iterative algorithms is bounded by
  $\mathrm{gen}(\mu, P_{W|S}) \le \sqrt{\frac{R^2}{n} \sum_{t=1}^{T} \frac{\eta_t^2 L^2}{\sigma_t^2}}$.
Corollary (high-probability bound). Let $\epsilon = \sum_{t=1}^{T} \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right)$. For any $\alpha > 0$ and $0 < \beta \le 1$, if $n > \frac{8R^2}{\alpha^2}\left(\frac{\epsilon}{\beta} + \log\left(\frac{2}{\beta}\right)\right)$, we have
  $P_{S,W}\left( |L_\mu(W) - L_S(W)| > \alpha \right) \le \beta$,   (2)
where the probability is with respect to $S \sim \mu^n$ and $W$.
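A sketch of how the two corollaries could be evaluated numerically; all constants and schedules are illustrative assumptions, and $\epsilon$ is recomputed inline as in the main result above.

```python
import numpy as np

def gen_expectation_bound(R, L, n, etas, sigmas):
    # Corollary (bound in expectation): sqrt((R^2/n) * sum_t eta_t^2 L^2 / sigma_t^2)
    etas, sigmas = np.asarray(etas), np.asarray(sigmas)
    return np.sqrt(R**2 / n * np.sum(etas**2 * L**2 / sigmas**2))

def sample_size_for_high_prob(R, alpha, beta, eps):
    # Corollary (high-probability bound): smallest n exceeding (8R^2/alpha^2)(eps/beta + log(2/beta))
    return int(np.ceil(8 * R**2 / alpha**2 * (eps / beta + np.log(2 / beta)))) + 1

R, L, d, n, T = 1.0, 1.0, 10, 10_000, 1000
etas = np.full(T, 0.01)
sigmas = np.full(T, 0.1)

print("bound on gen in expectation:", gen_expectation_bound(R, L, n, etas, sigmas))
eps = np.sum(0.5 * d * np.log1p(etas**2 * L**2 / (d * sigmas**2)))
print("n needed for P(|L_mu - L_S| > 0.1) <= 0.05:",
      sample_size_for_high_prob(R, alpha=0.1, beta=0.05, eps=eps))
```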
Applications: SGLD
- SGLD iterates: $W_{t+1} = W_t - \eta_t \nabla_w \ell(W_t, Z_t) + \xi_t$, with $\xi_t \sim N(0, \sigma_t^2 I_d)$
- Common experimental practices for SGLD [Welling & Teh, 2011]:
  1. the noise variance is $\sigma_t^2 = \eta_t$,
  2. the algorithm is run for $K$ epochs, i.e., $T = nK$,
  3. for a constant $c > 0$, the stepsizes are $\eta_t = c/t$.
- Expectation bound: using $\sum_{t=1}^{T} \frac{1}{t} \le \log(T) + 1$,
  $\mathrm{gen}(\mu, P_{W|S}) \le \frac{RL}{\sqrt{n}} \sqrt{\sum_{t=1}^{T} \eta_t} \le \frac{RL}{\sqrt{n}} \sqrt{c \log T + c}$
- The best known bounds, by Mou et al. (2017), are $O(1/n)$, but our bounds are more general
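A minimal SGLD sketch following these practices ($\sigma_t^2 = \eta_t$, $\eta_t = c/t$, $K$ epochs); the loss, dataset, and constants are assumptions for illustration, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_loss(w, z):
    # Gradient of the illustrative squared loss l(w, z) = ||w - z||^2 / 2
    return w - z

def sgld(S, w0, c=0.1, epochs=5, rng=rng):
    """SGLD with the practices above: eta_t = c/t, noise variance sigma_t^2 = eta_t,
    run for K epochs so that T = n*K steps in total."""
    n, d = S.shape
    w, t = w0.copy(), 0
    for _ in range(epochs):
        for i in rng.permutation(n):          # one pass over the data = one epoch
            t += 1
            eta = c / t
            w = w - eta * grad_loss(w, S[i]) + rng.normal(0.0, np.sqrt(eta), size=d)
    return w

n, d = 100, 2
S = rng.normal(0.0, 1.0, size=(n, d))
w_hat = sgld(S, w0=np.zeros(d))
print("SGLD output:", w_hat, " (empirical minimizer is the mean:", S.mean(axis=0), ")")
```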
Application: Perturbed SGD
- Noisy versions of SGD have been proposed to escape saddle points: Ge et al. (2015), Jin et al. (2017)
- Similar to SGLD, but with a different noise distribution:
  $W_t = W_{t-1} - \eta \left( \nabla_w \ell(W_{t-1}, Z_t) + \xi_t \right)$, where $\xi_t \sim \mathrm{Unif}(B_d)$ (the unit ball in $\mathbb{R}^d$)
- Our bound: $I(W; S) \le T d \log(1 + L)$
- Bounds in expectation and with high probability follow directly from this bound
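A sketch of the perturbed-SGD update with uniform-ball noise (illustrative assumptions throughout; the ball sampler uses the standard Gaussian-direction, radius-rescaling trick):

```python
import numpy as np

rng = np.random.default_rng(3)

def unif_ball(d, rng):
    # Uniform sample from the unit ball B_d: uniform direction, radius ~ U^(1/d)
    g = rng.normal(size=d)
    direction = g / np.linalg.norm(g)
    return rng.uniform() ** (1.0 / d) * direction

def perturbed_sgd(S, w0, grad, eta=0.05, T=500, rng=rng):
    """Perturbed SGD: W_t = W_{t-1} - eta * (grad l(W_{t-1}, Z_t) + xi_t), xi_t ~ Unif(B_d)."""
    n, d = S.shape
    w = w0.copy()
    for _ in range(T):
        z = S[rng.integers(n)]
        w = w - eta * (grad(w, z) + unif_ball(d, rng))
    return w

S = rng.normal(0.0, 1.0, size=(100, 2))
grad = lambda w, z: w - z          # gradient of the illustrative loss ||w - z||^2 / 2
print("Perturbed SGD output:", perturbed_sgd(S, np.zeros(2), grad))
```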
Application: Noisy momentum
- A modified version of stochastic gradient Hamiltonian Monte Carlo, Chen et al. (2014):
  $V_t = \gamma_t V_{t-1} + \eta_t \nabla_w \ell(W_{t-1}, Z_t) + \xi_t$,
  $W_t = W_{t-1} - \gamma_t V_{t-1} - \eta_t \nabla_w \ell(W_{t-1}, Z_t) + \xi_t$,
- The difference is the addition of noise to the velocity term $V_t$
- Treating $(V_t, W_t)$ as a single parameter gives
  $I(S; W) \le \sum_{t=1}^{T} \frac{2d}{2} \log\left(1 + \frac{2\eta_t^2 L^2}{2d \sigma_t^2}\right)$
- The same bound also holds for a noisy version of Nesterov's accelerated gradient descent method (1983)
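A sketch of the noisy momentum update as written above; the sign conventions and the use of independent Gaussian perturbations in the two lines are assumptions in this reconstruction, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def noisy_momentum(S, w0, grad, gamma=0.9, eta=0.05, sigma=0.05, T=500, rng=rng):
    """Noisy momentum as reconstructed above:
       V_t = gamma * V_{t-1} + eta * grad + xi_t
       W_t = W_{t-1} - gamma * V_{t-1} - eta * grad + xi'_t,
    with xi_t, xi'_t ~ N(0, sigma^2 I_d) drawn independently here (an assumption)."""
    n, d = S.shape
    w, v = w0.copy(), np.zeros(d)
    for _ in range(T):
        z = S[rng.integers(n)]
        step = gamma * v + eta * grad(w, z)       # gamma * V_{t-1} + eta * grad, using old v
        v = step + rng.normal(0.0, sigma, size=d)
        w = w - step + rng.normal(0.0, sigma, size=d)
    return w

S = rng.normal(0.0, 1.0, size=(100, 2))
grad = lambda w, z: w - z      # gradient of the illustrative loss ||w - z||^2 / 2
print("Noisy momentum output:", noisy_momentum(S, np.zeros(2), grad))
```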
Proof sketch
- Lots of Markov chains!
- $I(W; S) \le I(W_0^T; Z_1^T)$ because $S \to Z_1^T \to W_0^T \to W$ (data processing inequality)
- The iterative structure gives the Markov chain $W_0 \to (Z_1, W_1) \to (Z_2, W_2) \to (Z_3, W_3) \to \cdots \to W_T$
- Use Markovity with the chain rule to get
  $I(Z_1^T; W_0^T) = \sum_{t=1}^{T} I(Z_t; W_t \mid W_{t-1})$
- Bottom line: bound the one-step information between $W_t$ and $Z_t$
Proof sketch
- Recall $W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t$
- Using the entropy form of mutual information,
  $I(W_t; Z_t \mid W_{t-1}) = h(W_t \mid W_{t-1}) - h(W_t \mid W_{t-1}, Z_t)$,
  where $\mathrm{Var}(W_t \mid W_{t-1} = w_{t-1}) \le \eta_t^2 L^2 + \sigma_t^2$ controls the first term and $h(W_t \mid W_{t-1}, Z_t) = h(\xi_t)$
- The Gaussian distribution maximizes entropy for a fixed variance, giving
  $I(W_t; Z_t \mid W_{t-1}) \le \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right)$
What's wrong with mutual information?
- Mutual information is great, but...
- If $\mu$ is not absolutely continuous w.r.t. $\nu$, then $\mathrm{KL}(\mu \| \nu) = +\infty$
- In many cases the mutual information $I(W; S)$ shoots to infinity
- So the bounds cannot be used for stochastic gradient descent (SGD) :(
- Noisy algorithms are essential for using mutual-information-based bounds
Wasserstein metric
- The Wasserstein distance is given by
  $W_p(\mu, \nu) = \inf_{P_{XY} \in \Pi(\mu, \nu)} \left( \mathbb{E} \|X - Y\|^p \right)^{1/p}$,
  where $\Pi(\mu, \nu)$ is the set of couplings whose marginals are $\mu$ and $\nu$
$W_p$ for $p = 1$ and $2$
- $W_1$ is also called the Earth Mover's distance or the Kantorovich-Rubinstein distance:
  $W_1(\mu, \nu) = \sup \left\{ \int f \, (d\mu - d\nu) : f \text{ continuous and 1-Lipschitz} \right\}$
- Lots of fascinating theory for $W_2$ (see Topics in Optimal Transportation by Cédric Villani)
- The optimal coupling in $\Pi(\mu, \nu)$ is induced by a map $T$ such that $T \# \mu = \nu$
- For $\mu$ and $\nu$ on $\mathbb{R}$,
  $W_2^2(\mu, \nu) = \int |F^{-1}(x) - G^{-1}(x)|^2 \, dx$,
  where $F$ and $G$ are the cdfs of $\mu$ and $\nu$
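A quick sketch of the one-dimensional quantile formula on empirical samples (an illustration, not from the talk); for equal-size samples the empirical $W_2$ reduces to matching sorted points.

```python
import numpy as np

rng = np.random.default_rng(5)

def w2_empirical_1d(x, y):
    """Empirical W_2 between two equal-size samples on R: the optimal coupling
    matches sorted points, which discretizes int |F^{-1}(u) - G^{-1}(u)|^2 du."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

# Two Gaussians shifted by m: W_2(N(0,1), N(m,1)) = |m| exactly.
m = 2.0
x = rng.normal(0.0, 1.0, size=100_000)
y = rng.normal(m, 1.0, size=100_000)
print("empirical W_2:", w2_empirical_1d(x, y), " (closed form:", abs(m), ")")
```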
Wasserstein bounds on $\mathrm{gen}(\mu, P_{W|S})$
- Assumption: $\ell(w, x)$ is Lipschitz in $x$ for each fixed $w$, i.e., $|\ell(w, x_1) - \ell(w, x_2)| \le L \|x_1 - x_2\|_p$
Theorem (Tovar-Lopez & J., 2018). If $\ell(w, \cdot)$ is $L$-Lipschitz in $\|\cdot\|_p$, the generalization error satisfies the following bound:
  $\mathrm{gen}(\mu, P_{W|S}) \le \frac{L}{n^{1/p}} \left( \int_{\mathcal{W}} W_p^p(P_S, P_{S|W=w}) \, dP_W(w) \right)^{1/p}$
- This measures the average separation of $P_{S|W}$ from $P_S$ (it looks like a $p$-th moment in the space of distributions)
Wasserstein and KL
Definition. We say $\mu$ satisfies a $T_p(c)$ transportation inequality with constant $c > 0$ if for all $\nu$ we have
  $W_p(\mu, \nu) \le \sqrt{2c \, \mathrm{KL}(\nu \| \mu)}$
- Example: the standard normal satisfies a $T_2(1)$ inequality
- Transportation inequalities are used to show concentration phenomena
- For $p \in [1, 2]$ this inequality tensorizes! This means $\mu^n$ satisfies the inequality $T_p(c n^{2/p - 1})$
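A small numerical check of the $T_2(1)$ example against Gaussian choices of $\nu$, where both sides have closed forms (this particular comparison is an illustration, not from the talk):

```python
import numpy as np

def w2_gaussians_1d(m1, s1, m2, s2):
    # Closed form: W_2(N(m1, s1^2), N(m2, s2^2))^2 = (m1 - m2)^2 + (s1 - s2)^2
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def kl_gaussian_vs_standard(m, s):
    # KL(N(m, s^2) || N(0, 1)) = 0.5 * (s^2 + m^2 - 1 - log s^2)
    return 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))

# T_2(1) for the standard normal: W_2(mu, nu) <= sqrt(2 * KL(nu || mu)) for every nu.
for m, s in [(1.0, 1.0), (0.5, 2.0), (-3.0, 0.3)]:
    lhs = w2_gaussians_1d(0.0, 1.0, m, s)
    rhs = np.sqrt(2.0 * kl_gaussian_vs_standard(m, s))
    print(f"nu = N({m}, {s}^2): W_2 = {lhs:.3f} <= sqrt(2 KL) = {rhs:.3f}")
```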
Comparison to $I(W; S)$
- In general, the two bounds are not comparable
- If $\mu$ satisfies a $T_2(c)$ transportation inequality, we can compare directly:
Theorem (Tovar-Lopez & J., 2018). Suppose $p = 2$; then $W_2(P_S, P_{S|W}) \le \sqrt{2c \, \mathrm{KL}(P_{S|W} \| P_S)}$, and so
  $\frac{L}{n^{1/2}} \left( \int_{\mathcal{W}} W_2^2(P_S, P_{S|W=w}) \, dP_W(w) \right)^{1/2} \le L \sqrt{\frac{2c}{n} I(S; W)}$
- In particular, for Gaussian data the Wasserstein bound is strictly stronger
Comparison to $I(W; S)$
- If $\mu$ satisfies a $T_1(c)$ transportation inequality:
Theorem (Tovar-Lopez & J., 2018). Suppose $p = 1$; then $W_1(P_S, P_{S|W}) \le \sqrt{2cn \, \mathrm{KL}(P_{S|W} \| P_S)}$, and so
  $\frac{L}{n} \int_{\mathcal{W}} W_1(P_S, P_{S|W=w}) \, dP_W(w) \le L \sqrt{\frac{2c}{n} I(S; W)}$
Coupling based bound on $\mathrm{gen}(\mu, P_{W|S})$
- Recall the generalization error expression:
  $\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}\, \ell_n(\bar{S}, \bar{W}) - \mathbb{E}\, \ell_n(S, W)$,
  where $(\bar{S}, \bar{W}) \sim P_S \otimes P_W$ and $(S, W) \sim P_{WS}$
- Key insight: any coupling of $(\bar{S}, \bar{W}, S, W)$ that has the correct marginals on $(S, W)$ and $(\bar{S}, \bar{W})$ leads to the same expected value above
Proof sketch
- We have
  $\mathrm{gen}(\mu, P_{W|S}) = \int \ell_n(\bar{s}, \bar{w}) \, dP_{\bar{S}\bar{W}} - \int \ell_n(s, w) \, dP_{SW} = \mathbb{E}_{SW\bar{S}\bar{W}} \left[ \ell_n(\bar{S}, \bar{W}) - \ell_n(S, W) \right]$
- Pick $\bar{W} = W$ and use the Lipschitz property in $x$
- Pick the optimal joint distribution $P_{S, \bar{S} \mid W}$ to minimize the bound
Speculations: Forward and backward channels
- Stability: how much does $W$ change when $S$ changes a little?
- A property of the forward channel $P_{W|S}$
- Generalization: how much does $S$ change when $W$ changes a little?
- A property of the backward channel $P_{S|W}$
- Pre-process data to deliberately make the backward channel noisy (data augmentation, smoothing, etc.)
Speculations: Relation to rate-distortion theory
- Rate-distortion theory is the branch of information theory dealing with lossy data compression:
  $\min_{P_{Y|X}} \mathbb{E}\, d(X, Y)$ subject to $I(X; Y) \le R$
- Here: minimize the distortion given by $\ell_n(W, S)$ subject to the mutual information constraint $I(W; S) \le \epsilon$
- Existing theory applies to separable distortions $d(x^n, y^n) = \sum_i d(x_i, y_i)$; however, we have $\ell(w, x^n) := \sum_i \ell(w, x_i)$
- Essentially the same problem, but the connections are still unclear
Open problems
- Evaluating the Wasserstein bounds in specific cases, in particular for SGD
- Information-theoretic lower bounds on generalization error?
- The Wasserstein bounds rely on a new notion of information: $I_W(X, Y) = W(P_X \otimes P_Y, P_{XY})$
- Does it satisfy a chain rule? A data processing inequality?
Thank you!