Information theoretic perspectives on learning algorithms


1 Information theoretic perspectives on learning algorithms
Varun Jog
University of Wisconsin - Madison, Departments of ECE and Mathematics
Shannon Channel Hangout! May 8, 2018
Jointly with Adrian Tovar-Lopez (Math), Ankit Pensia (CS), Po-Ling Loh (Stats)

2–3 Curve fitting
Figure: Given n points in R², fit a curve.
Forward problem: from dataset to curve.

4–7 Finding the right fit
Left is a good fit; right is overfit:
Too wiggly.
Not stable.

8–12 Guessing points from curve
Figure: Given the curve, find the n points.
Backward problem: from curve to dataset.
The backward problem is easier for the overfitted curve!
The overfitted curve contains more information about the dataset.

13–16 This talk
Explore the connection between information and overfitting (Xu & Raginsky, 2017)
Analyze generalization error in a large and general class of learning algorithms (Pensia, J., Loh, 2018)
Measure information via optimal transport theory (Tovar-Lopez, J., 2018)
Speculations, open problems, etc.

17–18 Learning algorithm as a channel
S → Algorithm → W
Input: dataset S of n i.i.d. samples (X_1, X_2, ..., X_n) ∼ µ^n
Output: W
Designing the algorithm is equivalent to designing P_{W|S}. Very different from channel coding!

19–22 Goal of P_{W|S}
Loss function: ℓ : W × X → R
The best choice is w* = argmin_{w ∈ W} E_{X∼µ}[ℓ(w, X)]
Can't always get what we want...
Minimize the empirical loss instead:
ℓ_n(w, S) = (1/n) Σ_{i=1}^n ℓ(w, X_i)

23–27 Generalization error
Expected loss (test error): E_{X∼µ, W∼P_{W|S}P_S} ℓ(W, X)
Expected empirical loss (train error): E_{P_{WS}} ℓ_n(W, S)
The loss has two parts:
Expected loss = (expected loss − expected empirical loss) + expected empirical loss
             = (test error − train error) + train error
Generalization error = test error − train error:
gen(µ, P_{W|S}) = E_{P_S ⊗ P_W} ℓ_n(W, S) − E_{P_{WS}} ℓ_n(W, S)
Ideally, we want both terms small; often the two are analyzed separately.
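To make the test-minus-train decomposition concrete, here is a minimal numerical sketch (not from the talk) with squared loss: polynomials of low and high degree are fit to noisy samples of a sine curve, echoing the curve-fitting example above. The function name, target, and constants are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_error(degree, n_train=30, n_test=10_000, noise=0.3):
    """Test error minus train error for a degree-`degree` polynomial fit (squared loss)."""
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = np.sin(2 * np.pi * x_tr) + noise * rng.normal(size=n_train)
    x_te = rng.uniform(0, 1, n_test)
    y_te = np.sin(2 * np.pi * x_te) + noise * rng.normal(size=n_test)

    coeffs = np.polyfit(x_tr, y_tr, degree)          # empirical risk minimizer
    train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    return test - train

print(gen_error(degree=3), gen_error(degree=15))     # the overfit curve typically has the larger gap
```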

28–31 Basics of mutual information
Mutual information I(X;Y) precisely quantifies the information between (X, Y) ∼ P_{XY}:
I(X;Y) = KL(P_{XY} ‖ P_X ⊗ P_Y)
It satisfies two nice properties:
Data processing inequality: if X → Y → Z, then I(X;Y) ≥ I(X;Z)
Chain rule: I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1)
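As a sanity check on the definition I(X;Y) = KL(P_{XY} ‖ P_X ⊗ P_Y), the following sketch evaluates it for a small discrete joint pmf chosen arbitrarily for illustration.

```python
import numpy as np

# Joint pmf of (X, Y) on a 2x2 alphabet (an arbitrary example, not from the talk).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of X, shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of Y, shape (1, 2)

# I(X;Y) = KL(P_XY || P_X x P_Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)   # about 0.193 nats
```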

32–35 Bounding generalization error using I(W;S)
Theorem (Xu & Raginsky, 2017). Assume that ℓ(w, X) is R-subgaussian for every w ∈ W. Then
|gen(µ, P_{W|S})| ≤ √( (2R²/n) I(S;W) ).
This gives data-dependent bounds on generalization error.
If I(W;S) ≤ ε, call P_{W|S} (ε, µ)-stable.
This notion of stability is different from traditional notions.
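A small sketch of how the right-hand side of the theorem might be evaluated; the values of R, n, and I(S;W) below are placeholders, not numbers from the talk.

```python
import math

def xu_raginsky_bound(R, n, mi):
    """Right-hand side of the theorem: sqrt(2 R^2 I(S;W) / n)."""
    return math.sqrt(2 * R**2 * mi / n)

# Placeholder numbers: R = 1 (subgaussian constant), n = 10_000 samples,
# and I(S;W) = 5 nats retained by the output about the dataset.
print(xu_raginsky_bound(R=1.0, n=10_000, mi=5.0))   # about 0.032
```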

36–39 Proof sketch
Lemma (key lemma in Xu & Raginsky, 2017). If f(X, Y) is σ-subgaussian under P_X ⊗ P_Y, then
|E f(X, Y) − E f(X̄, Ȳ)| ≤ √( 2σ² I(X;Y) ),
where (X, Y) ∼ P_{XY} and (X̄, Ȳ) ∼ P_X ⊗ P_Y.
Recall I(X;Y) = KL(P_{XY} ‖ P_X ⊗ P_Y).
The lemma follows directly from the variational (Donsker–Varadhan) characterization of KL divergence:
KL(µ ‖ ν) = sup_F ( ∫ F dµ − log ∫ e^F dν ).

40–43 How to use it: key insight
Figure: Update W_t using some update rule to generate W_{t+1}.
Many learning algorithms are iterative.
They generate W_0, W_1, W_2, ..., W_T and output W = f(W_0, ..., W_T); for example, W = W_T or W = (1/T) Σ_i W_i.
Bound I(W;S) by controlling the information revealed at each iteration.

44–48 Noisy, iterative algorithms
For t ≥ 1, sample Z_t from S and compute a direction F(W_{t−1}, Z_t) ∈ R^d.
Move in that direction after scaling by a stepsize η_t.
Perturb by isotropic Gaussian noise ξ_t ∼ N(0, σ_t² I_d).
Overall update equation:
W_t = W_{t−1} − η_t F(W_{t−1}, Z_t) + ξ_t,  t ≥ 1.
Run for T steps and output W = f(W_0, ..., W_T). (A sketch of this loop is given below.)
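The loop below is a minimal sketch of this generic noisy iterative scheme; the function and variable names are mine, and F could be, for instance, a per-sample gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_iterative_run(S, F, etas, sigmas, d, w0=None):
    """Generic noisy iterative scheme: W_t = W_{t-1} - eta_t * F(W_{t-1}, Z_t) + xi_t.

    S: dataset (indexable collection of samples); F(w, z): update direction in R^d;
    etas, sigmas: stepsizes and noise standard deviations, one per iteration.
    """
    W = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float)
    trajectory = [W]
    for eta, sigma in zip(etas, sigmas):
        Z = S[rng.integers(len(S))]                # sample a data point from S
        xi = rng.normal(0.0, sigma, size=d)        # isotropic Gaussian perturbation
        W = W - eta * np.asarray(F(W, Z)) + xi
        trajectory.append(W)
    return trajectory                              # output W = f(W_0, ..., W_T)
```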

49–52 Main assumptions
Update equation: W_t = W_{t−1} − η_t F(W_{t−1}, Z_t) + ξ_t, t ≥ 1.
Assumption 1: ℓ(w, Z) is R-subgaussian.
Assumption 2: bounded updates, i.e. sup_{w,z} ‖F(w, z)‖ ≤ L.
Assumption 3: sampling is done without looking at the W_t's, i.e. P(Z_{t+1} | Z^(t), W^(t), S) = P(Z_{t+1} | Z^(t), S).

53 Graphical model
Figure: Graphical model illustrating the Markov properties among S, Z_1, ..., Z_T, and W_0, W_1, ..., W_T in the algorithm.

54–56 Main result
Theorem (Pensia, J., Loh, 2018). The mutual information satisfies the bound
I(S;W) ≤ Σ_{t=1}^T (d/2) log(1 + η_t² L² / (d σ_t²)).
The bound depends on T: the longer you optimize, the higher the risk of overfitting.

57–59 Implications for gen(µ, P_{W|S})
Corollary (bound in expectation). The generalization error of this class of iterative algorithms is bounded by
|gen(µ, P_{W|S})| ≤ √( (R²/n) Σ_{t=1}^T η_t² L² / σ_t² ).
Corollary (high-probability bound). Let ε = Σ_{t=1}^T (d/2) log(1 + η_t² L² / (d σ_t²)). For any α > 0 and 0 < β ≤ 1, if
n > (8R²/α²) ( ε/β + log(2/β) ),
then P_{S,W}( |L_µ(W) − L_S(W)| > α ) ≤ β, where the probability is with respect to S ∼ µ^n and W. (A numerical sketch of both bounds follows below.)
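A small sketch (names and numbers mine) that evaluates ε and the expectation bound for a given schedule of stepsizes and noise levels:

```python
import math

def info_bound(etas, sigmas, d, L):
    """epsilon = sum_t (d/2) log(1 + eta_t^2 L^2 / (d sigma_t^2))  (main theorem)."""
    return sum(0.5 * d * math.log(1 + (eta**2 * L**2) / (d * sigma**2))
               for eta, sigma in zip(etas, sigmas))

def gen_expectation_bound(etas, sigmas, R, L, n):
    """Corollary: sqrt( (R^2 / n) * sum_t eta_t^2 L^2 / sigma_t^2 )."""
    return math.sqrt(R**2 / n * sum((eta * L)**2 / sigma**2
                                    for eta, sigma in zip(etas, sigmas)))

# Example schedule (arbitrary): 100 steps with constant stepsize and noise level.
etas, sigmas = [0.01] * 100, [0.1] * 100
print(info_bound(etas, sigmas, d=10, L=1.0),
      gen_expectation_bound(etas, sigmas, R=1.0, L=1.0, n=10_000))
```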

60–66 Applications: SGLD
SGLD iterates are W_{t+1} = W_t − η_t ∇_w ℓ(W_t, Z_t) + ξ_t, with ξ_t ∼ N(0, σ_t² I_d).
Common experimental practices for SGLD [Welling & Teh, 2011]:
1. the noise variance is σ_t² = η_t,
2. the algorithm is run for K epochs, i.e. T = nK,
3. for a constant c > 0, the stepsizes are η_t = c/t.
Expectation bound: using Σ_{t=1}^T 1/t ≤ log T + 1,
gen(µ, P_{W|S}) ≤ R L √( (1/n) Σ_{t=1}^T η_t ) ≤ R L √( (c log T + c)/n ).
(A numerical check of this schedule is given after this slide group.)
The best known bounds, by Mou et al. (2017), are O(1/n), but our bounds are more general.
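A quick numerical check (with arbitrary constants) that the closed form RL√((c log T + c)/n) indeed upper-bounds the exact sum for the schedule η_t = c/t, σ_t² = η_t:

```python
import math

# SGLD schedule from the slide: eta_t = c/t, sigma_t^2 = eta_t, T = n*K steps.
n, K, c, R, L = 10_000, 5, 0.1, 1.0, 1.0
T = n * K
etas = [c / t for t in range(1, T + 1)]

# With sigma_t^2 = eta_t, the corollary's sum collapses to L^2 * sum_t eta_t,
# and sum_t 1/t <= log T + 1 gives the closed form on the slide.
bound_exact = math.sqrt(R**2 * L**2 / n * sum(etas))
bound_closed = R * L * math.sqrt((c * math.log(T) + c) / n)
print(bound_exact, bound_closed)    # the closed form upper-bounds the exact sum
```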

67–70 Application: Perturbed SGD
Noisy versions of SGD have been proposed to escape saddle points: Ge et al. (2015), Jin et al. (2017).
Similar to SGLD, but with a different noise distribution:
W_t = W_{t−1} − η ( ∇_w ℓ(W_{t−1}, Z_t) + ξ_t ),  where ξ_t ∼ Unif(B_d) (the unit ball in R^d).
Our bound: I(W;S) ≤ T d log(1 + L).
Bounds in expectation and with high probability follow directly from this bound. (A sketch of one update is given below.)
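A minimal sketch of one perturbed-SGD step; it assumes the standard trick for sampling uniformly from the unit ball (Gaussian direction, radius U^(1/d)), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def unif_ball(d):
    """Uniform sample from the unit ball B_d: Gaussian direction, radius U^(1/d)."""
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    return v * rng.uniform() ** (1.0 / d)

def perturbed_sgd_step(w, grad, z, eta):
    """One update W_t = W_{t-1} - eta * (grad_w loss(W_{t-1}, Z_t) + xi_t), xi_t ~ Unif(B_d)."""
    return w - eta * (np.asarray(grad(w, z)) + unif_ball(len(w)))
```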

71–74 Application: Noisy momentum
A modified version of stochastic gradient Hamiltonian Monte Carlo, Chen et al. (2014):
V_t = γ_t V_{t−1} + η_t ∇_w ℓ(W_{t−1}, Z_t) + ξ_t,
W_t = W_{t−1} − γ_t V_{t−1} − η_t ∇_w ℓ(W_{t−1}, Z_t) + ξ_t.
The difference is the addition of noise to the velocity term V_t.
Treating (V_t, W_t) as a single parameter gives
I(S;W) ≤ Σ_{t=1}^T (2d/2) log(1 + 2 η_t² L² / (2d σ_t²)).
The same bound also holds for a noisy version of Nesterov's accelerated gradient descent method (1983).

75–79 Proof sketch
Lots of Markov chains!
I(W;S) ≤ I(W_0^T; Z_1^T) because S → Z_1^T → W_0^T → W (data processing inequality).
The iterative structure gives the chain W_0 → Z_1 → W_1 → Z_2 → W_2 → Z_3 → W_3 → ⋯ → W_T.
Using Markovity with the chain rule,
I(Z_1^T; W_0^T) = Σ_{t=1}^T I(Z_t; W_t | W_{t−1}).
Bottom line: bound the one-step information between W_t and Z_t.

80–82 Proof sketch
Recall W_t = W_{t−1} − η_t F(W_{t−1}, Z_t) + ξ_t.
Using the entropy form of mutual information,
I(W_t; Z_t | W_{t−1}) = h(W_t | W_{t−1}) − h(W_t | W_{t−1}, Z_t),
where h(W_t | W_{t−1}, Z_t) = h(ξ_t) and Var(W_t | W_{t−1} = w_{t−1}) ≤ η_t² L² + σ_t².
The Gaussian distribution maximizes entropy for a fixed variance, giving
I(W_t; Z_t | W_{t−1}) ≤ (d/2) log(1 + η_t² L² / (d σ_t²)).
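For completeness, here is a worked version of the one-step bound; the variance-splitting step is my reconstruction of the argument sketched above, not a verbatim transcription of the slide.

```latex
\begin{align*}
  I(W_t; Z_t \mid W_{t-1})
    &= h(W_t \mid W_{t-1}) - h(W_t \mid W_{t-1}, Z_t)
     = h(W_t \mid W_{t-1}) - h(\xi_t) \\
    &\le \frac{d}{2}\log\!\Big(2\pi e\Big(\frac{\eta_t^2 L^2}{d} + \sigma_t^2\Big)\Big)
       - \frac{d}{2}\log\!\big(2\pi e\,\sigma_t^2\big)
     = \frac{d}{2}\log\!\Big(1 + \frac{\eta_t^2 L^2}{d\,\sigma_t^2}\Big).
\end{align*}
% The inequality spreads the total conditional variance eta_t^2 L^2 + d sigma_t^2
% evenly over the d coordinates (the entropy-maximizing allocation) and uses the fact
% that the Gaussian maximizes differential entropy for a fixed variance.
```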

83–87 What's wrong with mutual information
Mutual information is great, but...
If µ is not absolutely continuous w.r.t. ν, then KL(µ ‖ ν) = +∞.
There are many cases where the mutual information I(W;S) shoots to infinity.
In particular, the bounds cannot be used for (noiseless) stochastic gradient descent (SGD) :(
Noisy algorithms are essential for using mutual-information-based bounds.

88–89 Wasserstein metric
The Wasserstein distance is given by
W_p(µ, ν) = ( inf_{P_{XY} ∈ Π(µ,ν)} E ‖X − Y‖^p )^{1/p},
where Π(µ, ν) is the set of couplings whose marginals are µ and ν.

90–93 W_p for p = 1 and 2
W_1 is also called the Earth Mover's distance or the Kantorovich–Rubinstein distance:
W_1(µ, ν) = sup{ ∫ f (dµ − dν) : f continuous and 1-Lipschitz }.
There is a lot of fascinating theory¹ for W_2.
The optimal coupling in Π(µ, ν) is induced by a map T such that T#µ = ν.
For µ and ν on R,
W_2²(µ, ν) = ∫_0^1 |F^{−1}(x) − G^{−1}(x)|² dx,
where F and G are the cdfs of µ and ν. (A small numerical sketch of the 1-D case is given below.)
¹ Topics in Optimal Transportation by Cédric Villani.
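A small numerical sketch of the 1-D case, using SciPy's wasserstein_distance for W_1 and the quantile formula above for W_2; the two sample distributions are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 5000)       # samples from mu (arbitrary choice)
y = rng.normal(0.5, 1.5, 5000)       # samples from nu (arbitrary choice)

# W_1 via SciPy's 1-D earth mover's distance.
w1 = wasserstein_distance(x, y)

# W_2 via the quantile formula W_2^2 = int_0^1 |F^{-1}(u) - G^{-1}(u)|^2 du,
# approximated on an evenly spaced grid of quantile levels.
u = (np.arange(2000) + 0.5) / 2000
w2 = np.sqrt(np.mean((np.quantile(x, u) - np.quantile(y, u)) ** 2))
print(w1, w2)
```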

94–96 Wasserstein bounds on gen(µ, P_{W|S})
Assumption: ℓ(w, x) is Lipschitz in x for each fixed w, i.e. |ℓ(w, x_1) − ℓ(w, x_2)| ≤ L ‖x_1 − x_2‖_p.
Theorem (Tovar-Lopez & J., 2018). If ℓ(w, ·) is L-Lipschitz in ‖·‖_p, the generalization error satisfies
|gen(µ, P_{W|S})| ≤ (L / n^{1/p}) ( ∫_W W_p^p(P_S, P_{S|W=w}) dP_W(w) )^{1/p}.
This measures the average separation of P_{S|W} from P_S (it looks like a p-th moment in the space of distributions).

97–100 Wasserstein and KL
Definition. We say µ satisfies a T_p(c) transportation inequality with constant c > 0 if for all ν,
W_p(µ, ν) ≤ √( 2c KL(ν ‖ µ) ).
Example: the standard normal satisfies the T_2(1) inequality.
Transport inequalities are used to show concentration phenomena.
For p ∈ [1, 2] this inequality tensorizes: µ^n satisfies T_p(c n^{2/p − 1}).

101–103 Comparison to I(W;S)
In general, the two bounds are not comparable.
If µ satisfies a T_2(c) transportation inequality, we can compare directly:
Theorem (Tovar-Lopez & J., 2018). Suppose p = 2. Then W_2(P_S, P_{S|W}) ≤ √( 2c KL(P_{S|W} ‖ P_S) ), and so
(L / n^{1/2}) ( ∫_W W_2²(P_S, P_{S|W=w}) dP_W(w) )^{1/2} ≤ L √( (2c/n) I(S;W) ).
In particular, for Gaussian data the Wasserstein bound is strictly stronger.

104–105 Comparison to I(W;S)
If µ satisfies a T_1(c) transportation inequality:
Theorem (Tovar-Lopez & J., 2018). Suppose p = 1. Then W_1(P_S, P_{S|W}) ≤ √( 2cn KL(P_{S|W} ‖ P_S) ), and so
(L/n) ∫_W W_1(P_S, P_{S|W=w}) dP_W(w) ≤ L √( (2c/n) I(S;W) ).

106–107 Coupling-based bound on gen(µ, P_{W|S})
Recall the generalization error expression:
gen(µ, P_{W|S}) = E ℓ_n(S̄, W̄) − E ℓ_n(S, W),
where (S̄, W̄) ∼ P_S ⊗ P_W and (S, W) ∼ P_{WS}.
Key insight: any coupling of (S̄, W̄, S, W) that has the correct marginals on (S, W) and (S̄, W̄) leads to the same expected value above.

108–110 Proof sketch
We have
gen(µ, P_{W|S}) = ∫ ℓ_n(s̄, w̄) dP_{S̄} dP_{W̄} − ∫ ℓ_n(s, w) dP_{SW} = E_{S W S̄ W̄} [ ℓ_n(S̄, W̄) − ℓ_n(S, W) ].
Pick W̄ = W and use the Lipschitz property in x.
Pick the optimal joint distribution P_{S, S̄ | W} to minimize the bound.

111–115 Speculations: Forward and backward channels
Stability: how much does W change when S changes a little?
This is a property of the forward channel P_{W|S}.
Generalization: how much does S change when W changes a little?
This is a property of the backward channel P_{S|W}.
Idea: pre-process the data to deliberately make the backward channel noisy (data augmentation, smoothing, etc.).

116–119 Speculations: Relation to rate-distortion theory
Rate-distortion theory is the branch of information theory dealing with lossy data compression:
Figure: channel X → Y; minimize E d(X, Y) over P_{Y|X} subject to I(X;Y) ≤ R.
Here: minimize the distortion ℓ_n(W, S) subject to the mutual information constraint I(W;S) ≤ ε.
Existing theory applies to separable distortions d(x^n, y^n) = Σ_i d(x_i, y_i); however, we have ℓ(w, x^n) := Σ_i ℓ(w, x_i).
Essentially the same problem, but the connections are still unclear.

120–123 Open problems
Evaluating the Wasserstein bounds in specific cases, in particular for SGD.
Information-theoretic lower bounds on generalization error?
The Wasserstein bounds rely on a new notion of information: I_W(X, Y) = W(P_X ⊗ P_Y, P_{XY}).
Does it satisfy a chain rule? A data processing inequality?

124 Thank you!


Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Linear classifiers: Overfitting and regularization

Linear classifiers: Overfitting and regularization Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x

More information

On Common Information and the Encoding of Sources that are Not Successively Refinable

On Common Information and the Encoding of Sources that are Not Successively Refinable On Common Information and the Encoding of Sources that are Not Successively Refinable Kumar Viswanatha, Emrah Akyol, Tejaswi Nanjundaswamy and Kenneth Rose ECE Department, University of California - Santa

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017 Algorithms other than SGD CS6787 Lecture 10 Fall 2017 Machine learning is not just SGD Once a model is trained, we need to use it to classify new examples This inference task is not computed with SGD There

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

Lecture 2: August 31

Lecture 2: August 31 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy

More information

Estimators based on non-convex programs: Statistical and computational guarantees

Estimators based on non-convex programs: Statistical and computational guarantees Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Functional Properties of MMSE

Functional Properties of MMSE Functional Properties of MMSE Yihong Wu epartment of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú epartment of Electrical Engineering

More information

Neural Networks: Optimization & Regularization

Neural Networks: Optimization & Regularization Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg

More information

Stability results for Logarithmic Sobolev inequality

Stability results for Logarithmic Sobolev inequality Stability results for Logarithmic Sobolev inequality Daesung Kim (joint work with Emanuel Indrei) Department of Mathematics Purdue University September 20, 2017 Daesung Kim (Purdue) Stability for LSI Probability

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

A picture of the energy landscape of! deep neural networks

A picture of the energy landscape of! deep neural networks A picture of the energy landscape of! deep neural networks Pratik Chaudhari December 15, 2017 UCLA VISION LAB 1 Dy (x; w) = (w p (w p 1 (... (w 1 x))...)) w = argmin w (x,y ) 2 D kx 1 {y =i } log Dy i

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Variational Autoencoder

Variational Autoencoder Variational Autoencoder Göker Erdo gan August 8, 2017 The variational autoencoder (VA) [1] is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational

More information

Stochastic relations of random variables and processes

Stochastic relations of random variables and processes Stochastic relations of random variables and processes Lasse Leskelä Helsinki University of Technology 7th World Congress in Probability and Statistics Singapore, 18 July 2008 Fundamental problem of applied

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information