Information theoretic perspectives on learning algorithms


1 Information theoretic perspectives on learning algorithms
Varun Jog
University of Wisconsin - Madison, Departments of ECE and Mathematics
Shannon Channel Hangout! May 8, 2018
Jointly with Adrian Tovar-Lopez (Math), Ankit Pensia (CS), Po-Ling Loh (Stats)

2–3 Curve fitting
Figure: Given n points in R², fit a curve.
Forward problem: from dataset to curve.

4–7 Finding the right fit
Left is a good fit; right is overfit:
Too wiggly.
Not stable.

8–12 Guessing points from curve
Figure: Given the curve, find the n points.
Backward problem: from curve to dataset.
The backward problem is easier for the overfitted curve!
The overfitted curve contains more information about the dataset.

13–16 This talk
Explore the connection between information and overfitting (Xu & Raginsky, 2017)
Analyze generalization error in a large and general class of learning algorithms (Pensia, J., Loh, 2018)
Measure information via optimal transport theory (Tovar-Lopez, J., 2018)
Speculations, open problems, etc.

17–18 Learning algorithm as a channel
S → Algorithm → W
Input: dataset S of n i.i.d. samples (X_1, X_2, ..., X_n) ∼ µ^n
Output: W
Designing the algorithm is equivalent to designing P_{W|S}. Very different from channel coding!

19–22 Goal of P_{W|S}
Loss function: ℓ : W × X → R
The best choice is w* = argmin_{w ∈ W} E_{X∼µ}[ℓ(w, X)]
Can't always get what we want...
Minimize the empirical loss instead:
ℓ_n(w, S) = (1/n) Σ_{i=1}^n ℓ(w, X_i)

23–27 Generalization error
Expected loss (test error): E_{X∼µ, W∼P_{W|S}P_S} ℓ(W, X)
Expected empirical loss (train error): E_{P_{WS}} ℓ_n(W, S)
The loss has two parts:
Expected loss = (expected loss − expected empirical loss) + expected empirical loss
             = (test error − train error) + train error
Generalization error = test error − train error:
gen(µ, P_{W|S}) = E_{P_S ⊗ P_W} ℓ_n(W, S) − E_{P_{WS}} ℓ_n(W, S)
Ideally, we want both terms small; often the two are analyzed separately.
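To make the test-minus-train decomposition concrete, here is a minimal numerical sketch (not from the talk) with squared loss: polynomials of low and high degree are fit to noisy samples of a sine curve, echoing the curve-fitting example above. The function name, target, and constants are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_error(degree, n_train=30, n_test=10_000, noise=0.3):
    """Test error minus train error for a degree-`degree` polynomial fit (squared loss)."""
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = np.sin(2 * np.pi * x_tr) + noise * rng.normal(size=n_train)
    x_te = rng.uniform(0, 1, n_test)
    y_te = np.sin(2 * np.pi * x_te) + noise * rng.normal(size=n_test)

    coeffs = np.polyfit(x_tr, y_tr, degree)          # empirical risk minimizer
    train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    return test - train

print(gen_error(degree=3), gen_error(degree=15))     # the overfit curve typically has the larger gap
```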

28–31 Basics of mutual information
Mutual information I(X;Y) precisely quantifies the information between (X, Y) ∼ P_{XY}:
I(X;Y) = KL(P_{XY} ‖ P_X ⊗ P_Y)
It satisfies two nice properties:
Data processing inequality: if X → Y → Z, then I(X;Y) ≥ I(X;Z)
Chain rule: I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1)
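As a sanity check on the definition I(X;Y) = KL(P_{XY} ‖ P_X ⊗ P_Y), the following sketch evaluates it for a small discrete joint pmf chosen arbitrarily for illustration.

```python
import numpy as np

# Joint pmf of (X, Y) on a 2x2 alphabet (an arbitrary example, not from the talk).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of X, shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of Y, shape (1, 2)

# I(X;Y) = KL(P_XY || P_X x P_Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)   # about 0.193 nats
```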

32–35 Bounding generalization error using I(W;S)
Theorem (Xu & Raginsky, 2017). Assume that ℓ(w, X) is R-subgaussian for every w ∈ W. Then
|gen(µ, P_{W|S})| ≤ √( (2R²/n) I(S;W) ).
This gives data-dependent bounds on generalization error.
If I(W;S) ≤ ε, call P_{W|S} (ε, µ)-stable.
This notion of stability is different from traditional notions.
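A small sketch of how the right-hand side of the theorem might be evaluated; the values of R, n, and I(S;W) below are placeholders, not numbers from the talk.

```python
import math

def xu_raginsky_bound(R, n, mi):
    """Right-hand side of the theorem: sqrt(2 R^2 I(S;W) / n)."""
    return math.sqrt(2 * R**2 * mi / n)

# Placeholder numbers: R = 1 (subgaussian constant), n = 10_000 samples,
# and I(S;W) = 5 nats retained by the output about the dataset.
print(xu_raginsky_bound(R=1.0, n=10_000, mi=5.0))   # about 0.032
```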

36–39 Proof sketch
Lemma (key lemma in Xu & Raginsky, 2017). If f(X, Y) is σ-subgaussian under P_X ⊗ P_Y, then
|E f(X, Y) − E f(X̄, Ȳ)| ≤ √( 2σ² I(X;Y) ),
where (X, Y) ∼ P_{XY} and (X̄, Ȳ) ∼ P_X ⊗ P_Y.
Recall I(X;Y) = KL(P_{XY} ‖ P_X ⊗ P_Y).
The lemma follows directly from the variational (Donsker–Varadhan) characterization of KL divergence:
KL(µ ‖ ν) = sup_F ( ∫ F dµ − log ∫ e^F dν ).

40–43 How to use it: key insight
Figure: Update W_t using some update rule to generate W_{t+1}.
Many learning algorithms are iterative.
They generate W_0, W_1, W_2, ..., W_T and output W = f(W_0, ..., W_T); for example, W = W_T or W = (1/T) Σ_i W_i.
Bound I(W;S) by controlling the information revealed at each iteration.

44–48 Noisy, iterative algorithms
For t ≥ 1, sample Z_t from S and compute a direction F(W_{t−1}, Z_t) ∈ R^d.
Move in that direction after scaling by a stepsize η_t.
Perturb by isotropic Gaussian noise ξ_t ∼ N(0, σ_t² I_d).
Overall update equation:
W_t = W_{t−1} − η_t F(W_{t−1}, Z_t) + ξ_t,  t ≥ 1.
Run for T steps and output W = f(W_0, ..., W_T). (A sketch of this loop is given below.)
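The loop below is a minimal sketch of this generic noisy iterative scheme; the function and variable names are mine, and F could be, for instance, a per-sample gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_iterative_run(S, F, etas, sigmas, d, w0=None):
    """Generic noisy iterative scheme: W_t = W_{t-1} - eta_t * F(W_{t-1}, Z_t) + xi_t.

    S: dataset (indexable collection of samples); F(w, z): update direction in R^d;
    etas, sigmas: stepsizes and noise standard deviations, one per iteration.
    """
    W = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float)
    trajectory = [W]
    for eta, sigma in zip(etas, sigmas):
        Z = S[rng.integers(len(S))]                # sample a data point from S
        xi = rng.normal(0.0, sigma, size=d)        # isotropic Gaussian perturbation
        W = W - eta * np.asarray(F(W, Z)) + xi
        trajectory.append(W)
    return trajectory                              # output W = f(W_0, ..., W_T)
```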

49–52 Main assumptions
Update equation: W_t = W_{t−1} − η_t F(W_{t−1}, Z_t) + ξ_t, t ≥ 1.
Assumption 1: ℓ(w, Z) is R-subgaussian.
Assumption 2: bounded updates, i.e. sup_{w,z} ‖F(w, z)‖ ≤ L.
Assumption 3: sampling is done without looking at the W_t's, i.e. P(Z_{t+1} | Z^(t), W^(t), S) = P(Z_{t+1} | Z^(t), S).

53 Graphical model
Figure: Graphical model illustrating the Markov properties among S, Z_1, ..., Z_T, and W_0, W_1, ..., W_T in the algorithm.

54–56 Main result
Theorem (Pensia, J., Loh, 2018). The mutual information satisfies the bound
I(S;W) ≤ Σ_{t=1}^T (d/2) log(1 + η_t² L² / (d σ_t²)).
The bound depends on T: the longer you optimize, the higher the risk of overfitting.

57–59 Implications for gen(µ, P_{W|S})
Corollary (bound in expectation). The generalization error of this class of iterative algorithms is bounded by
|gen(µ, P_{W|S})| ≤ √( (R²/n) Σ_{t=1}^T η_t² L² / σ_t² ).
Corollary (high-probability bound). Let ε = Σ_{t=1}^T (d/2) log(1 + η_t² L² / (d σ_t²)). For any α > 0 and 0 < β ≤ 1, if
n > (8R²/α²) ( ε/β + log(2/β) ),
then P_{S,W}( |L_µ(W) − L_S(W)| > α ) ≤ β, where the probability is with respect to S ∼ µ^n and W. (A numerical sketch of both bounds follows below.)
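A small sketch (names and numbers mine) that evaluates ε and the expectation bound for a given schedule of stepsizes and noise levels:

```python
import math

def info_bound(etas, sigmas, d, L):
    """epsilon = sum_t (d/2) log(1 + eta_t^2 L^2 / (d sigma_t^2))  (main theorem)."""
    return sum(0.5 * d * math.log(1 + (eta**2 * L**2) / (d * sigma**2))
               for eta, sigma in zip(etas, sigmas))

def gen_expectation_bound(etas, sigmas, R, L, n):
    """Corollary: sqrt( (R^2 / n) * sum_t eta_t^2 L^2 / sigma_t^2 )."""
    return math.sqrt(R**2 / n * sum((eta * L)**2 / sigma**2
                                    for eta, sigma in zip(etas, sigmas)))

# Example schedule (arbitrary): 100 steps with constant stepsize and noise level.
etas, sigmas = [0.01] * 100, [0.1] * 100
print(info_bound(etas, sigmas, d=10, L=1.0),
      gen_expectation_bound(etas, sigmas, R=1.0, L=1.0, n=10_000))
```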

60–66 Applications: SGLD
SGLD iterates are W_{t+1} = W_t − η_t ∇_w ℓ(W_t, Z_t) + ξ_t, with ξ_t ∼ N(0, σ_t² I_d).
Common experimental practices for SGLD [Welling & Teh, 2011]:
1. the noise variance is σ_t² = η_t,
2. the algorithm is run for K epochs, i.e. T = nK,
3. for a constant c > 0, the stepsizes are η_t = c/t.
Expectation bound: using Σ_{t=1}^T 1/t ≤ log T + 1,
gen(µ, P_{W|S}) ≤ R L √( (1/n) Σ_{t=1}^T η_t ) ≤ R L √( (c log T + c)/n ).
(A numerical check of this schedule is given after this slide group.)
The best known bounds, by Mou et al. (2017), are O(1/n), but our bounds are more general.
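A quick numerical check (with arbitrary constants) that the closed form RL√((c log T + c)/n) indeed upper-bounds the exact sum for the schedule η_t = c/t, σ_t² = η_t:

```python
import math

# SGLD schedule from the slide: eta_t = c/t, sigma_t^2 = eta_t, T = n*K steps.
n, K, c, R, L = 10_000, 5, 0.1, 1.0, 1.0
T = n * K
etas = [c / t for t in range(1, T + 1)]

# With sigma_t^2 = eta_t, the corollary's sum collapses to L^2 * sum_t eta_t,
# and sum_t 1/t <= log T + 1 gives the closed form on the slide.
bound_exact = math.sqrt(R**2 * L**2 / n * sum(etas))
bound_closed = R * L * math.sqrt((c * math.log(T) + c) / n)
print(bound_exact, bound_closed)    # the closed form upper-bounds the exact sum
```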

67–70 Application: Perturbed SGD
Noisy versions of SGD have been proposed to escape saddle points: Ge et al. (2015), Jin et al. (2017).
Similar to SGLD, but with a different noise distribution:
W_t = W_{t−1} − η ( ∇_w ℓ(W_{t−1}, Z_t) + ξ_t ),  where ξ_t ∼ Unif(B_d) (the unit ball in R^d).
Our bound: I(W;S) ≤ T d log(1 + L).
Bounds in expectation and with high probability follow directly from this bound. (A sketch of one update is given below.)
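A minimal sketch of one perturbed-SGD step; it assumes the standard trick for sampling uniformly from the unit ball (Gaussian direction, radius U^(1/d)), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def unif_ball(d):
    """Uniform sample from the unit ball B_d: Gaussian direction, radius U^(1/d)."""
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    return v * rng.uniform() ** (1.0 / d)

def perturbed_sgd_step(w, grad, z, eta):
    """One update W_t = W_{t-1} - eta * (grad_w loss(W_{t-1}, Z_t) + xi_t), xi_t ~ Unif(B_d)."""
    return w - eta * (np.asarray(grad(w, z)) + unif_ball(len(w)))
```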

71–74 Application: Noisy momentum
A modified version of stochastic gradient Hamiltonian Monte Carlo, Chen et al. (2014):
V_t = γ_t V_{t−1} + η_t ∇_w ℓ(W_{t−1}, Z_t) + ξ_t,
W_t = W_{t−1} − γ_t V_{t−1} − η_t ∇_w ℓ(W_{t−1}, Z_t) + ξ_t.
The difference is the addition of noise to the velocity term V_t.
Treating (V_t, W_t) as a single parameter gives
I(S;W) ≤ Σ_{t=1}^T (2d/2) log(1 + 2 η_t² L² / (2d σ_t²)).
The same bound also holds for a noisy version of Nesterov's accelerated gradient descent method (1983).

75–79 Proof sketch
Lots of Markov chains!
I(W;S) ≤ I(W_0^T; Z_1^T) because S → Z_1^T → W_0^T → W (data processing inequality).
The iterative structure gives the chain W_0 → Z_1 → W_1 → Z_2 → W_2 → Z_3 → W_3 → ⋯ → W_T.
Using Markovity with the chain rule,
I(Z_1^T; W_0^T) = Σ_{t=1}^T I(Z_t; W_t | W_{t−1}).
Bottom line: bound the one-step information between W_t and Z_t.

80–82 Proof sketch
Recall W_t = W_{t−1} − η_t F(W_{t−1}, Z_t) + ξ_t.
Using the entropy form of mutual information,
I(W_t; Z_t | W_{t−1}) = h(W_t | W_{t−1}) − h(W_t | W_{t−1}, Z_t),
where h(W_t | W_{t−1}, Z_t) = h(ξ_t) and Var(W_t | W_{t−1} = w_{t−1}) ≤ η_t² L² + σ_t².
The Gaussian distribution maximizes entropy for a fixed variance, giving
I(W_t; Z_t | W_{t−1}) ≤ (d/2) log(1 + η_t² L² / (d σ_t²)).
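For completeness, here is a worked version of the one-step bound; the variance-splitting step is my reconstruction of the argument sketched above, not a verbatim transcription of the slide.

```latex
\begin{align*}
  I(W_t; Z_t \mid W_{t-1})
    &= h(W_t \mid W_{t-1}) - h(W_t \mid W_{t-1}, Z_t)
     = h(W_t \mid W_{t-1}) - h(\xi_t) \\
    &\le \frac{d}{2}\log\!\Big(2\pi e\Big(\frac{\eta_t^2 L^2}{d} + \sigma_t^2\Big)\Big)
       - \frac{d}{2}\log\!\big(2\pi e\,\sigma_t^2\big)
     = \frac{d}{2}\log\!\Big(1 + \frac{\eta_t^2 L^2}{d\,\sigma_t^2}\Big).
\end{align*}
% The inequality spreads the total conditional variance eta_t^2 L^2 + d sigma_t^2
% evenly over the d coordinates (the entropy-maximizing allocation) and uses the fact
% that the Gaussian maximizes differential entropy for a fixed variance.
```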

83–87 What's wrong with mutual information
Mutual information is great, but...
If µ is not absolutely continuous w.r.t. ν, then KL(µ ‖ ν) = +∞.
There are many cases where the mutual information I(W;S) shoots to infinity.
In particular, the bounds cannot be used for (noiseless) stochastic gradient descent (SGD) :(
Noisy algorithms are essential for using mutual-information-based bounds.

88–89 Wasserstein metric
The Wasserstein distance is given by
W_p(µ, ν) = ( inf_{P_{XY} ∈ Π(µ,ν)} E ‖X − Y‖^p )^{1/p},
where Π(µ, ν) is the set of couplings whose marginals are µ and ν.

90–93 W_p for p = 1 and 2
W_1 is also called the Earth Mover's distance or the Kantorovich–Rubinstein distance:
W_1(µ, ν) = sup{ ∫ f (dµ − dν) : f continuous and 1-Lipschitz }.
There is a lot of fascinating theory¹ for W_2.
The optimal coupling in Π(µ, ν) is induced by a map T such that T#µ = ν.
For µ and ν on R,
W_2²(µ, ν) = ∫_0^1 |F^{−1}(x) − G^{−1}(x)|² dx,
where F and G are the cdfs of µ and ν. (A small numerical sketch of the 1-D case is given below.)
¹ Topics in Optimal Transportation by Cédric Villani.
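A small numerical sketch of the 1-D case, using SciPy's wasserstein_distance for W_1 and the quantile formula above for W_2; the two sample distributions are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 5000)       # samples from mu (arbitrary choice)
y = rng.normal(0.5, 1.5, 5000)       # samples from nu (arbitrary choice)

# W_1 via SciPy's 1-D earth mover's distance.
w1 = wasserstein_distance(x, y)

# W_2 via the quantile formula W_2^2 = int_0^1 |F^{-1}(u) - G^{-1}(u)|^2 du,
# approximated on an evenly spaced grid of quantile levels.
u = (np.arange(2000) + 0.5) / 2000
w2 = np.sqrt(np.mean((np.quantile(x, u) - np.quantile(y, u)) ** 2))
print(w1, w2)
```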

94–96 Wasserstein bounds on gen(µ, P_{W|S})
Assumption: ℓ(w, x) is Lipschitz in x for each fixed w, i.e. |ℓ(w, x_1) − ℓ(w, x_2)| ≤ L ‖x_1 − x_2‖_p.
Theorem (Tovar-Lopez & J., 2018). If ℓ(w, ·) is L-Lipschitz in ‖·‖_p, the generalization error satisfies
|gen(µ, P_{W|S})| ≤ (L / n^{1/p}) ( ∫_W W_p^p(P_S, P_{S|W=w}) dP_W(w) )^{1/p}.
This measures the average separation of P_{S|W} from P_S (it looks like a p-th moment in the space of distributions).

97–100 Wasserstein and KL
Definition. We say µ satisfies a T_p(c) transportation inequality with constant c > 0 if for all ν,
W_p(µ, ν) ≤ √( 2c KL(ν ‖ µ) ).
Example: the standard normal satisfies the T_2(1) inequality.
Transport inequalities are used to show concentration phenomena.
For p ∈ [1, 2] this inequality tensorizes: µ^n satisfies T_p(c n^{2/p − 1}).

101–103 Comparison to I(W;S)
In general, the two bounds are not comparable.
If µ satisfies a T_2(c) transportation inequality, we can compare directly:
Theorem (Tovar-Lopez & J., 2018). Suppose p = 2. Then W_2(P_S, P_{S|W}) ≤ √( 2c KL(P_{S|W} ‖ P_S) ), and so
(L / n^{1/2}) ( ∫_W W_2²(P_S, P_{S|W=w}) dP_W(w) )^{1/2} ≤ L √( (2c/n) I(S;W) ).
In particular, for Gaussian data the Wasserstein bound is strictly stronger.

104–105 Comparison to I(W;S)
If µ satisfies a T_1(c) transportation inequality:
Theorem (Tovar-Lopez & J., 2018). Suppose p = 1. Then W_1(P_S, P_{S|W}) ≤ √( 2cn KL(P_{S|W} ‖ P_S) ), and so
(L/n) ∫_W W_1(P_S, P_{S|W=w}) dP_W(w) ≤ L √( (2c/n) I(S;W) ).

106–107 Coupling-based bound on gen(µ, P_{W|S})
Recall the generalization error expression:
gen(µ, P_{W|S}) = E ℓ_n(S̄, W̄) − E ℓ_n(S, W),
where (S̄, W̄) ∼ P_S ⊗ P_W and (S, W) ∼ P_{WS}.
Key insight: any coupling of (S̄, W̄, S, W) that has the correct marginals on (S, W) and (S̄, W̄) leads to the same expected value above.

108–110 Proof sketch
We have
gen(µ, P_{W|S}) = ∫ ℓ_n(s̄, w̄) dP_{S̄} dP_{W̄} − ∫ ℓ_n(s, w) dP_{SW} = E_{S W S̄ W̄} [ ℓ_n(S̄, W̄) − ℓ_n(S, W) ].
Pick W̄ = W and use the Lipschitz property in x.
Pick the optimal joint distribution P_{S, S̄ | W} to minimize the bound.

111–115 Speculations: Forward and backward channels
Stability: how much does W change when S changes a little?
This is a property of the forward channel P_{W|S}.
Generalization: how much does S change when W changes a little?
This is a property of the backward channel P_{S|W}.
Idea: pre-process the data to deliberately make the backward channel noisy (data augmentation, smoothing, etc.).

116–119 Speculations: Relation to rate-distortion theory
Rate-distortion theory is the branch of information theory dealing with lossy data compression:
Figure: channel X → Y; minimize E d(X, Y) over P_{Y|X} subject to I(X;Y) ≤ R.
Here: minimize the distortion ℓ_n(W, S) subject to the mutual information constraint I(W;S) ≤ ε.
Existing theory applies to separable distortions d(x^n, y^n) = Σ_i d(x_i, y_i); however, we have ℓ(w, x^n) := Σ_i ℓ(w, x_i).
Essentially the same problem, but the connections are still unclear.

120–123 Open problems
Evaluating the Wasserstein bounds in specific cases, in particular for SGD.
Information-theoretic lower bounds on generalization error?
The Wasserstein bounds rely on a new notion of information: I_W(X, Y) = W(P_X ⊗ P_Y, P_{XY}).
Does it satisfy a chain rule? A data processing inequality?

124 Thank you!


Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Linear classifiers: Overfitting and regularization

Linear classifiers: Overfitting and regularization Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x

More information

On Common Information and the Encoding of Sources that are Not Successively Refinable

On Common Information and the Encoding of Sources that are Not Successively Refinable On Common Information and the Encoding of Sources that are Not Successively Refinable Kumar Viswanatha, Emrah Akyol, Tejaswi Nanjundaswamy and Kenneth Rose ECE Department, University of California - Santa

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017 Algorithms other than SGD CS6787 Lecture 10 Fall 2017 Machine learning is not just SGD Once a model is trained, we need to use it to classify new examples This inference task is not computed with SGD There

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

Lecture 2: August 31

Lecture 2: August 31 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy

More information

Estimators based on non-convex programs: Statistical and computational guarantees

Estimators based on non-convex programs: Statistical and computational guarantees Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Functional Properties of MMSE

Functional Properties of MMSE Functional Properties of MMSE Yihong Wu epartment of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú epartment of Electrical Engineering

More information

Neural Networks: Optimization & Regularization

Neural Networks: Optimization & Regularization Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg

More information

Stability results for Logarithmic Sobolev inequality

Stability results for Logarithmic Sobolev inequality Stability results for Logarithmic Sobolev inequality Daesung Kim (joint work with Emanuel Indrei) Department of Mathematics Purdue University September 20, 2017 Daesung Kim (Purdue) Stability for LSI Probability

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

A picture of the energy landscape of! deep neural networks

A picture of the energy landscape of! deep neural networks A picture of the energy landscape of! deep neural networks Pratik Chaudhari December 15, 2017 UCLA VISION LAB 1 Dy (x; w) = (w p (w p 1 (... (w 1 x))...)) w = argmin w (x,y ) 2 D kx 1 {y =i } log Dy i

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Variational Autoencoder

Variational Autoencoder Variational Autoencoder Göker Erdo gan August 8, 2017 The variational autoencoder (VA) [1] is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational

More information

Stochastic relations of random variables and processes

Stochastic relations of random variables and processes Stochastic relations of random variables and processes Lasse Leskelä Helsinki University of Technology 7th World Congress in Probability and Statistics Singapore, 18 July 2008 Fundamental problem of applied

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information