Information theoretic perspectives on learning algorithms
1 Information theoretic perspectives on learning algorithms
Varun Jog, University of Wisconsin–Madison, Departments of ECE and Mathematics
Shannon Channel Hangout! May 8, 2018
Joint work with Adrian Tovar-Lopez (Math), Ankit Pensia (CS), and Po-Ling Loh (Stats)
2 Curve fitting
Figure: Given $N$ points in $\mathbb{R}^2$, fit a curve
Forward problem: from dataset to curve
4 Finding the right fit
Left is fit, right is overfit:
- Too wiggly
- Not stable
8 Guessing points from curve
Figure: Given curve, find $N$ points
Backward problem: from curve to dataset
The backward problem is easier for the overfitted curve! The curve contains more information about the dataset.
13 This talk
- Explore the connection between information and overfitting (Xu & Raginsky, 2017)
- Analyze generalization error for a large and general class of learning algorithms (Pensia, J., Loh, 2018)
- Measure information via optimal transport theory (Tovar-Lopez, J., 2018)
- Speculations, open problems, etc.
17 Learning algorithm as a channel
S → Algorithm → W
Input: dataset $S$ with $n$ i.i.d. samples $(X_1, X_2, \dots, X_n) \sim \mu^{\otimes n}$
Output: $W$
Designing the algorithm is equivalent to designing $P_{W|S}$. Very different from channel coding!
19 Goal of $P_{W|S}$
Loss function: $\ell : \mathcal{W} \times \mathcal{X} \to \mathbb{R}$
Best choice is $w^* = \arg\min_{w \in \mathcal{W}} \mathbb{E}_{X \sim \mu}[\ell(w, X)]$
Can't always get what we want... minimize the empirical loss instead:
$\ell_n(w, S) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, X_i)$
23 Generalization error
Expected loss (test error): $\mathbb{E}_{X \sim \mu,\, (S, W) \sim P_{WS}}\, \ell(W, X)$
Expected empirical loss (train error): $\mathbb{E}_{P_{WS}}\, \ell_n(W, S)$
The loss has two parts:
Expected loss = (Expected loss − Expected empirical loss) + Expected empirical loss = (test error − train error) + train error
Generalization error = test error − train error:
$\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}_{P_S \otimes P_W}\, \ell_n(W, S) - \mathbb{E}_{P_{WS}}\, \ell_n(W, S)$
Ideally, we want both terms small. Often, the two are analyzed separately.
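The train/test decomposition above can be seen numerically. The following toy sketch (not from the talk; the sine target, noise level, and polynomial degrees are invented for illustration) fits polynomials of two complexities by empirical risk minimization and reports the generalization gap, i.e., test error minus train error:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_gap(degree, n_train=20, n_test=1000, noise=0.3):
    """Fit a degree-`degree` polynomial by least squares (ERM) and
    return the generalization gap = test MSE - train MSE."""
    def sample(n):
        x = rng.uniform(-1, 1, n)
        y = np.sin(np.pi * x) + noise * rng.standard_normal(n)
        return x, y

    x_tr, y_tr = sample(n_train)
    x_te, y_te = sample(n_test)
    coeffs = np.polyfit(x_tr, y_tr, degree)   # empirical risk minimizer
    train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    return test - train

gap_low = fit_and_gap(degree=2)    # mildly underfit: small gap
gap_high = fit_and_gap(degree=15)  # wiggly overfit: large gap
```

The overfitted curve drives the train error toward zero while the test error grows, mirroring the "too wiggly" curve on the earlier slide.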
28 Basics of mutual information
Mutual information $I(X; Y)$ precisely quantifies the information between $(X, Y) \sim P_{XY}$:
$I(X; Y) = \mathrm{KL}(P_{XY} \,\|\, P_X \otimes P_Y)$
It satisfies two nice properties:
- Data processing inequality: if $X \to Y \to Z$, then $I(X; Z) \le I(X; Y)$
- Chain rule: $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$
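The KL form of mutual information can be computed directly for discrete distributions. A small sketch (toy joint distributions, not from the talk) checks the two extreme cases:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = KL(P_XY || P_X x P_Y) in nats, for a discrete joint
    given as a 2-D array of probabilities."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X (column)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y (row)
    prod = p_x @ p_y                        # product-of-marginals matrix
    mask = p_xy > 0                         # 0 * log 0 = 0 convention
    return float((p_xy[mask] * np.log(p_xy[mask] / prod[mask])).sum())

# Independent variables carry zero information...
p_indep = np.outer([0.5, 0.5], [0.3, 0.7])
# ...while a perfectly correlated fair bit carries log 2 nats.
p_corr = np.array([[0.5, 0.0], [0.0, 0.5]])

assert abs(mutual_information(p_indep)) < 1e-12
assert abs(mutual_information(p_corr) - np.log(2)) < 1e-12
```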
32 Bounding generalization error using $I(W; S)$
Theorem (Xu & Raginsky (2017)). Assume that $\ell(w, X)$ is $R$-subgaussian for every $w \in \mathcal{W}$. Then the following bound holds:
$|\mathrm{gen}(\mu, P_{W|S})| \le \sqrt{\frac{2R^2}{n} I(S; W)}. \quad (1)$
- Data-dependent bounds on generalization error
- If $I(W; S) \le \epsilon$, call $P_{W|S}$ $(\epsilon, \mu)$-stable
- This notion of stability differs from traditional notions
36 Proof sketch
Lemma (key lemma in Xu & Raginsky (2017)). If $f(X, Y)$ is $\sigma$-subgaussian under $P_X \otimes P_Y$, then
$|\mathbb{E} f(X, Y) - \mathbb{E} f(\bar{X}, \bar{Y})| \le \sqrt{2\sigma^2 I(X; Y)},$
where $(X, Y) \sim P_{XY}$ and $(\bar{X}, \bar{Y}) \sim P_X \otimes P_Y$.
Recall $I(X; Y) = \mathrm{KL}(P_{XY} \| P_X \otimes P_Y)$. The lemma follows directly from the variational characterization of KL divergence:
$\mathrm{KL}(\mu \| \nu) = \sup_F \left( \int F \, d\mu - \log \int e^F \, d\nu \right)$
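The variational (Donsker–Varadhan) characterization can be checked numerically for discrete distributions: the supremum is attained at the log-likelihood ratio $F = \log(d\mu/d\nu)$, and any other witness gives a smaller value. A small sketch with made-up distributions:

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.2, 0.5, 0.3])   # toy distributions on 3 points
nu = np.array([0.4, 0.4, 0.2])

kl = float(np.sum(mu * np.log(mu / nu)))  # KL(mu || nu) directly

def dv_objective(F):
    """Donsker-Varadhan objective: E_mu[F] - log E_nu[e^F]."""
    return float(np.sum(mu * F) - np.log(np.sum(nu * np.exp(F))))

# The optimal witness F = log(dmu/dnu) attains the KL divergence exactly...
assert abs(dv_objective(np.log(mu / nu)) - kl) < 1e-12
# ...and any other F can only do worse.
for _ in range(100):
    assert dv_objective(rng.standard_normal(3)) <= kl + 1e-12
```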
40 How to use it: key insight
Figure: Update $W_t$ using some update rule to generate $W_{t+1}$
- Many learning algorithms are iterative
- Generate $W_0, W_1, W_2, \dots, W_T$ and output $W = f(W_0, \dots, W_T)$; for example, $W = W_T$ or $W = \frac{1}{T} \sum_i W_i$
- Bound $I(W; S)$ by controlling the information gained at each iteration
44 Noisy, iterative algorithms
- For $t \ge 1$, sample $Z_t$ from $S$ and compute a direction $F(W_{t-1}, Z_t) \in \mathbb{R}^d$
- Move in that direction after scaling by a stepsize $\eta_t$
- Perturb by isotropic Gaussian noise $\xi_t \sim N(0, \sigma_t^2 I_d)$
Overall update equation:
$W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t, \quad t \ge 1$
Run for $T$ steps and output $W = f(W_0, \dots, W_T)$
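The update scheme above can be sketched in a few lines. The loss, data distribution, and constants below are invented for illustration ($F$ is the gradient of a squared loss, so the iterates drift toward the sample mean); this is a sketch of the algorithm class, not of any specific experiment from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, T = 2, 50, 200
S = rng.standard_normal((n, d)) + 1.0       # dataset: n i.i.d. samples

def F(w, z):
    # Direction for the squared loss l(w, x) = ||w - x||^2 / 2 (illustrative).
    return w - z

w = np.zeros(d)                              # W_0 chosen independently of S
iterates = [w]
for t in range(1, T + 1):
    eta_t, sigma_t = 0.05, 0.01              # constant schedule for simplicity
    z = S[rng.integers(n)]                   # sample Z_t from S
    # W_t = W_{t-1} - eta_t * F(W_{t-1}, Z_t) + xi_t
    w = w - eta_t * F(w, z) + sigma_t * rng.standard_normal(d)
    iterates.append(w)

w_out = np.mean(iterates, axis=0)            # output W = average of iterates
```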
49 Main assumptions
Update equation: $W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t$, $t \ge 1$
- Assumption 1: $\ell(w, Z)$ is $R$-subgaussian
- Assumption 2: bounded updates, i.e., $\sup_{w, z} \|F(w, z)\| \le L$
- Assumption 3: sampling is done without looking at the $W_t$'s, i.e., $P(Z_{t+1} \mid Z^{(t)}, W^{(t)}, S) = P(Z_{t+1} \mid Z^{(t)}, S)$
53 Graphical model
Figure: Graphical model illustrating the Markov properties among $S$, $Z_1, \dots, Z_T$, and $W_0, \dots, W_T$
54 Main result
Theorem (Pensia, J., Loh (2018)). The mutual information satisfies the bound
$I(S; W) \le \sum_{t=1}^{T} \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right).$
The bound depends on $T$: the longer you optimize, the higher the risk of overfitting.
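The bound is easy to evaluate for a concrete schedule. A sketch (the schedule constants are made up; the SGLD-style choice $\sigma_t^2 = \eta_t$ anticipates the application on a later slide):

```python
import numpy as np

def mi_bound(d, L, etas, sigmas):
    """Evaluate sum_t (d/2) * log(1 + eta_t^2 L^2 / (d sigma_t^2))."""
    etas = np.asarray(etas, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    return float(np.sum(0.5 * d * np.log1p(etas**2 * L**2 / (d * sigmas**2))))

T, d, L = 100, 10, 1.0
etas = 0.1 / np.arange(1, T + 1)   # decaying step sizes eta_t = c / t
sigmas = np.sqrt(etas)             # SGLD-style noise: sigma_t^2 = eta_t

bound = mi_bound(d, L, etas, sigmas)
```

Since $\log(1 + x) \le x$, the bound is at most $\sum_t \eta_t^2 L^2 / (2\sigma_t^2)$, which is what the expectation corollary on the next slide uses.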
57 Implications for $\mathrm{gen}(\mu, P_{W|S})$
Corollary (bound in expectation). The generalization error of our class of iterative algorithms is bounded by
$|\mathrm{gen}(\mu, P_{W|S})| \le \sqrt{\frac{R^2}{n} \sum_{t=1}^{T} \frac{\eta_t^2 L^2}{\sigma_t^2}}.$
Corollary (high-probability bound). Let $\epsilon = \sum_{t=1}^{T} \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right)$. For any $\alpha > 0$ and $0 < \beta \le 1$, if
$n > \frac{8R^2}{\alpha^2} \left( \frac{\epsilon}{\beta} + \log\frac{2}{\beta} \right),$
we have
$P_{S,W}\left( |L_\mu(W) - L_S(W)| > \alpha \right) \le \beta, \quad (2)$
where the probability is with respect to $S \sim \mu^{\otimes n}$ and $W$.
60 Applications: SGLD
SGLD iterates are $W_t = W_{t-1} - \eta_t \nabla_w \ell(W_{t-1}, Z_t) + \xi_t$
Common experimental practices for SGLD [Welling & Teh, 2011]:
1. the noise variance is $\sigma_t^2 = \eta_t$,
2. the algorithm is run for $K$ epochs, i.e., $T = nK$,
3. for a constant $c > 0$, the stepsizes are $\eta_t = c/t$.
Expectation bound: using $\sum_{t=1}^{T} \frac{1}{t} \le \log T + 1$,
$|\mathrm{gen}(\mu, P_{W|S})| \le \frac{RL}{\sqrt{n}} \sqrt{\sum_{t=1}^{T} \eta_t} \le \frac{RL}{\sqrt{n}} \sqrt{c \log T + c}$
The best known bounds, by Mou et al. (2017), are $O(1/n)$, but our bounds are more general.
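The two estimates above are easy to sanity-check numerically. A sketch (the values of $R$, $L$, $c$, $T$, $n$ are made up for illustration):

```python
import numpy as np

# Check the harmonic-sum estimate used above: sum_{t=1}^T 1/t <= log(T) + 1.
for T in (10, 100, 10_000):
    harmonic = float(np.sum(1.0 / np.arange(1, T + 1)))
    assert harmonic <= np.log(T) + 1

def sgld_gap_bound(R, L, c, T, n):
    """(RL / sqrt(n)) * sqrt(c log T + c): the expectation bound for SGLD
    with eta_t = c/t and sigma_t^2 = eta_t."""
    return R * L * np.sqrt(c * np.log(T) + c) / np.sqrt(n)

b = sgld_gap_bound(R=1.0, L=1.0, c=0.1, T=10_000, n=50_000)
```

Note the $\sqrt{\log T / n}$ scaling: running more epochs loosens the bound only logarithmically, while more data tightens it at the usual $1/\sqrt{n}$ rate.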
67 Application: Perturbed SGD
Noisy versions of SGD have been proposed to escape saddle points: Ge et al. (2015), Jin et al. (2017)
Similar to SGLD, but with a different noise distribution:
$W_t = W_{t-1} - \eta \left( \nabla_w \ell(W_{t-1}, Z_t) + \xi_t \right)$, where $\xi_t \sim \mathrm{Unif}(B_d)$ (the unit ball in $\mathbb{R}^d$)
Our bound: $I(W; S) \le T d \log(1 + L)$
Bounds in expectation and with high probability follow directly from this bound.
71 Application: Noisy momentum
A modified version of stochastic gradient Hamiltonian Monte Carlo, Chen et al. (2014):
$V_t = \gamma_t V_{t-1} + \eta_t \nabla_w \ell(W_{t-1}, Z_t) + \xi_t,$
$W_t = W_{t-1} - \gamma_t V_{t-1} - \eta_t \nabla_w \ell(W_{t-1}, Z_t) + \xi_t$
The difference is the addition of noise to the velocity term $V_t$.
Treating $(V_t, W_t)$ as a single parameter gives
$I(S; W) \le \sum_{t=1}^{T} \frac{2d}{2} \log\left(1 + \frac{2 \eta_t^2 L^2}{2 d \sigma_t^2}\right)$
The same bound also holds for a noisy version of Nesterov's accelerated gradient descent method (1983).
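A minimal sketch of the noisy momentum recursion on a toy quadratic loss. Everything here is invented for illustration (the target, constants, and the convention of writing the $W$ update as $W_t = W_{t-1} - V_t$, which matches the two displayed updates up to the sign of the shared noise $\xi_t$):

```python
import numpy as np

rng = np.random.default_rng(0)

d, T = 3, 500
target = np.array([1.0, -2.0, 0.5])

def grad(w, z):
    # gradient of l(w, z) = ||w - z||^2 / 2 (illustrative loss)
    return w - z

w, v = np.zeros(d), np.zeros(d)
for t in range(1, T + 1):
    gamma_t, eta_t, sigma_t = 0.9, 0.01, 0.01
    z = target + 0.1 * rng.standard_normal(d)   # noisy sample around target
    xi = sigma_t * rng.standard_normal(d)       # Gaussian perturbation
    # Noise enters the velocity, and through it the iterate W_t.
    v = gamma_t * v + eta_t * grad(w, z) + xi
    w = w - v
```

The pair $(V_t, W_t)$ evolves jointly, which is exactly why the slide's bound treats it as a single $2d$-dimensional parameter.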
75 Proof sketch
Lots of Markov chains!
- $I(W; S) \le I(W_0^T; Z_1^T)$ because $S \to Z_1^T \to W_0^T \to W$ (data processing inequality)
- The iterative structure means $W_0 \to Z_1 \to W_1 \to Z_2 \to W_2 \to Z_3 \to W_3 \to \cdots \to W_T$
- Use Markovity with the chain rule to get $I(Z_1^T; W_0^T) = \sum_{t=1}^{T} I(Z_t; W_t \mid W_{t-1})$
Bottom line: bound the one-step information between $W_t$ and $Z_t$.
80 Proof sketch
Recall $W_t = W_{t-1} - \eta_t F(W_{t-1}, Z_t) + \xi_t$
Using the entropy form of mutual information,
$I(W_t; Z_t \mid W_{t-1}) = \underbrace{h(W_t \mid W_{t-1})}_{\mathrm{Var}(W_t \mid w_{t-1}) \le \eta_t^2 L^2 + \sigma_t^2} - \underbrace{h(W_t \mid W_{t-1}, Z_t)}_{= h(\xi_t)}$
The Gaussian distribution maximizes entropy for fixed variance, giving
$I(W_t; Z_t \mid W_{t-1}) \le \frac{d}{2} \log\left(1 + \frac{\eta_t^2 L^2}{d \sigma_t^2}\right)$
83 What's wrong with mutual information?
Mutual information is great, but...
- If $\mu$ is not absolutely continuous w.r.t. $\nu$, then $\mathrm{KL}(\mu \| \nu) = +\infty$
- In many cases the mutual information $I(W; S)$ shoots to infinity
- So the bounds cannot be used for stochastic gradient descent (SGD) :(
- Noisy algorithms are essential for using mutual information based bounds
88 Wasserstein metric
The Wasserstein distance is given by
$W_p(\mu, \nu) = \inf_{P_{XY} \in \Pi(\mu, \nu)} \left( \mathbb{E} \|X - Y\|^p \right)^{1/p}$
where $\Pi(\mu, \nu)$ is the set of couplings whose marginals are $\mu$ and $\nu$.
90 $W_p$ for $p = 1$ and $2$
- $W_1$ is also called the earth mover's distance or the Kantorovich–Rubinstein distance:
$W_1(\mu, \nu) = \sup \left\{ \int f \, (d\mu - d\nu) : f \text{ continuous and 1-Lipschitz} \right\}$
- Lots of fascinating theory for $W_2$ (see Topics in Optimal Transportation by Cedric Villani)
- The optimal coupling in $\Pi(\mu, \nu)$ is a function $T$ such that $T \# \mu = \nu$
- For $\mu$ and $\nu$ on $\mathbb{R}$,
$W_2^2(\mu, \nu) = \int \left| F^{-1}(x) - G^{-1}(x) \right|^2 dx,$
where $F$ and $G$ are the cdfs of $\mu$ and $\nu$.
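The one-dimensional inverse-cdf formula makes $W_2$ easy to approximate for empirical measures: with equal sample sizes the inverse cdfs are step functions, so the integral reduces to matching sorted samples. A sketch (the Gaussian test distributions are chosen for illustration; for a pure location shift, $W_2$ equals the shift):

```python
import numpy as np

def w2_empirical(x, y):
    """Empirical W_2 between two equal-size samples on the real line,
    via the inverse-cdf formula: match sorted samples."""
    x, y = np.sort(x), np.sort(y)
    assert len(x) == len(y)
    return float(np.sqrt(np.mean((x - y) ** 2)))

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000) + 3.0   # same shape, shifted by 3

w2 = w2_empirical(a, b)                   # should be close to 3
```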
94 Wasserstein bounds on $\mathrm{gen}(\mu, P_{W|S})$
Assumption: $\ell(w, x)$ is Lipschitz in $x$ for each fixed $w$, i.e., $|\ell(w, x_1) - \ell(w, x_2)| \le L \|x_1 - x_2\|_p$
Theorem (Tovar-Lopez & J. (2018)). If $\ell(w, \cdot)$ is $L$-Lipschitz in $\|\cdot\|_p$, the generalization error satisfies the following bound:
$|\mathrm{gen}(\mu, P_{W|S})| \le \frac{L}{n^{1/p}} \left( \int_{\mathcal{W}} W_p^p(P_S, P_{S|w}) \, dP_W(w) \right)^{1/p}$
This measures the average separation of $P_{S|W}$ from $P_S$ (it looks like a $p$-th moment in the space of distributions).
97 Wasserstein and KL
Definition. We say $\mu$ satisfies a $T_p(c)$ transportation inequality with constant $c > 0$ if for all $\nu$ we have
$W_p(\mu, \nu) \le \sqrt{2c \, \mathrm{KL}(\nu \| \mu)}$
- Example: the standard normal satisfies the $T_2(1)$ inequality
- Transport inequalities are used to show concentration phenomena
- For $p \in [1, 2]$ this inequality tensorizes: $\mu^{\otimes n}$ satisfies $T_p(c n^{2/p - 1})$
101 Comparison to $I(W; S)$
In general, the two bounds are not comparable. If $\mu$ satisfies a $T_2(c)$ transportation inequality, we can compare directly:
Theorem (Tovar-Lopez & J. (2018)). Suppose $p = 2$. Then $W_2(P_S, P_{S|W}) \le \sqrt{2c \, \mathrm{KL}(P_{S|W} \| P_S)}$, and so
$\frac{L}{n^{1/2}} \left( \int_{\mathcal{W}} W_2^2(P_S, P_{S|w}) \, dP_W(w) \right)^{1/2} \le L \sqrt{\frac{2c}{n} I(S; W)}$
In particular, for Gaussian data the Wasserstein bound is strictly stronger.
104 Comparison to $I(W; S)$
If $\mu$ satisfies a $T_1(c)$ transportation inequality:
Theorem (Tovar-Lopez & J. (2018)). Suppose $p = 1$. Then $W_1(P_S, P_{S|W}) \le \sqrt{2cn \, \mathrm{KL}(P_{S|W} \| P_S)}$, and so
$\frac{L}{n} \int_{\mathcal{W}} W_1(P_S, P_{S|w}) \, dP_W(w) \le L \sqrt{\frac{2c}{n} I(S; W)}$
106 Coupling based bound on $\mathrm{gen}(\mu, P_{W|S})$
Recall the generalization error expression:
$\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}\, \ell_n(\bar{S}, \bar{W}) - \mathbb{E}\, \ell_n(S, W),$
where $(\bar{S}, \bar{W}) \sim P_S \otimes P_W$ and $(S, W) \sim P_{WS}$.
Key insight: any coupling of $(\bar{S}, \bar{W}, S, W)$ that has the correct marginals on $(S, W)$ and $(\bar{S}, \bar{W})$ leads to the same expected value above.
108 Proof sketch
We have
$\mathrm{gen}(\mu, P_{W|S}) = \int \ell_n(s, w) \, dP_{SW} - \int \ell_n(\bar{s}, \bar{w}) \, dP_{\bar{S}\bar{W}} = \mathbb{E}_{SW\bar{S}\bar{W}} \left[ \ell_n(S, W) - \ell_n(\bar{S}, \bar{W}) \right]$
- Pick $\bar{W} = W$ and use the Lipschitz property in $x$
- Pick the optimal joint distribution of $(S, \bar{S})$ given $W$ to minimize the bound
111 Speculations: Forward and backward channels
- Stability: how much does $W$ change when $S$ changes a little? A property of the forward channel $P_{W|S}$
- Generalization: how much does $S$ change when $W$ changes a little? A property of the backward channel $P_{S|W}$
- Idea: pre-process the data to deliberately make the backward channel noisy (data augmentation, smoothing, etc.)
116 Speculations: Relation to rate-distortion theory. The branch of information theory dealing with lossy data compression: min over P_{Y|X} of E[d(X, Y)] subject to I(X; Y) ≤ R.
117 Speculations: Relation to rate-distortion theory. The branch of information theory dealing with lossy data compression: min over P_{Y|X} of E[d(X, Y)] subject to I(X; Y) ≤ R. Here: minimize the distortion given by l_N(W, S) subject to the mutual information constraint I(W; S) ≤ ε.
118 Speculations: Relation to rate-distortion theory. The branch of information theory dealing with lossy data compression: min over P_{Y|X} of E[d(X, Y)] subject to I(X; Y) ≤ R. Here: minimize the distortion given by l_N(W, S) subject to the mutual information constraint I(W; S) ≤ ε. Existing theory applies to d(x^n, y^n) = Σ_i d(x_i, y_i); however, we have l(w, x^n) := Σ_i l(w, x_i).
119 Speculations: Relation to rate-distortion theory. The branch of information theory dealing with lossy data compression: min over P_{Y|X} of E[d(X, Y)] subject to I(X; Y) ≤ R. Here: minimize the distortion given by l_N(W, S) subject to the mutual information constraint I(W; S) ≤ ε. Existing theory applies to d(x^n, y^n) = Σ_i d(x_i, y_i); however, we have l(w, x^n) := Σ_i l(w, x_i). Essentially the same problem, but the connections are still unclear.
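The constrained problem min E[d(X, Y)] subject to I(X; Y) ≤ R is solved numerically for discrete alphabets by the standard Blahut–Arimoto iteration. The sketch below (not from the talk) uses the Lagrangian form min I(X; Y) + β·E[d(X, Y)]; the binary source, Hamming distortion, and β value are illustrative choices:

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, iters=200):
    """Lagrangian rate-distortion: min over P(y|x) of I(X;Y) + beta*E[d].

    p_x: (n,) source distribution; d: (n, m) distortion matrix.
    Returns the optimal conditional q(y|x) and output marginal q(y).
    """
    n, m = d.shape
    q_y = np.full(m, 1.0 / m)
    for _ in range(iters):
        # optimal conditional given the current output marginal
        q_yx = q_y[None, :] * np.exp(-beta * d)
        q_yx /= q_yx.sum(axis=1, keepdims=True)
        # re-estimate the output marginal
        q_y = p_x @ q_yx
    return q_yx, q_y

# Binary source with Hamming distortion: a large beta forces low distortion.
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
q_yx, q_y = blahut_arimoto(p_x, d, beta=5.0)
distortion = float(np.sum(p_x[:, None] * q_yx * d))
print(distortion)  # small expected Hamming distortion
```

For this symmetric source the fixed point matches the known test channel: a BSC with crossover probability 1/(1 + e^β).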
120 Open problems. Evaluating Wasserstein bounds for specific cases, in particular for SGD.
121 Open problems. Evaluating Wasserstein bounds for specific cases, in particular for SGD. Information-theoretic lower bounds on generalization error?
122 Open problems. Evaluating Wasserstein bounds for specific cases, in particular for SGD. Information-theoretic lower bounds on generalization error? Wasserstein bounds rely on a new notion of information: I_W(X; Y) = W(P_X ⊗ P_Y, P_{XY}).
123 Open problems. Evaluating Wasserstein bounds for specific cases, in particular for SGD. Information-theoretic lower bounds on generalization error? Wasserstein bounds rely on a new notion of information: I_W(X; Y) = W(P_X ⊗ P_Y, P_{XY}). Chain rule? Data processing?
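For discrete distributions on a small support, I_W(X; Y) = W(P_X ⊗ P_Y, P_{XY}) can be computed exactly as an optimal-transport linear program. A sketch using SciPy — the ground metric (Hamming on bit pairs) and the toy joint distribution are illustrative assumptions, not from the talk:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(a, b, C):
    """W1 between discrete distributions a, b on the same k-point support,
    with ground cost C[i, j], via the optimal-transport LP."""
    k = len(a)
    # Equality constraints on the transport plan T (flattened row-major):
    # row sums of T equal a, column sums equal b.
    A_eq = np.zeros((2 * k, k * k))
    for i in range(k):
        A_eq[i, i * k:(i + 1) * k] = 1.0   # row i sums to a[i]
        A_eq[k + i, i::k] = 1.0            # column i sums to b[i]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

# Joint of two correlated bits vs. the product of its marginals.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])              # P_{XY}
p_x, p_y = p_xy.sum(1), p_xy.sum(0)
prod = np.outer(p_x, p_y)                  # P_X (x) P_Y
pts = [(x, y) for x in (0, 1) for y in (0, 1)]
C = np.array([[abs(u - s) + abs(v - t) for (s, t) in pts]
              for (u, v) in pts])          # Hamming ground metric
I_W = wasserstein_lp(prod.ravel(), p_xy.ravel(), C)
print(I_W)   # positive because X and Y are dependent
```

Unlike mutual information, whether this quantity admits a chain rule or a data-processing inequality is exactly the open question on the slide.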
124 Thank you!