the Information Bottleneck

1 The Information Bottleneck. Daniel Moyer, December 10, 2017. Imaging Genetics Center / Information Sciences Institute, University of Southern California

2 Sorry, no Neuroimaging! (at least not presented)

3 Instead, a deep dive into a relevant topic.

4 Instead, a medium dive into a relevant topic.

5 Rate-Distortion Theory

6 Sending a signal

7 Sending a signal. Send: X = NEURON

8 Sending a signal. Binary wire. Send: X = NEURON

9 Sending a signal. Naïve encoding. Len log symbols. Send: X = NEURON. Rec: X = NEURON

10 Sending a signal. Lossless (Huffman) encoding. Len $6 \times 3.75 = 22.5$ symbols. Send: X = NEURON. Rec: X = NEURON
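To make the Huffman step concrete, here is a minimal sketch that computes Huffman code lengths with Python's standard `heapq` module. The six-letter alphabet and its probabilities are hypothetical, chosen so the arithmetic is easy to follow; they are not the letter distribution behind the 3.75 figure on the slide.

```python
import heapq

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a dict of symbol probabilities."""
    # Each heap entry is (probability, tie_breaker, {symbol: depth so far}).
    # The integer tie_breaker keeps Python from ever comparing the dicts.
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, d2 = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical distribution (an assumption for illustration only).
probs = {"N": 0.25, "E": 0.25, "U": 0.125, "R": 0.125, "O": 0.125, "X": 0.125}
lengths = huffman_code_lengths(probs)
message_bits = sum(lengths[ch] for ch in "NEURON")
fixed_bits = 3 * len("NEURON")  # naive code: 3 bits per symbol for 6 symbols
```

Under this assumed distribution the frequent letters get 2-bit codes and the rare ones 3-bit codes, so the message is shorter than with a fixed 3-bit code and still decodes exactly.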

11 Sending a signal. (Example) Lossy encoding. Len 3.75 = 17.5 symbols. Send: X = NEURON. Rec: X = NERON

12 Choosing a lossy code. What makes a good encoding? (According to Claude Shannon.) Low Rate: we want short messages. Low Distortion: we want our messages to still make sense. Objective: learn the best encoding $p : X \to \hat{X}$. Clearly this also depends on a measure of distortion $d : X \times \hat{X} \to \mathbb{R}^{+}$.

13 Preliminaries.
Entropy (description length): $H(X) = \mathbb{E}_{p(x)}[-\log p(x)] = -\sum_x p(x) \log p(x)$.
Mutual Information (transmission rate; technically a bound): $I(X;Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right] = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$, i.e. the change in description length.
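As a quick numerical check of these definitions, the sketch below computes entropy and mutual information for a small joint table and confirms $I(X;Y) = H(X) - H(X \mid Y)$. The 2x2 joint distribution is a made-up example, and the function names are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zeros are skipped)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def mutual_information(pxy):
    """I(X;Y) in bits for a joint table pxy[i, j] = p(x_i, y_j)."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # column vector p(x)
    py = pxy.sum(axis=0, keepdims=True)   # row vector p(y)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])))

# Hypothetical joint distribution over two binary variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
h_x = entropy(pxy.sum(axis=1))
# H(X|Y) = H(X,Y) - H(Y)
h_x_given_y = entropy(pxy.ravel()) - entropy(pxy.sum(axis=0))
i_xy = mutual_information(pxy)
```

Here `i_xy` and `h_x - h_x_given_y` agree up to floating-point error, which is the "change in description length" reading of mutual information.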

14 Rate Distortion Theory. What makes a good encoding? (According to Claude Shannon.) Low Rate: we want short messages. Low Distortion: we want our messages to still make sense. Objective: given $d : X \times \hat{X} \to \mathbb{R}^{+}$ and $D$, minimize $I(X;\hat{X})$ over $p(\hat{x} \mid x)$ subject to $\mathbb{E}_{p(\hat{x},x)}[d(x,\hat{x})] < D$.

15 Rate Distortion Theory. Rate Distortion Theorem (Shannon and Kolmogorov). Define the function
$R(D) = \min_{p(\hat{x} \mid x)\,:\,\mathbb{E}_{p(\hat{x},x)}[d(x,\hat{x})] \le D} I(X;\hat{X})$
as the minimum achievable rate under distortion constraint $D$. Then an encoding that achieves this rate is
$p(\hat{x} \mid x) = \frac{p(\hat{x}) \exp[-\beta\, d(x,\hat{x})]}{Z(x,\beta)}$.

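16 Rate Distortion Theory. Sketch: How did we get there?
(1) Original problem: $R(D) = \min_{p(\hat{x} \mid x)\,:\,\mathbb{E}_{p(\hat{x},x)}[d(x,\hat{x})] \le D} I(X;\hat{X})$
(2) Lagrange multiplier: $F[p(\hat{x} \mid x)] = I(X;\hat{X}) + \beta\, \mathbb{E}_{p(\hat{x},x)}[d(x,\hat{x})]$
(3) Optimality condition: $\frac{\delta F}{\delta p(\hat{x} \mid x)} = 0$, which gives
$0 = p(x)\left[\log \frac{p(\hat{x} \mid x)}{p(\hat{x})} + \beta\, d(x,\hat{x}) + \frac{\lambda(x)}{p(x)}\right]$
$\log \frac{p(\hat{x} \mid x)}{p(\hat{x})} = -\beta\, d(x,\hat{x}) - \frac{\lambda(x)}{p(x)}$
(4) $p(\hat{x} \mid x) = p(\hat{x}) \exp(-\beta\, d(x,\hat{x})) \exp\left(-\frac{\lambda(x)}{p(x)}\right)$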


18 Rate Distortion Theory. Sketch: How did we get there?
1. We have one unknown, $p(\hat{x} \mid x)$, for our desideratum: minimize $I(X;\hat{X})$ over $p(\hat{x} \mid x)$.
2. We have two constraints, $\mathbb{E}_{p(\hat{x},x)}[d(x,\hat{x})] \le D$ and $\sum_{\hat{x}} p(\hat{x} \mid x) = 1$.
3. We can relax the first constraint and form a functional $F[p(\hat{x} \mid x)] = I(X;\hat{X}) + \beta\, \mathbb{E}_{p(\hat{x},x)}[d(x,\hat{x})]$.
4. We know about Lagrange multipliers.

19 Computational Solutions. Blahut-Arimoto Algorithm. Iterate these two functions:
$p_{t+1}(\hat{x}) = \sum_x p(x)\, p_t(\hat{x} \mid x)$
$p_{t+1}(\hat{x} \mid x) = \frac{p_t(\hat{x}) \exp[-\beta\, d(x,\hat{x})]}{Z(x,\beta)}$
Further, we are guaranteed convergence as $F$ is convex.
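The two updates can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from the talk: the function name, the uniform initialization, and the fixed iteration count are assumptions. The example uses a uniform binary source with Hamming distortion, a case where the fixed point is known in closed form.

```python
import numpy as np

def blahut_arimoto(px, d, beta, n_iter=200):
    """Alternate the two updates for source px (n,) and distortion d[x, xhat] (n, m).

    Returns the encoder p(xhat|x), the rate I(X; Xhat) in bits, and the
    expected distortion. beta multiplies d inside a natural exponential.
    """
    px = np.asarray(px, dtype=float)
    d = np.asarray(d, dtype=float)
    q = np.full(d.shape[1], 1.0 / d.shape[1])        # p_0(xhat): uniform start (assumption)
    for _ in range(n_iter):
        w = q[None, :] * np.exp(-beta * d)           # p_t(xhat) exp[-beta d(x, xhat)]
        enc = w / w.sum(axis=1, keepdims=True)       # divide by Z(x, beta)
        q = px @ enc                                 # p(xhat) = sum_x p(x) p(xhat|x)
    joint = px[:, None] * enc                        # p(x, xhat)
    nz = joint > 0
    rate = float(np.sum(joint[nz] * np.log2((enc / q[None, :])[nz])))
    distortion = float(np.sum(joint * d))
    return enc, rate, distortion

# Uniform binary source with Hamming distortion (0 if xhat == x, else 1).
px = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
enc, rate, dist = blahut_arimoto(px, d, beta=2.0)
```

For this symmetric case the expected distortion is $1/(1+e^{\beta})$ and the rate is one bit minus the binary entropy of that distortion; sweeping `beta` traces out the curve $R(D)$.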


21 Information Bottleneck

22 Bottleneck Slides. X = Weather in Florida, Y = Price of Oranges

23 Bottleneck Slides. $\hat{X}$ = Most Relevant Parts of X. X = Weather in Florida, Y = Price of Oranges

24 More Preliminaries.
Entropy (description length) under $X \sim p(x)$: $H(X) = \mathbb{E}_{p(x)}[-\log p(x)] = -\sum_x p(x) \log p(x)$.
Mutual Information (transmission rate bound): $I(X;Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right]$.
Kullback-Leibler Divergence (misspecified description length): $D_{KL}[p \,\|\, q] = \mathbb{E}_{p(x)}[\log p(x) - \log q(x)] = \underbrace{\mathbb{E}_{p}[-\log q(x)]}_{\text{Cross-entropy}} - H(X)$.
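The "misspecified description length" reading can be checked numerically: the KL divergence is the extra cost, in bits, of coding samples from $p$ with a code designed for $q$. The two distributions below are arbitrary examples, and the helper names are illustrative.

```python
import numpy as np

def entropy(p):
    """H(p) in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cross_entropy(p, q):
    """E_p[-log2 q(x)]: expected code length when data from p uses a code built for q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_divergence(p, q):
    """D_KL[p || q] in bits, computed directly from its definition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * (np.log2(p[nz]) - np.log2(q[nz]))))

p = np.array([0.5, 0.25, 0.25])      # hypothetical true distribution
q = np.array([1/3, 1/3, 1/3])        # hypothetical (wrong) model
kl = kl_divergence(p, q)
gap = cross_entropy(p, q) - entropy(p)
```

Both `kl` and `gap` come out to the same non-negative number, and the divergence is zero when the model matches the data distribution.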

25 Information Bottleneck. What makes a good encoding? (With respect to Y.) Low Rate: we want short messages. High Relevance: we want our messages to be relevant to some outside variable Y. Objective: given $L$, minimize $I(X;\hat{X})$ over $p(\hat{x} \mid x)$ subject to $I(\hat{X};Y) > L$.

26 Bottleneck Slides. The Information Bottleneck (Tishby, Pereira, and Bialek). Define the function
$R(L) = \min_{p(\hat{x} \mid x)\,:\,I(\hat{X};Y) \ge L} I(X;\hat{X})$
as the minimum achievable rate while preserving $L$ bits of mutual information. Then an encoding that achieves this rate has the form
$p(\hat{x} \mid x) = \frac{p(\hat{x})}{Z(x,\beta)} \exp\left(-\beta\, D_{KL}[\,p(y \mid x) \,\|\, p(y \mid \hat{x})\,]\right)$.

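27 Rate Distortion Theory. Sketch: How did we get there?
1. We have one unknown, $p(\hat{x} \mid x)$, for our desideratum: minimize $I(X;\hat{X})$ over $p(\hat{x} \mid x)$.
2. We have three constraints: $I(\hat{X};Y) \ge L$, $\sum_{\hat{x}} p(\hat{x} \mid x) = 1$, and $\sum_y p(y \mid \hat{x}) = 1$.
3. We can relax the first constraint and form a functional $F[p(\hat{x} \mid x)] = I(X;\hat{X}) - \beta\, I(\hat{X};Y)$. The relevance term enters with a minus sign because it is to be kept large while the rate is minimized.
4. We know about Lagrange multipliers.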

28 Bottleneck Computation. Information Bottleneck Algorithm. Iterate these three functions:
$p_{t+1}(\hat{x} \mid x) = \frac{p_t(\hat{x})}{Z(x,\beta)} \exp\left[-\beta\, D_{KL}[\,p(y \mid x) \,\|\, p_t(y \mid \hat{x})\,]\right]$
$p_{t+1}(\hat{x}) = \sum_x p(x)\, p_t(\hat{x} \mid x)$
$p_{t+1}(y \mid \hat{x}) = \sum_x p(y \mid x)\, p_t(x \mid \hat{x})$
In general, this will only converge locally, but we have a bound on the amount of information about $Y$ still in $X$ but not in $\hat{X}$, given by $I(X;Y)$.
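A minimal NumPy sketch of these three updates follows. It is an illustration under stated assumptions rather than the reference implementation: the random initialization, the log-domain normalization, the small constants guarding `log(0)`, and the toy joint distribution are all choices made here. $p_t(x \mid \hat{x})$ is obtained from the current encoder by Bayes' rule.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """KL[p_i || q_j] in nats for every pair of rows; p is (n, k), q is (m, k)."""
    logp = np.log(p + eps)
    logq = np.log(q + eps)
    return np.sum(p[:, None, :] * (logp[:, None, :] - logq[None, :, :]), axis=2)

def information_bottleneck(pxy, n_codes, beta, n_iter=300, seed=0):
    """Iterate the three self-consistent updates; returns p(xhat|x) as an (n, m) array."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                        # p(x)
    py_x = pxy / px[:, None]                    # p(y|x)
    enc = rng.random((len(px), n_codes))        # random initial encoder (assumption)
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        q = px @ enc                                       # p(xhat)
        px_xhat = enc * px[:, None] / (q[None, :] + 1e-300)  # Bayes: p(x|xhat)
        py_xhat = px_xhat.T @ py_x                         # p(y|xhat) = sum_x p(y|x) p(x|xhat)
        # Encoder update in the log domain to avoid underflow, then normalize by Z(x, beta).
        logw = np.log(q + 1e-300)[None, :] - beta * kl_rows(py_x, py_xhat)
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        enc = w / w.sum(axis=1, keepdims=True)
    return enc

# Toy joint: x0 and x1 predict y the same way, as do x2 and x3.
py_x = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
pxy = 0.25 * py_x
enc = information_bottleneck(pxy, n_codes=2, beta=10.0)
```

Inputs with identical $p(y \mid x)$ receive identical encoder rows after the first update, so in this toy case the encoder groups inputs by what they say about $Y$. Because convergence is only local, different seeds can in general reach different solutions.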

29 Bottleneck parameter. Not shown: optimization trajectories.

30 Similar in effect to clustering (receiving a partition of $x$), but:
No guarantee of disentangled representations.
Optimizing over $p(\hat{x} \mid x)$.
Relevance!
Has similar problems to solve (e.g. choosing the size of the encoding $\hat{x}$).

31 Multivariate Bottleneck

32 Multivariate IB. Preliminaries:
$I(X_1,\ldots,X_n) = D_{KL}[\,P(X_1,\ldots,X_n) \,\|\, P(X_1)\cdots P(X_n)\,] = \mathbb{E}_P\left[\log \frac{P(X_1,\ldots,X_n)}{P(X_1)\cdots P(X_n)}\right]$
$P$ consistent with DAG $G$ $\Rightarrow$ $P \models G$ $\Rightarrow$ $P(X_1,\ldots,X_n) = \prod_i P(X_i \mid \mathrm{Parents}_G(X_i))$

33 Multivariate IB. If $P \models G$ then
$I(X_1,\ldots,X_n) = \sum_i I(X_i;\, \mathrm{Parents}_G(X_i))$
In general,
$D(P \,\|\, P_G) = \sum_i I(X_i;\, \mathrm{NotParents}_G(X_i) \mid \mathrm{Parents}_G(X_i)) = I(X_1,\ldots,X_n) - I^{G}$
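The decomposition over parents can be verified numerically on a small example. The sketch below builds a joint distribution that factorizes along the chain $X_1 \to X_2 \to X_3$ and compares the multi-information with the sum of parent-child mutual informations. The conditional probability tables are made up for illustration.

```python
import numpy as np

def mutual_information(pab):
    """I(A;B) in bits for a joint table pab[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    nz = pab > 0
    return float(np.sum(pab[nz] * np.log2(pab[nz] / (pa * pb)[nz])))

def multi_information(p):
    """KL (bits) between a joint p[x1, x2, x3] and the product of its marginals."""
    p1 = p.sum(axis=(1, 2))
    p2 = p.sum(axis=(0, 2))
    p3 = p.sum(axis=(0, 1))
    prod = p1[:, None, None] * p2[None, :, None] * p3[None, None, :]
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / prod[nz])))

# Hypothetical chain X1 -> X2 -> X3 over binary variables.
p1 = np.array([0.3, 0.7])
p2_given_1 = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows index x1
p3_given_2 = np.array([[0.6, 0.4], [0.1, 0.9]])   # rows index x2
joint = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_2[None, :, :]

total = multi_information(joint)
# X1 has no parents; X2's parent is X1; X3's parent is X2.
by_parents = mutual_information(joint.sum(axis=2)) + mutual_information(joint.sum(axis=0))
```

The two quantities agree because the joint was constructed to be consistent with the chain; for a joint that does not factorize this way, the difference is the conditional term in the general expression.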

34 Multivariate IB. [Figure: graph $G_{\text{input}}$ and graph $G_{\text{output}}$.] Using $L = (1+\gamma)\, I(G_{\text{input}}) - \gamma\, I(G_{\text{output}})$, this produces the regular IB.

35 Multivariate IB. [Figure: two graphs over $X$, $Y$, $T_1$, $T_2$.]

36 Scaling this might be hard.

37 Modern Bottleneck

38 Deep Variational IB. Alemi et al. 2016, very similar to Achille & Soatto


42 Deep Variational IB. Main ideas of Alemi et al. / Achille & Soatto 2016:
1. Deep networks are great function approximators.
2. We want to optimize $p(\hat{x} \mid x)$, but propagating error past the stochastic loss $L(x, y, p) = p(y \mid \hat{x})\, p(\hat{x} \mid y)$ is hard.
3. Using technology from Variational Autoencoders, we can propagate derivatives from $p(y \mid \hat{x})$ to $p(\hat{x} \mid x)$. (The re-parameterization trick of Kingma & Welling 2014.)
4. Problems: calculating mutual information is actually quite difficult. Using an independent KDE estimator here is perhaps not optimal.
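The re-parameterization trick in point 3 can be illustrated without a deep learning framework. The sketch below is a toy example, not the Deep Variational IB model: it assumes a one-dimensional Gaussian code and a stand-in loss $f(z) = z^2$, and shows how writing the sample as a deterministic function of the parameters plus independent noise lets derivatives pass through the sampling step.

```python
import numpy as np

def reparam_gradients(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimates of d/dmu and d/dsigma of E[f(z)], z ~ N(mu, sigma^2).

    Uses z = mu + sigma * eps with eps ~ N(0, 1), so z is differentiable in
    (mu, sigma) for each fixed eps. The toy loss is f(z) = z**2.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)   # noise that does not depend on the parameters
    z = mu + sigma * eps                   # re-parameterized sample
    df_dz = 2.0 * z                        # derivative of the toy loss
    grad_mu = float(np.mean(df_dz))        # chain rule with dz/dmu = 1
    grad_sigma = float(np.mean(df_dz * eps))  # chain rule with dz/dsigma = eps
    return grad_mu, grad_sigma

grad_mu, grad_sigma = reparam_gradients(mu=1.5, sigma=0.5)
```

For this toy loss $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$, and the estimates should land close to those values. In a real model the encoder network would output `mu` and `sigma`, and an automatic differentiation library would apply the same chain rule.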

43 Other recent developments

44 Connections to Deep Learning Stolen directly from Tishby and Zaslavsky

45 Main Points of Tishby. Claims:
1. To learn is to forget irrelevant information.
2. The layers of a deep network are (iteratively) applying a bottleneck principle.
3. The final layers should hopefully have only relevant information.
4. (Not in the paper) Backprop training produces a learning pattern w.r.t. the bottleneck objectives.

46 Counter Arguments (Under Review). Abstract from Anon. ICLR submission this year.

47 Shannon's Warning

48 Other recent developments
