Optimal Deep Learning and the Information Bottleneck method


ICRI-CI retreat, Haifa, May 2015
Naftali Tishby, Noga Zaslavsky
School of Engineering and Computer Science, The Edmond & Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem, Israel

Outline

- Deep Neural Networks and Deep Learning
  - What are Deep Neural Networks (DNNs)?
  - The incredible success of DNNs
  - Theoretical challenges
- The Information Bottleneck method
  - Finding (approximate) minimal sufficient statistics
  - DPI & centroid consistency
  - The IB complexity-accuracy tradeoff
  - The nature of the optimal solutions: IB bifurcations
- Bifurcation Theory of Deep Neural Networks
  - Statistical characterizations of Neural Nets
  - Learning optimality and sample complexity bound
  - The connection between NN layers and IB phase transitions
  - Design principles for optimal DNNs

Deep Learning: Neural-Nets strike back


We desperately need a Theory

- Why do DNNs work so well? How can they be improved?
- Optimality bounds
  - What is an optimal DNN?
  - Sample and computational complexity bounds
- Design principles
  - What determines the number & width of the layers?
  - What determines the connectivity and inter-layer connections?
- Interpretability
  - What do the layers/neurons capture/represent?
- Better learning algorithms
  - Is stochastic gradient descent the best we can do?

Deep Neural Nets and Information Theory??
From causal to predictive systems


The Information Bottleneck Method (Tishby, Pereira, Bialek, 1999)

(1) Approximate minimal sufficient statistics. Markov chain: $Y \to X \to S(X)$

$$\hat{X} = \arg\min_{S(X):\ I(S(X);Y) = I(X;Y)} I(S(X);X)$$

Relaxation, given $p(X,Y)$ (Shamir, Sabato, T., TCS 2010):

$$\hat{X} = \arg\min_{p(\hat{x}\mid x)} I(\hat{X};X) - \beta I(\hat{X};Y), \quad \beta > 0$$

(2) A rate-distortion problem with KL-divergence distortion (Bachrach, Navot, T., COLT 2006):

$$d_{IB}(x,\hat{x}) = D[p(y\mid x)\,\|\,p(y\mid\hat{x})]$$

(3) The ONLY distributional quantization measure that satisfies both the DPI (Harremoes-T., ISIT 2008) (f-divergences) and statistical consistency (Bregman divergences).

The Information Bottleneck Method (Tishby, Pereira, Bialek, 1999)

The IB optimality/stationarity equations:

$$\min_{p(\hat{x}\mid x):\ Y \to X \to \hat{X}} I(\hat{X};X) - \beta I(\hat{X};Y), \quad \beta > 0$$

$$p(x\mid\hat{x}) = \frac{p(x)}{Z(x,\beta)} \exp\left(-\beta D[p(y\mid x)\,\|\,p(y\mid\hat{x})]\right)$$

$$Z(x,\beta) = \sum_{\hat{x}} p(\hat{x}) \exp\left(-\beta D[p(y\mid x)\,\|\,p(y\mid\hat{x})]\right)$$

$$p(\hat{x}) = \sum_{x} p(\hat{x}\mid x)\, p(x)$$

$$p(y\mid\hat{x}) = \sum_{x} p(y\mid x)\, p(x\mid\hat{x})$$

Solved by Arimoto-Blahut-like iterations, but with possibly sub-optimal solutions (!), similar to K-means distributional clustering with centroid updates.
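The alternating iterations mentioned above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names (`kl_matrix`, `mutual_information`, `ib_iterate`), the random initialization, the fixed iteration count and the toy joint distribution are all assumptions made here. The update is written for the encoder, $p(\hat{x}\mid x) = p(\hat{x})\exp(-\beta D)/Z(x,\beta)$, which follows from the slide's equation for $p(x\mid\hat{x})$ by Bayes' rule. Like K-means, the loop may stop at a sub-optimal fixed point that depends on the initialization.

```python
import numpy as np

EPS = 1e-12  # numerical floor; an implementation detail of this sketch


def kl_matrix(p, q):
    """D[p_i || q_j] in nats for every pair of rows; shape (len(p), len(q))."""
    p = np.clip(p, EPS, 1.0)
    q = np.clip(q, EPS, 1.0)
    return np.sum(p[:, None, :] * (np.log(p[:, None, :]) - np.log(q[None, :, :])), axis=2)


def mutual_information(p_ab):
    """I(A;B) in nats from a joint probability table p(a, b)."""
    outer = p_ab.sum(axis=1, keepdims=True) * p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > EPS
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / outer[mask])))


def ib_iterate(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Alternate the IB self-consistent updates from a random soft encoder."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                      # assumes p(x) > 0 for every x
    p_y_given_x = p_xy / p_x[:, None]
    enc = rng.random((p_x.size, n_clusters))    # enc[x, x_hat] = p(x_hat | x)
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_xhat = np.clip(p_x @ enc, EPS, None)                  # p(x_hat)
        p_x_given_xhat = enc * p_x[:, None] / p_xhat[None, :]   # Bayes' rule
        p_y_given_xhat = p_x_given_xhat.T @ p_y_given_x         # centroid update
        log_enc = np.log(p_xhat)[None, :] - beta * kl_matrix(p_y_given_x, p_y_given_xhat)
        log_enc -= log_enc.max(axis=1, keepdims=True)           # numerical stability
        enc = np.exp(log_enc)
        enc /= enc.sum(axis=1, keepdims=True)                   # division by Z(x, beta)
    i_x_xhat = mutual_information(enc * p_x[:, None])           # I(X; X_hat)
    i_xhat_y = mutual_information(enc.T @ p_xy)                 # I(X_hat; Y)
    return enc, i_x_xhat, i_xhat_y


# Toy joint distribution over 4 values of x and 2 values of y (made up for illustration).
p_xy = np.array([[0.20, 0.05], [0.18, 0.07], [0.05, 0.20], [0.07, 0.18]])
enc, i_x_xhat, i_xhat_y = ib_iterate(p_xy, n_clusters=2, beta=5.0)
print(i_x_xhat, i_xhat_y, mutual_information(p_xy))
```

Sweeping `beta` and plotting the returned pair of mutual informations traces out one candidate information curve; because of the Markov chain $Y \to X \to \hat{X}$, the second value can never exceed $I(X;Y)$.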

[Figure: information curves in the plane of $I(X;\hat{X})$ and $I(\hat{X};Y)$, with three curves labeled T1, T2 and T3.]

The limit is always an RD-like concave envelope with sub-optimal bifurcations.

Critical points are 2nd-order phase transitions

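The IB bifurcation (phase-transition) points

The IB bifurcation points can be found as follows:

$$\ln p(x\mid\hat{x}) = \ln\frac{p(x)}{Z(x,\beta)} - \beta D[p(y\mid x)\,\|\,p(y\mid\hat{x})]$$

then:

$$\frac{\partial \ln p(x\mid\hat{x})}{\partial\hat{x}} = \beta \sum_{y} p(y\mid x)\, \frac{\partial \ln p(y\mid\hat{x})}{\partial\hat{x}}$$

$$\frac{\partial \ln p(y\mid\hat{x})}{\partial\hat{x}} = \frac{1}{p(y\mid\hat{x})} \sum_{x} p(y\mid x)\, p(x\mid\hat{x})\, \frac{\partial \ln p(x\mid\hat{x})}{\partial\hat{x}}$$

These equations can be combined into two (non-linear) eigenvalue problems:

$$\left[I - \beta C_X(\hat{x},\beta)\right] \frac{\partial \ln p(x\mid\hat{x})}{\partial\hat{x}} = 0$$

$$\left[I - \beta C_Y(\hat{x},\beta)\right] \frac{\partial \ln p(y\mid\hat{x})}{\partial\hat{x}} = 0$$

These eigenvalue problems have non-trivial solutions (eigenvectors) only at the critical bifurcation points (second-order phase transitions).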

$$\beta_c(\hat{x}) = \lambda_2^{-1}\big(C_X(\hat{x},\beta_c)\big) = \lambda_2^{-1}\big(C_Y(\hat{x},\beta_c)\big)$$

IB bifurcation diagram

[Figure: IB bifurcation diagram]
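The slides do not spell out the entries of the bifurcation matrices. The sketch below assumes they are what you get by substituting each derivative equation into the other: $C_X[x,x'] = \sum_y p(y\mid x)\,p(y\mid x')\,p(x'\mid\hat{x})/p(y\mid\hat{x})$ and $C_Y[y,y'] = \sum_x p(y\mid x)\,p(x\mid\hat{x})\,p(y'\mid x)/p(y\mid\hat{x})$. Under that assumption both matrices are row-stochastic, so their largest eigenvalue is 1, and they share their non-zero eigenvalues, so the second eigenvalue gives a candidate $\beta_c = 1/\lambda_2$. The function names and the toy distributions are hypothetical.

```python
import numpy as np


def bifurcation_matrices(p_y_given_x, p_x_given_xhat):
    """Build C_X and C_Y for one fixed x_hat (entries as assumed in the lead-in).

    p_y_given_x:    array of shape (n_x, n_y), rows sum to 1
    p_x_given_xhat: array of shape (n_x,), sums to 1, strictly positive
    """
    p_y_given_xhat = p_x_given_xhat @ p_y_given_x                 # shape (n_y,)
    # b[y, x'] = p(y|x') p(x'|x_hat) / p(y|x_hat)
    b = (p_y_given_x * p_x_given_xhat[:, None]).T / p_y_given_xhat[:, None]
    c_x = p_y_given_x @ b                                         # shape (n_x, n_x)
    c_y = b @ p_y_given_x                                         # shape (n_y, n_y)
    return c_x, c_y


def second_eigenvalue(p_y_given_x, p_x_given_xhat):
    """Second-largest eigenvalue shared by C_X and C_Y.

    Both matrices are similar to M M^T and M^T M for the matrix M below,
    so their non-zero eigenvalues are the squared singular values of M.
    """
    p_y_given_xhat = p_x_given_xhat @ p_y_given_x
    m = np.sqrt(p_x_given_xhat)[:, None] * p_y_given_x / np.sqrt(p_y_given_xhat)[None, :]
    singular_values = np.linalg.svd(m, compute_uv=False)          # sorted, largest first
    return float(singular_values[1] ** 2)


# Toy conditionals, made up for illustration.
p_y_given_x = np.array([[0.80, 0.20], [0.72, 0.28], [0.20, 0.80], [0.28, 0.72]])
p_x_given_xhat = np.array([0.4, 0.3, 0.2, 0.1])

c_x, c_y = bifurcation_matrices(p_y_given_x, p_x_given_xhat)
lam2 = second_eigenvalue(p_y_given_x, p_x_given_xhat)
print("second eigenvalue:", lam2, "candidate critical beta:", 1.0 / lam2)
```

In the slides the matrices depend on $\beta$ through $p(x\mid\hat{x})$, so a true critical point requires the value $1/\lambda_2$ to agree with the $\beta$ at which that distribution was computed; the sketch only evaluates the spectrum for a given distribution.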


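DNNs and the Information Bottleneck

Linearly separable units (pos. stochastic):

$$\ln p(h_{i+1}\mid h_i) = h_{i+1}^T W_i h_i - b(h_i), \qquad \frac{\partial \ln p(h_{i+1}\mid h_i)}{\partial h_{i+1}} = W_i h_i$$

Near the optimal IB curve, the inter-layer mapping is of exponential form:

$$\ln p(h_i\mid h_{i+1}) = h_i^T W_i h_{i+1} - a(h_{i+1}), \qquad \frac{\partial \ln p(h_i\mid h_{i+1})}{\partial h_{i+1}} \approx \frac{\partial \ln p(x\mid\hat{x})}{\partial\hat{x}}$$

where $W_i$ is the i-th layer connection matrix, with $h_i$ in the role of $x$ and $h_{i+1}$ in the role of $\hat{x}$.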
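One concrete reading of the log-linear form for stochastic units can be checked numerically. The sketch below assumes binary units and reads the first expression as $\ln p(h_{i+1}\mid h_i) = h_{i+1}^T W_i h_i - b(h_i)$, with $b(h_i)$ the log-partition term $\sum_j \ln(1 + e^{(W_i h_i)_j})$. Both the binary restriction and this choice of $b$ are assumptions made for illustration, and the names `log_p_layer`, `W`, `h_prev` are hypothetical.

```python
import itertools

import numpy as np


def log_p_layer(h_next, h_prev, W):
    """ln p(h_next | h_prev) = h_next^T W h_prev - b(h_prev) for binary h_next."""
    a = W @ h_prev                        # one pre-activation per unit of the next layer
    b = np.sum(np.logaddexp(0.0, a))      # b(h_prev) = sum_j ln(1 + exp(a_j))
    return float(h_next @ a - b)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))               # connection matrix: 4 units -> 3 units
h_prev = np.array([1.0, 0.0, 1.0, 1.0])

# The probabilities of all 2^3 binary configurations of the next layer sum to 1.
total = sum(
    np.exp(log_p_layer(np.array(h, dtype=float), h_prev, W))
    for h in itertools.product([0, 1], repeat=3)
)
print("normalization:", total)

# Switching on unit j changes the log-probability by (W h_prev)_j,
# the discrete counterpart of the derivative W_i h_i on the slide.
zero = np.zeros(3)
steps = np.array([log_p_layer(np.eye(3)[j], h_prev, W) - log_p_layer(zero, h_prev, W) for j in range(3)])
print("log-probability steps:", steps)
```

With this choice each unit is an independent sigmoidal (logistic) unit, which is why the normalization factorizes over the units of the next layer.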

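DNNs and the Information Bottleneck

Near the optimal IB curve:

$$\frac{\partial \ln p(h_i\mid h_{i+1})}{\partial h_{i+1}} \approx \frac{\partial \ln p(x\mid\hat{x})}{\partial\hat{x}}$$

with $h_i$ in the role of $x$ and $h_{i+1}$ in the role of $\hat{x}$.

But on the optimal IB curve there is a non-trivial derivative only at the IB bifurcation points:

$$\frac{\partial \ln p(h_i\mid h_{i+1})}{\partial h_{i+1}} = W_i h_i = v_2(h_i,\beta_c)$$

where $v_2(x,\beta_c)$ is the second eigenvector of the bifurcation matrix:

$$\left[I - \beta_c C_X(\hat{x},\beta_c)\right] \frac{\partial \ln p(x\mid\hat{x})}{\partial\hat{x}} = \left[I - \beta_c C_X(\hat{x},\beta_c)\right] v_2(x,\beta_c) = 0$$

This provides an equation for the optimal weights:

$$W_i h_i - v_2(h_i,\beta_c) = 0$$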

Optimal design principles

[Figure labels: output layer, hidden layers]

Sample complexity bounds

Real DNNs on the IB plane

Summary

- An Information Theory of Deep Neural Networks
  - Based on the Information Bottleneck (IB) tradeoff
  - Uniquely and consistently quantifies the hidden layers
  - The optimal hidden layers correspond to IB bifurcation points
- New spectral algorithm for finding the IB bifurcation points
  - Determines the number and width of the optimal DNN layers
  - New spectral learning rule: weights are derived from the 2nd eigenvector
- New design principles and finite sample complexity bounds
  - Network structure is determined from the bifurcation diagram
  - Finite sample bounds from mutual-information estimation bounds
  - Stochastic networks are proved to be optimal (in terms of complexity)
- Possible implications for real (biological) layered networks
