Understanding Deep Learning via Physics:

1 Understanding Deep Learning via Physics: The use of Quantum Entanglement for studying the Inductive Bias of Convolutional Networks Nadav Cohen Institute for Advanced Study Symposium on Physics and Machine Learning The City University of New York Graduate Center 15 December 2017

2 Sources
Deep SimNets - C, Sharir and Shashua - Computer Vision and Pattern Recognition (CVPR) 2016
On the Expressive Power of Deep Learning: A Tensor Analysis - C, Sharir and Shashua - Conference on Learning Theory (COLT) 2016
Convolutional Rectifier Networks as Generalized Tensor Decompositions - C and Shashua - International Conference on Machine Learning (ICML) 2016
Inductive Bias of Deep Convolutional Networks through Pooling Geometry - C and Shashua - International Conference on Learning Representations (ICLR) 2017
Tensorial Mixture Models - Sharir, Tamari, C and Shashua - arXiv preprint 2017
Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions - C, Tamari and Shashua - arXiv preprint 2017
Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design - Levine, Yakira, C and Shashua - arXiv preprint 2017

3 Collaborators Or Sharir Yoav Levine Amnon Shashua Ronen Tamari David Yakira

4 Outline
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

5 Perspective: Understanding Deep Learning - Statistical Learning Setup
$\mathcal{X}$ - instance space (e.g. $\mathbb{R}^{100\times 100}$ for 100-by-100 grayscale images)
$\mathcal{Y}$ - label space (e.g. $\mathbb{R}$ for regression or $[k]:=\{1,\ldots,k\}$ for classification)
$\mathcal{D}$ - distribution over $\mathcal{X}\times\mathcal{Y}$ (unknown)
$\ell:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\geq 0}$ - loss func (e.g. $\ell(y,\hat{y})=(y-\hat{y})^2$ for $\mathcal{Y}=\mathbb{R}$)
Task: Given training sample $S=\{(X_1,y_1),\ldots,(X_m,y_m)\}$ drawn i.i.d. from $\mathcal{D}$, return hypothesis (predictor) $h:\mathcal{X}\to\mathcal{Y}$ that minimizes the population loss: $L_{\mathcal{D}}(h):=\mathbb{E}_{(X,y)\sim\mathcal{D}}[\ell(y,h(X))]$
Approach: Predetermine hypotheses space $\mathcal{H}\subset\mathcal{Y}^{\mathcal{X}}$, and return hypothesis $h\in\mathcal{H}$ that minimizes the empirical loss: $L_S(h):=\mathbb{E}_{(X,y)\sim S}[\ell(y,h(X))]=\frac{1}{m}\sum_{i=1}^{m}\ell(y_i,h(X_i))$
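To make the empirical-loss notation above concrete, here is a minimal numpy sketch (entirely illustrative: synthetic data, a hypothetical linear predictor, and the squared loss from the example):

    import numpy as np

    # Hypothetical training sample S = {(X_i, y_i)}_{i=1..m} with X = R^2, Y = R
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                       # m = 100 instances
    y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=100)

    def sq_loss(y_true, y_pred):                        # l(y, y_hat) = (y - y_hat)^2
        return (y_true - y_pred) ** 2

    def empirical_loss(h, X, y):                        # L_S(h) = (1/m) sum_i l(y_i, h(X_i))
        return np.mean(sq_loss(y, h(X)))

    def h(X):                                           # one candidate hypothesis (a linear predictor)
        return X @ np.array([1.0, -2.0])

    print(empirical_loss(h, X, y))                      # small, since h matches the data-generating rule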

6 Perspective: Understanding Deep Learning - Three Pillars of Statistical Learning Theory: Expressiveness, Generalization and Optimization
[Figure: error decomposition - training error (optimization) between $h$ and $h^*_S$, estimation error (generalization) between $h^*_S$ and $h^*_{\mathcal{D}}$, approximation error (expressiveness) between $h^*_{\mathcal{D}}$ and $f^*_{\mathcal{D}}$]
$f^*_{\mathcal{D}}$ - ground truth ($\mathrm{argmin}_{f\in\mathcal{Y}^{\mathcal{X}}}L_{\mathcal{D}}(f)$)
$h^*_{\mathcal{D}}$ - optimal hypothesis ($\mathrm{argmin}_{h\in\mathcal{H}}L_{\mathcal{D}}(h)$)
$h^*_S$ - empirically optimal hypothesis ($\mathrm{argmin}_{h\in\mathcal{H}}L_S(h)$)
$h$ - returned hypothesis

7 Perspective: Understanding Deep Learning - Classical Machine Learning
Well developed theory
[Figure: training error (optimization), estimation error (generalization), approximation error (expressiveness)]
Optimization: Empirical loss minimization is a convex program: $h\approx h^*_S$ (training err $\to 0$)
Expressiveness & Generalization: Bias-variance trade-off: as $\mathcal{H}$ expands, approximation err shrinks but estimation err grows

8 Perspective: Understanding Deep Learning - Deep Learning
Not well understood
[Figure: training error (optimization), estimation error (generalization), approximation error (expressiveness)]
Optimization: Empirical loss minimization is a non-convex program:
$h^*_S$ is not unique - many hypotheses have low training err
Stochastic Gradient Descent somehow reaches one of these
Expressiveness & Generalization: Vast difference from classical ML:
Some low training err hypotheses generalize well, others don't
W/typical data, solution returned by SGD often generalizes well
Expanding $\mathcal{H}$ reduces approximation err, but also estimation err!

9 Outline (Convolutional Networks as Hierarchical Tensor Decompositions)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

10 Convolutional Networks as Hierarchical Tensor Decompositions - Convolutional Networks
Most successful deep learning arch to date!
Classic structure: [figure]
Modern variants: [figure]
Traditionally used for images/video, nowadays for audio and text as well

11 Convolutional Networks as Hierarchical Tensor Decompositions - Tensor Product of $L^2$ Spaces
ConvNets realize func over many local elements (e.g. pixels, audio samples)
Let $\mathbb{R}^s$ be the space of such elements (e.g. $\mathbb{R}^3$ for RGB pixels)
Consider:
$L^2(\mathbb{R}^s)$ - space of func over a single element
$L^2((\mathbb{R}^s)^N)$ - space of func over $N$ elements
Fact: $L^2((\mathbb{R}^s)^N)$ is equal to the tensor product of $L^2(\mathbb{R}^s)$ with itself $N$ times: $L^2((\mathbb{R}^s)^N)=L^2(\mathbb{R}^s)\otimes\cdots\otimes L^2(\mathbb{R}^s)$ ($N$ times)
Implication: If $\{f_d(\mathbf{x})\}_{d=1}^{\infty}$ is a basis¹ for $L^2(\mathbb{R}^s)$, the following is a basis for $L^2((\mathbb{R}^s)^N)$: $\{(\mathbf{x}_1,\ldots,\mathbf{x}_N)\mapsto\prod_{i=1}^N f_{d_i}(\mathbf{x}_i)\}_{d_1\ldots d_N=1}^{\infty}$
¹ Set of linearly independent func w/dense span

12 Convolutional Networks as Hierarchical Tensor Decompositions - Coefficient Tensor
For practical purposes, restrict the $L^2(\mathbb{R}^s)$ basis to a finite set: $f_1(\mathbf{x})\ldots f_M(\mathbf{x})$
We call $f_1(\mathbf{x})\ldots f_M(\mathbf{x})$ descriptors
General func over $N$ elements can now be written as: $h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
w/func fully determined by the coefficient tensor $\mathcal{A}\in\mathbb{R}^{M\times\cdots\times M}$ ($N$ times)
Example: 100-by-100 images ($N=10^4$), pixels represented by 256 descriptors ($M=256$). Then, func over images correspond to coeff tensors of order $10^4$ and dim $256$ in each mode
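A small numpy sketch of this representation at toy sizes (N = 3, M = 4, with monomial descriptors standing in for a generic finite basis; all choices here are illustrative, not those of the papers):

    import numpy as np

    N, M = 3, 4                                  # toy sizes (for real images: N ~ 10^4, M ~ 256)
    rng = np.random.default_rng(0)
    A = rng.normal(size=(M,) * N)                # coefficient tensor: order N, dimension M in each mode

    def descriptors(x):                          # f_1(x), ..., f_M(x); simple monomials as a stand-in
        return np.array([x ** d for d in range(M)])

    def h(x1, x2, x3):                           # h = sum_{d1,d2,d3} A[d1,d2,d3] f_{d1}(x1) f_{d2}(x2) f_{d3}(x3)
        return np.einsum('abc,a,b,c->', A, descriptors(x1), descriptors(x2), descriptors(x3))

    print(h(0.5, -1.0, 2.0))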

13 Convolutional Networks as Hierarchical Tensor Decompositions - Decomposing Coefficient Tensor $\leftrightarrow$ Convolutional Arithmetic Circuit
$h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
Coeff tensor $\mathcal{A}$ is exponential (in # of elements $N$) $\Rightarrow$ directly computing a general func is intractable
Observation: Applying a hierarchical decomposition to the coeff tensor gives a ConvNet w/linear activation and product pooling (Convolutional Arithmetic Circuit)!
decomposition type (mode tree, internal ranks etc) $\leftrightarrow$ network structure (depth, width, pooling etc)
decomposition parameters $\leftrightarrow$ network weights

14 Convolutional Networks as Hierarchical Tensor Decompositions - Example 1: CP Decomposition $\leftrightarrow$ Shallow Network
$h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
W/CP decomposition applied to the coeff tensor: $\mathcal{A}=\sum_{\gamma=1}^{r_0}a^{1,1,y}_{\gamma}\,\mathbf{a}^{0,1,\gamma}\otimes\mathbf{a}^{0,2,\gamma}\otimes\cdots\otimes\mathbf{a}^{0,N,\gamma}$
func is computed by a shallow network (single hidden layer, global pooling):
representation: $rep(i,d)=f_d(\mathbf{x}_i)$, $M$ channels
1x1 conv: $conv(j,\gamma)=\langle\mathbf{a}^{0,j,\gamma},rep(j,:)\rangle$, $r_0$ channels
global pooling: $pool(\gamma)=\prod_{j\text{ covers space}}conv(j,\gamma)$
dense (output): $out(y)=\langle\mathbf{a}^{1,1,y},pool(:)\rangle$
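A quick numpy check of this correspondence at toy sizes (all names and dimensions illustrative): constructing the CP tensor explicitly and running the shallow network (1x1 conv, global product pooling, dense output) give the same value.

    import numpy as np

    N, M, r0 = 4, 3, 5
    rng = np.random.default_rng(0)
    a_top = rng.normal(size=r0)                    # a^{1,1,y} for a single output y
    a = rng.normal(size=(N, r0, M))                # a[i, gamma, :] = a^{0,i,gamma}

    # CP tensor: A = sum_gamma a_top[gamma] * a^{0,1,gamma} x ... x a^{0,N,gamma}
    A = np.zeros((M,) * N)
    for g in range(r0):
        outer = a_top[g]
        for i in range(N):
            outer = np.multiply.outer(outer, a[i, g])
        A += outer

    F = rng.normal(size=(N, M))                    # F[i, d] = f_d(x_i): representation layer

    # Direct evaluation: h = sum_d A[d1..dN] prod_i F[i, d_i]
    h_direct = np.einsum('abcd,a,b,c,d->', A, F[0], F[1], F[2], F[3])

    # Shallow ConvAC: 1x1 conv, global product pooling, dense output
    conv = np.einsum('igd,id->ig', a, F)           # conv[i, gamma] = <a^{0,i,gamma}, F[i, :]>
    pool = conv.prod(axis=0)                       # global product pooling over locations i
    h_net = pool @ a_top                           # dense output

    print(np.allclose(h_direct, h_net))            # True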

15 Convolutional Networks as Hierarchical Tensor Decompositions - Example 2: HT Decomposition $\leftrightarrow$ Deep Network
$h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
W/Hierarchical Tucker (HT) decomposition applied to the coeff tensor:
$\phi^{1,j,\gamma}=\sum_{\alpha=1}^{r_0}a^{1,j,\gamma}_{\alpha}\,\mathbf{a}^{0,2j-1,\alpha}\otimes\mathbf{a}^{0,2j,\alpha}$
$\phi^{l,j,\gamma}=\sum_{\alpha=1}^{r_{l-1}}a^{l,j,\gamma}_{\alpha}\,\phi^{l-1,2j-1,\alpha}\otimes\phi^{l-1,2j,\alpha}$
$\mathcal{A}=\sum_{\alpha=1}^{r_{L-1}}a^{L,1,y}_{\alpha}\,\phi^{L-1,1,\alpha}\otimes\phi^{L-1,2,\alpha}$
func is computed by a deep network w/size-2 pooling windows ($L=\log_2 N$ hidden layers):
representation: $rep(i,d)=f_d(\mathbf{x}_i)$
hidden layer 0: $conv_0(j,\gamma)=\langle\mathbf{a}^{0,j,\gamma},rep(j,:)\rangle$, $pool_0(j,\gamma)=\prod_{j'\in\{2j-1,2j\}}conv_0(j',\gamma)$
hidden layer $L-1$: $pool_{L-1}(\gamma)=\prod_{j'\in\{1,2\}}conv_{L-1}(j',\gamma)$
dense (output): $out(y)=\langle\mathbf{a}^{L,1,y},pool_{L-1}(:)\rangle$
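The same kind of check for the HT case with N = 4 (two levels), again with illustrative sizes: contracting the HT-decomposed tensor directly agrees with the deep network computation (1x1 conv and size-2 product pooling, twice, then a dense output).

    import numpy as np

    M, r0, r1 = 3, 4, 5                            # N = 4 inputs, L = 2 levels
    rng = np.random.default_rng(0)
    a0 = rng.normal(size=(4, r0, M))               # a0[j, gamma, :] = a^{0,j,gamma}, j = 1..4
    a1 = rng.normal(size=(2, r1, r0))              # a1[j, gamma, :] = a^{1,j,gamma}, j = 1..2
    aL = rng.normal(size=r1)                       # a^{2,1,y} for a single output y

    # phi^{1,j,gamma} = sum_alpha a1[j,gamma,alpha] * a0[2j-1,alpha] x a0[2j,alpha]
    phi = np.einsum('ga,ad,ae->gde', a1[0], a0[0], a0[1])   # phi^{1,1,:}, shape (r1, M, M)
    psi = np.einsum('ga,ad,ae->gde', a1[1], a0[2], a0[3])   # phi^{1,2,:}, shape (r1, M, M)

    # A = sum_alpha aL[alpha] * phi^{1,1,alpha} x phi^{1,2,alpha}, shape (M, M, M, M)
    A = np.einsum('g,gab,gcd->abcd', aL, phi, psi)

    F = rng.normal(size=(4, M))                    # F[i, d] = f_d(x_i)
    h_direct = np.einsum('abcd,a,b,c,d->', A, F[0], F[1], F[2], F[3])

    # Deep ConvAC: 1x1 conv + size-2 product pooling, twice, then dense output
    c0 = np.einsum('jgd,jd->jg', a0, F)            # conv layer 0: shape (4, r0)
    p0 = c0[0::2] * c0[1::2]                       # size-2 pooling: shape (2, r0)
    c1 = np.einsum('jga,ja->jg', a1, p0)           # conv layer 1: shape (2, r1)
    p1 = c1[0] * c1[1]                             # size-2 pooling: shape (r1,)
    h_net = p1 @ aL                                # dense output

    print(np.allclose(h_direct, h_net))            # True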

16 Convolutional Networks as Hierarchical Tensor Decompositions - Generalization to Other Types of Convolutional Networks
We established the equivalence: hierarchical tensor decompositions $\leftrightarrow$ conv arith circuits (ConvACs)
ConvACs deliver promising empirical results,¹ but other types of ConvNets (e.g. w/ReLU activation and max/ave pooling) are much more common
The equivalence extends to other types of ConvNets if we generalize the notion of tensor product:²
Tensor product: $(\mathcal{A}\otimes\mathcal{B})_{d_1\ldots d_{P+Q}}=\mathcal{A}_{d_1\ldots d_P}\cdot\mathcal{B}_{d_{P+1}\ldots d_{P+Q}}$
Generalized tensor product: $(\mathcal{A}\otimes_g\mathcal{B})_{d_1\ldots d_{P+Q}}:=g(\mathcal{A}_{d_1\ldots d_P},\mathcal{B}_{d_{P+1}\ldots d_{P+Q}})$ (same as $\otimes$ but w/general $g:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ instead of mult)
¹ Deep SimNets, CVPR'16; Tensorial Mixture Models, arXiv'17
² Convolutional Rectifier Networks as Generalized Tensor Decompositions, ICML'16
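A tiny numpy illustration of the generalized tensor product (the choices of g below are purely illustrative; the ICML'16 paper specifies which g corresponds to which activation/pooling pair):

    import numpy as np

    def gen_tensor_product(A, B, g):
        """(A x_g B)[d1..dP, dP+1..dP+Q] := g(A[d1..dP], B[dP+1..dP+Q])."""
        A_exp = A.reshape(A.shape + (1,) * B.ndim)      # broadcast A over B's modes
        B_exp = B.reshape((1,) * A.ndim + B.shape)      # broadcast B over A's modes
        return g(A_exp, B_exp)

    A = np.arange(6.0).reshape(2, 3)
    B = np.arange(4.0).reshape(4)

    standard = gen_tensor_product(A, B, lambda a, b: a * b)        # g = mult: ordinary tensor product
    assert np.allclose(standard, np.multiply.outer(A, B))

    relu_like = gen_tensor_product(A, B, lambda a, b: np.maximum(a + b, 0.0))  # one possible non-multiplicative g
    print(relu_like.shape)                                          # (2, 3, 4)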

17 Outline (Expressive Efficiency)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

18 Expressive Efficiency - Expressiveness
[Figure: approximation error (expressiveness) between $h^*_{\mathcal{D}}$ and $f^*_{\mathcal{D}}$]
$f^*_{\mathcal{D}}$ - ground truth ($\mathrm{argmin}_{f\in\mathcal{Y}^{\mathcal{X}}}L_{\mathcal{D}}(f)$)
$h^*_{\mathcal{D}}$ - optimal hypothesis ($\mathrm{argmin}_{h\in\mathcal{H}}L_{\mathcal{D}}(h)$)
$h^*_S$ - empirically optimal hypothesis ($\mathrm{argmin}_{h\in\mathcal{H}}L_S(h)$)
$h$ - returned hypothesis

19 Expressive Efficiency - Expressive Efficiency (informal)
Expressive efficiency compares network arch in terms of their ability to compactly represent func
Let:
$\mathcal{H}_A$ - space of func compactly representable by network arch $A$
$\mathcal{H}_B$ - space of func compactly representable by network arch $B$
$A$ is efficient w.r.t. $B$ if $\mathcal{H}_A$ is a strict superset of $\mathcal{H}_B$
$A$ is completely efficient w.r.t. $B$ if $\mathcal{H}_B$ has zero volume inside $\mathcal{H}_A$

20 Expressive Efficiency - Expressive Efficiency: Formal Definition
Network arch $A$ is efficient w.r.t. network arch $B$ if:
(1) any func realized by $B$ w/size $r_B$ can be realized by $A$ w/size $r_A\in O(r_B)$
(2) there exist func realized by $A$ w/size $r_A$ requiring $B$ to have size $r_B\in\Omega(f(r_A))$, where $f(\cdot)$ is super-linear
$A$ is completely efficient w.r.t. $B$ if (2) holds for all its func but a set of Lebesgue measure zero (in weight space)

21 Outline (Expressive Efficiency - Efficiency of Depth)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

22 Expressive Efficiency - Efficiency of Depth
Longstanding conjecture - efficiency of depth: deep ConvNets realize func that require shallow ConvNets to have exponential size (width)

23 Expressive Efficiency - Efficiency of Depth: Tensor Decomposition Viewpoint
$h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
Shallow network (input, rep, conv, global pool, dense) $\leftrightarrow$ CP decomposition: $\mathcal{A}=\sum_{\gamma=1}^{r_0}a^{1,1,y}_{\gamma}\,\mathbf{a}^{0,1,\gamma}\otimes\cdots\otimes\mathbf{a}^{0,N,\gamma}$
Deep network (input, rep, conv, pool, ..., conv, pool, dense) $\leftrightarrow$ HT decomposition:
$\phi^{1,j,\gamma}=\sum_{\alpha=1}^{r_0}a^{1,j,\gamma}_{\alpha}\,\mathbf{a}^{0,2j-1,\alpha}\otimes\mathbf{a}^{0,2j,\alpha}$
$\phi^{l,j,\gamma}=\sum_{\alpha=1}^{r_{l-1}}a^{l,j,\gamma}_{\alpha}\,\phi^{l-1,2j-1,\alpha}\otimes\phi^{l-1,2j,\alpha}$
$\mathcal{A}=\sum_{\alpha=1}^{r_{L-1}}a^{L,1,y}_{\alpha}\,\phi^{L-1,1,\alpha}\otimes\phi^{L-1,2,\alpha}$
Efficiency of depth $\leftrightarrow$ HT decomposition realizes tensors that require CP decomposition to have exponential rank ($r_0$ exponential in $N$)

24 Expressive Efficiency - Efficiency of Depth: HT vs. CP Analysis
Theorem: Besides a negligible (zero measure) set, all parameter settings for the HT decomposition lead to tensors w/CP-rank exponential in $N$
HT decomposition:
$\phi^{1,j,\gamma}=\sum_{\alpha=1}^{r_0}a^{1,j,\gamma}_{\alpha}\,\mathbf{a}^{0,2j-1,\alpha}\otimes\mathbf{a}^{0,2j,\alpha}$
$\phi^{l,j,\gamma}=\sum_{\alpha=1}^{r_{l-1}}a^{l,j,\gamma}_{\alpha}\,\phi^{l-1,2j-1,\alpha}\otimes\phi^{l-1,2j,\alpha}$
$\mathcal{A}=\sum_{\alpha=1}^{r_{L-1}}a^{L,1,y}_{\alpha}\,\phi^{L-1,1,\alpha}\otimes\phi^{L-1,2,\alpha}$
CP decomposition: $\mathcal{A}=\sum_{\gamma=1}^{r_0}a^{1,1,y}_{\gamma}\,\mathbf{a}^{0,1,\gamma}\otimes\cdots\otimes\mathbf{a}^{0,N,\gamma}$

25 Expressive Efficiency - Efficiency of Depth: HT vs. CP Analysis (cont'd)
Theorem proof sketch:
$\llbracket\mathcal{A}\rrbracket$ - matricization of $\mathcal{A}$ (arrangement of tensor as matrix)
$\odot$ - Kronecker product for matrices. Holds: $\mathrm{rank}(A\odot B)=\mathrm{rank}(A)\cdot\mathrm{rank}(B)$
Relation between tensor and Kronecker products: $\llbracket\mathcal{A}\otimes\mathcal{B}\rrbracket=\llbracket\mathcal{A}\rrbracket\odot\llbracket\mathcal{B}\rrbracket$
Implies: $\mathrm{rank}\,\llbracket\mathcal{A}\rrbracket\leq\text{CP-rank}(\mathcal{A})$
By induction over the levels of HT, $\mathrm{rank}\,\llbracket\mathcal{A}\rrbracket$ is exponential almost always:
Base: SVD has maximal rank almost always
Step: $\mathrm{rank}\,\llbracket\mathcal{A}\otimes\mathcal{B}\rrbracket=\mathrm{rank}(\llbracket\mathcal{A}\rrbracket\odot\llbracket\mathcal{B}\rrbracket)=\mathrm{rank}\,\llbracket\mathcal{A}\rrbracket\cdot\mathrm{rank}\,\llbracket\mathcal{B}\rrbracket$, and linear combination preserves rank almost always
(HT decomposition as on the previous slide)
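The rank identities driving the proof sketch are easy to probe numerically; a small numpy sketch (np.kron is the Kronecker product) checking that ranks multiply for randomly drawn low-rank matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 5))   # generically rank 2
    B = rng.normal(size=(3, 2)) @ rng.normal(size=(2, 7))   # generically rank 2

    rA = np.linalg.matrix_rank(A)
    rB = np.linalg.matrix_rank(B)
    rAB = np.linalg.matrix_rank(np.kron(A, B))               # Kronecker product of A and B

    print(rA, rB, rAB, rAB == rA * rB)                       # ranks multiply: 2 2 4 True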

26 Expressive Efficiency - Efficiency of Depth: HT vs. CP Analysis (cont'd)
Corollary: Randomizing the weights of a deep ConvAC by a cont distribution leads, w.p. 1, to func that require a shallow ConvAC to have an exponential # of channels
[Diagrams: deep network (rep, 1x1 conv, size-2 pooling, ..., dense output, $L=\log_2 N$ hidden layers of widths $r_0,\ldots,r_{L-1}$) and shallow network (rep, 1x1 conv, global pooling, dense output, width $r_0$)]
W/ConvACs, efficiency of depth holds completely!

27 Expressive Efficiency - Efficiency of Depth: HT vs. CP Analysis Generalizations
HT vs. CP analysis may be generalized in various ways, e.g.:
Comparison between arbitrary depths: penalty in resources is double-exponential w.r.t. # of layers cut off
[Plot: required # of parameters vs. # of layers, growing double-exponentially toward $O(r^{N/2})$ as layers are cut off from the optimal depth $L$]
Adaptation to other types of ConvNets: w/ReLU activation and max pooling, deep nets realize func requiring shallow nets to be exponentially large, but not almost always
Efficiency of depth is incomplete w/ReLU ConvNets!

28 Outline (Expressive Efficiency - Efficiency of Interconnectivity)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

29 Expressive Efficiency - Efficiency of Interconnectivity
Classic ConvNets have a feed-forward (chain) structure
Modern ConvNets employ elaborate connectivity schemes: Inception (GoogLeNet), ResNet, DenseNet
Question: Can such connectivities lead to expressive efficiency?

30 Expressive Efficiency - Efficiency of Interconnectivity: Dilated Convolutional Networks
We focus on dilated ConvNets (D-ConvNets) for sequence data:
1D ConvNets, no pooling, dilated (gapped) conv windows
$N:=2^L$ time points, $L$ hidden layers of widths $r_0,r_1,\ldots,r_{L-1}$ (output width $r_L$)
size-2 convs w/dilations $1,2,\ldots,2^{L-1}$ (layer $l$ uses dilation $2^{l}$)
Underlie Google's WaveNet & ByteNet - state of the art for audio & text!
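A minimal numpy sketch of the dilation pattern described above (size-2 causal filters with dilation doubling per layer, no pooling; a single channel and plain linear filters are used here purely for illustration):

    import numpy as np

    def dilated_conv1d(x, w, dilation):
        """Causal size-2 convolution: y[t] = w[0]*x[t - dilation] + w[1]*x[t]."""
        x_shift = np.concatenate([np.zeros(dilation), x[:-dilation]])   # zero-pad the past
        return w[0] * x_shift + w[1] * x

    rng = np.random.default_rng(0)
    T, L = 32, 4                        # sequence length, # of layers (covers 2^L = 16 time points)
    x = rng.normal(size=T)

    y = x
    for l in range(L):
        w = rng.normal(size=2)
        y = dilated_conv1d(y, w, dilation=2 ** l)   # dilations 1, 2, 4, 8

    print(y.shape)   # (32,): y[t] depends on x[t - 15 .. t], i.e. a window of 2^L time points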

31 Expressive Efficiency - Efficiency of Interconnectivity: Dilations and Mode Trees
W/D-ConvNets, the mode tree underlying the corresponding tensor decomposition determines the dilation scheme
[Figure: two mode trees over time points $\{1,\ldots,16\}$ - the balanced contiguous tree (leaves paired as $\{1,2\},\{3,4\},\ldots$) yields dilations 1, 2, 4, 8; a different mode tree (e.g. leaves paired as $\{1,3\},\{2,4\},\ldots$) yields a permuted dilation scheme]

32 Expressive Efficiency - Efficiency of Interconnectivity: Mixed Tensor Decompositions
Let: $T$, $T'$ - mode trees; $mix(T,T')$ - set of nodes present in both trees
[Figure: mode trees $T$ and $T'$ over $\{1,\ldots,16\}$ and the nodes they share, $mix(T,T')$]
A mixed tensor decomposition blends together $T$ and $T'$ by running their decompositions in parallel, exchanging tensors in each node of $mix(T,T')$

33 Expressive Efficiency - Efficiency of Interconnectivity: Mixed Dilated Convolutional Networks
A mixed tensor decomposition corresponds to a mixed D-ConvNet, formed by interconnecting the networks of $T$ and $T'$
[Figure: network of $T$ (dilations 1, 2, 4, 8) and network of $T'$ (permuted dilations) sharing input and output, with connections between them]

34 Expressive Efficiency - Efficiency of Interconnectivity: Mixture Expressive Efficiency
Theorem: Mixed tensor decomposition of $T$ and $T'$ can generate tensors that require the individual decompositions to grow quadratically (in terms of their ranks)
Corollary: A mixed D-ConvNet can realize func that require the individual networks to grow quadratically (in terms of layer widths)
Experiment: [Plot: TIMIT individual phoneme classification accuracy (train and validation sets) vs. the layer up to which connections are added]
Interconnectivity can lead to expressive efficiency!

35 Outline (Inductive Bias via Quantum Entanglement)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

36 Inductive Bias via Quantum Entanglement - Inductive Bias
Networks of reasonable size can only realize a fraction of all possible func
Expressive efficiency does not explain why this fraction is effective - why are these functions interesting?
[Figure: $\mathcal{H}_A\supset\mathcal{H}_B$ inside the space of all func]
To explain the effectiveness, one must consider the inductive bias:
Not all func are equally useful for a given task
Network only needs to represent useful func

37 Inductive Bias via Quantum Entanglement - Inductive Bias via Physics
Unlike expressive efficiency, inductive bias can't be studied via math alone; it requires reasoning about the nature of real-world tasks
Physics bears the potential to bridge this gap!

38 Inductive Bias via Quantum Entanglement - Modeling Interactions
ConvNets realize func over many local elements (e.g. pixels, audio samples)
Key property of such func: the interactions modeled between different sets of elements
[Figure: for one image, modeling strong interaction between the yellow and blue pixels (partition A vs. partition B) is important; for another it is less important]
Questions: What kind of interactions do ConvNets model? How do these depend on network structure?

39 Inductive Bias via Quantum Entanglement - Quantum Entanglement
In quantum physics, the state of a particle is represented as a vec in a Hilbert space: $|\text{particle state}\rangle=\sum_{d=1}^{M}a_d\,|\psi_d\rangle\in\mathcal{H}$ ($a_d$ - coeff, $|\psi_d\rangle$ - basis)
A system of $N$ particles is represented as a vec in the tensor product space: $|\text{system state}\rangle=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,|\psi_{d_1}\rangle\otimes\cdots\otimes|\psi_{d_N}\rangle\in\mathcal{H}\otimes\cdots\otimes\mathcal{H}$ ($N$ times), w/$\mathcal{A}$ the coeff tensor
Quantum entanglement measures quantify the interactions that a system state models between sets of particles

40 Inductive Bias via Quantum Entanglement - Quantum Entanglement (cont'd)
$|\text{system state}\rangle=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,|\psi_{d_1}\rangle\otimes\cdots\otimes|\psi_{d_N}\rangle$
Consider a partition of the $N$ particles into sets $I$ and $I^c$
$\llbracket\mathcal{A}\rrbracket_I$ - matricization of the coeff tensor $\mathcal{A}$ w.r.t. $I$: arrangement of $\mathcal{A}$ as a matrix whose rows/cols correspond to the modes indexed by $I$/$I^c$

41 Inductive Bias via Quantum Entanglement - Quantum Entanglement (cont'd)
$|\text{system state}\rangle=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,|\psi_{d_1}\rangle\otimes\cdots\otimes|\psi_{d_N}\rangle$; $\llbracket\mathcal{A}\rrbracket_I$ - matricization of $\mathcal{A}$ w.r.t. $I$
Let $\sigma=(\sigma_1,\sigma_2,\ldots,\sigma_R)$ be the singular vals of $\llbracket\mathcal{A}\rrbracket_I$
Entanglement measures between the particles of $I$ and of $I^c$ are based on $\sigma$:
Entanglement Entropy: entropy of $(\sigma_1^2,\ldots,\sigma_R^2)/\|\sigma\|_2^2$
Geometric Measure: $1-\sigma_1^2/\|\sigma\|_2^2$
Schmidt Number: $\|\sigma\|_0=\mathrm{rank}\,\llbracket\mathcal{A}\rrbracket_I$
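These measures are straightforward to compute from a coefficient tensor; a self-contained numpy sketch (toy order-4 tensor, arbitrary partition; helper names are illustrative, not from the papers):

    import numpy as np

    def matricize(A, I):
        """[[A]]_I: rows indexed by the modes in I, columns by the complementary modes."""
        I = list(I)
        Ic = [m for m in range(A.ndim) if m not in I]
        B = np.transpose(A, I + Ic)
        rows = int(np.prod([A.shape[m] for m in I]))
        return B.reshape(rows, -1)

    def entanglement_measures(A, I, tol=1e-10):
        s = np.linalg.svd(matricize(A, I), compute_uv=False)
        p = s ** 2 / np.sum(s ** 2)                          # normalized squared singular values
        entropy = -np.sum(p[p > tol] * np.log(p[p > tol]))   # entanglement entropy
        geometric = 1.0 - p.max()                            # geometric measure
        schmidt = int(np.sum(s > tol * s.max()))             # Schmidt number = rank of [[A]]_I
        return entropy, geometric, schmidt

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 3, 3, 3))                        # order-4 coefficient tensor
    print(entanglement_measures(A, I=[0, 2]))                # partition: modes 0 and 2 vs. modes 1 and 3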

42 Inductive Bias via Quantum Entanglement - Entanglement with ConvACs
Structural equivalence:
quantum (many-body) system state: $|\text{system state}\rangle=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,|\psi_{d_1}\rangle\otimes\cdots\otimes|\psi_{d_N}\rangle$ (coeff tensor $\mathcal{A}$)
func realized by ConvAC: $h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,f_{d_1}(\mathbf{x}_1)\cdots f_{d_N}(\mathbf{x}_N)$ (coeff tensor $\mathcal{A}$)
state of many particles $\leftrightarrow$ func over many pixels
We may quantify the interactions a ConvAC models between input sets by applying entanglement measures to its coeff tensor!

43 Inductive Bias via Quantum Entanglement - Entanglement with ConvACs: Interpretation
When func $h$ realized by a ConvAC is separable w.r.t. input sets $I$/$I^c$: $\exists\,g,g'$ s.t. $h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=g\big((\mathbf{x}_i)_{i\in I}\big)\cdot g'\big((\mathbf{x}_i)_{i\in I^c}\big)$, it does not model any interaction between the input sets
In a stat setting, this corresponds to independence of $(\mathbf{x}_i)_{i\in I}$ and $(\mathbf{x}_i)_{i\in I^c}$
Entanglement measures on the coeff tensor of $h$ quantify dist from separability: $\mathcal{A}$ has high (low) entanglement w.r.t. $I$/$I^c$ $\Rightarrow$ $h$ is far from (close to) separability w.r.t. $I$/$I^c$
Choice of entanglement measure determines the distance metric
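As a concrete sanity check of this interpretation (toy sizes, illustrative construction): a coefficient tensor that factors across the partition I/I^c has Schmidt number 1 and essentially zero entanglement entropy, matching separability of h.

    import numpy as np

    M = 3
    rng = np.random.default_rng(0)
    B = rng.normal(size=(M, M))            # coefficient tensor of g  (modes in I, here I = {1, 2})
    C = rng.normal(size=(M, M))            # coefficient tensor of g' (modes in I^c = {3, 4})
    A = np.multiply.outer(B, C)            # separable coefficient tensor, shape (M, M, M, M)

    mat = A.reshape(M * M, M * M)          # matricization w.r.t. I = {1, 2} (the leading modes)
    s = np.linalg.svd(mat, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)

    print(np.linalg.matrix_rank(mat))                       # 1: Schmidt number of a separable func
    print(-np.sum(p[p > 1e-12] * np.log(p[p > 1e-12])))     # entanglement entropy ~ 0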

44 Inductive Bias via Quantum Entanglement - Quantum Tensor Networks
Coeff tensors of quantum many-body states are simulated via Tensor Networks (TNs): graphs in which vertices $\leftrightarrow$ tensors and edges $\leftrightarrow$ modes
[Figure: a node w/0, 1, 2 or 3 edges stands for a scalar, vector, matrix or order-3 tensor]
An edge (mode) connecting two vertices (tensors) represents contraction: inner-product between vectors, matrix multiplication; edges are weighted by mode dimensions
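In code, contracting a tensor network is just summing over the shared (edge) indices; a minimal einsum sketch of a three-node network (vector - matrix - vector), where each edge is a contracted mode:

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.normal(size=4)          # vector node: one edge (mode) of dimension 4
    W = rng.normal(size=(4, 5))     # matrix node: two edges, dimensions 4 and 5
    v = rng.normal(size=5)          # vector node: one edge of dimension 5

    # Contracting edge 'a' (between u and W) and edge 'b' (between W and v) yields a scalar
    value = np.einsum('a,ab,b->', u, W, v)
    assert np.isclose(value, u @ W @ v)   # the same contraction written as matrix/vector products
    print(value)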

45 Inductive Bias via Quantum Entanglement - ConvACs as Tensor Networks
The coeff tensor of a ConvAC may be represented via a TN:
[Figure: tree-shaped TN - open nodes correspond to ConvAC inputs (e.g. pixels); the tree structure corresponds to ConvAC pooling geometry; edge weights $M,r_0,r_1,\ldots,r_{L-1}$ correspond to ConvAC layer widths]

46 Inductive Bias via Quantum Entanglement - Entanglement via Minimal Cuts
Theorem ("Quantum Max Flow / Min Cut"): The maximal Schmidt entanglement a ConvAC models between input sets $I$/$I^c$ is equal to the min cut in the respective TN separating the nodes of $I$/$I^c$
ConvAC entanglement between input sets = TN min cut separating the respective node sets

47 Inductive Bias via Quantum Entanglement - Controlling Entanglement (Interactions)
Corollary: Controlling the entanglement (interactions) modeled by a ConvAC is equivalent to controlling min cuts in the respective TN
Two sources of control: layer widths (edge weights) and pooling geometry (tree structure)
We may analyze the effect of ConvAC arch on the interactions (entanglement) it can model!

48 Inductive Bias via Quantum Entanglement - Controlling Interactions: Layer Widths
Claim: Deep (early) layer widths are important for long (short)-range interactions
Experiment: [figure omitted]

49 Inductive Bias via Quantum Entanglement - Controlling Interactions: Pooling Geometry
Claim: Input elements pooled together early have stronger interaction
Experiment: square pooling (local interactions) vs. mirror pooling (interactions between reflections), evaluated on a closedness task and a symmetry task [figure: example shapes w/low/high closedness and symmetry]

50 Outline (Conclusion)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

51 Conclusion
Three pillars of statistical learning theory: Expressiveness, Generalization, Optimization
Well developed theory for classical ML; limited understanding for Deep Learning
We derive the equivalence: ConvNets $\leftrightarrow$ hierarchical tensor decompositions, and use it to analyze expressive efficiency
To understand expressiveness, efficiency is not enough; one must consider the inductive bias:
This cannot be done via math alone - Physics bears the potential to bridge this gap!
We use Quantum Entanglement and Tensor Networks to study ConvNets' ability to model interactions

52 Conclusion - Future Possibilities
Further studying the inductive bias of ConvNets via Quantum Physics:
Understanding overlapping operations via MERA disentanglers
Characterizing correlations (interactions) in natural data-sets
Transfer of computational tools between Deep Learning and Physics:
Training ConvNets w/Tensor Network algorithms (e.g. DMRG)
Quantum Computation (wave function reconstruction) w/SGD
AND MUCH MORE...

53 Conclusion - Recent Development (Levine et al): From ConvNets to Recurrent Neural Networks (RNNs)
RNNs: most successful Deep Learning arch for sequence processing
Start-End Entanglement quantifies the long-term memory of a network
Authors analyze this via TNs (MPS and generalizations), and show that Start-End Entanglement increases exponentially w/depth!

54 Outline
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

55 Thank You


More information

Asaf Bar Zvi Adi Hayat. Semantic Segmentation

Asaf Bar Zvi Adi Hayat. Semantic Segmentation Asaf Bar Zvi Adi Hayat Semantic Segmentation Today s Topics Fully Convolutional Networks (FCN) (CVPR 2015) Conditional Random Fields as Recurrent Neural Networks (ICCV 2015) Gaussian Conditional random

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

Multimodal context analysis and prediction

Multimodal context analysis and prediction Multimodal context analysis and prediction Valeria Tomaselli (valeria.tomaselli@st.com) Sebastiano Battiato Giovanni Maria Farinella Tiziana Rotondo (PhD student) Outline 2 Context analysis vs prediction

More information

Loss Functions, Decision Theory, and Linear Models

Loss Functions, Decision Theory, and Linear Models Loss Functions, Decision Theory, and Linear Models CMSC 678 UMBC January 31 st, 2018 Some slides adapted from Hamed Pirsiavash Logistics Recap Piazza (ask & answer questions): https://piazza.com/umbc/spring2018/cmsc678

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Multiclass Logistic Regression. Multilayer Perceptrons (MLPs)

TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Multiclass Logistic Regression. Multilayer Perceptrons (MLPs) TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Multiclass Logistic Regression Multilayer Perceptrons (MLPs) Stochastic Gradient Descent (SGD) 1 Multiclass Classification We consider

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information