Understanding Deep Learning via Physics:

1 Understanding Deep Learning via Physics: The use of Quantum Entanglement for studying the Inductive Bias of Convolutional Networks Nadav Cohen Institute for Advanced Study Symposium on Physics and Machine Learning The City University of New York Graduate Center 15 December 2017

2 Sources
Deep SimNets - C, Sharir and Shashua - Computer Vision and Pattern Recognition (CVPR) 2016
On the Expressive Power of Deep Learning: A Tensor Analysis - C, Sharir and Shashua - Conference on Learning Theory (COLT) 2016
Convolutional Rectifier Networks as Generalized Tensor Decompositions - C and Shashua - International Conference on Machine Learning (ICML) 2016
Inductive Bias of Deep Convolutional Networks through Pooling Geometry - C and Shashua - International Conference on Learning Representations (ICLR) 2017
Tensorial Mixture Models - Sharir, Tamari, C and Shashua - arXiv preprint 2017
Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions - C, Tamari and Shashua - arXiv preprint 2017
Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design - Levine, Yakira, C and Shashua - arXiv preprint 2017

3 Collaborators Or Sharir Yoav Levine Amnon Shashua Ronen Tamari David Yakira

4 Outline
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

5 Perspective: Understanding Deep Learning - Statistical Learning Setup
$\mathcal{X}$ - instance space (e.g. $\mathbb{R}^{100\times 100}$ for 100-by-100 grayscale images)
$\mathcal{Y}$ - label space (e.g. $\mathbb{R}$ for regression or $[k]:=\{1,\ldots,k\}$ for classification)
$\mathcal{D}$ - distribution over $\mathcal{X}\times\mathcal{Y}$ (unknown)
$\ell:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\geq 0}$ - loss func (e.g. $\ell(y,\hat{y})=(y-\hat{y})^2$ for $\mathcal{Y}=\mathbb{R}$)
Task: Given training sample $S=\{(X_1,y_1),\ldots,(X_m,y_m)\}$ drawn i.i.d. from $\mathcal{D}$, return hypothesis (predictor) $h:\mathcal{X}\to\mathcal{Y}$ that minimizes the population loss: $L_{\mathcal{D}}(h):=\mathbb{E}_{(X,y)\sim\mathcal{D}}[\ell(y,h(X))]$
Approach: Predetermine hypotheses space $\mathcal{H}\subset\mathcal{Y}^{\mathcal{X}}$, and return hypothesis $h\in\mathcal{H}$ that minimizes the empirical loss: $L_S(h):=\mathbb{E}_{(X,y)\sim S}[\ell(y,h(X))]=\frac{1}{m}\sum_{i=1}^{m}\ell(y_i,h(X_i))$
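To make the empirical-loss notation above concrete, here is a minimal numpy sketch (entirely illustrative: synthetic data, a hypothetical linear predictor, and the squared loss from the example):

    import numpy as np

    # Hypothetical training sample S = {(X_i, y_i)}_{i=1..m} with X = R^2, Y = R
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                       # m = 100 instances
    y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=100)

    def sq_loss(y_true, y_pred):                        # l(y, y_hat) = (y - y_hat)^2
        return (y_true - y_pred) ** 2

    def empirical_loss(h, X, y):                        # L_S(h) = (1/m) sum_i l(y_i, h(X_i))
        return np.mean(sq_loss(y, h(X)))

    def h(X):                                           # one candidate hypothesis (a linear predictor)
        return X @ np.array([1.0, -2.0])

    print(empirical_loss(h, X, y))                      # small, since h matches the data-generating rule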

6 Perspective: Understanding Deep Learning - Three Pillars of Statistical Learning Theory: Expressiveness, Generalization and Optimization
[Figure: error decomposition - training error (optimization) between $h$ and $h^*_S$, estimation error (generalization) between $h^*_S$ and $h^*_{\mathcal{D}}$, approximation error (expressiveness) between $h^*_{\mathcal{D}}$ and $f^*_{\mathcal{D}}$]
$f^*_{\mathcal{D}}$ - ground truth ($\mathrm{argmin}_{f\in\mathcal{Y}^{\mathcal{X}}}L_{\mathcal{D}}(f)$)
$h^*_{\mathcal{D}}$ - optimal hypothesis ($\mathrm{argmin}_{h\in\mathcal{H}}L_{\mathcal{D}}(h)$)
$h^*_S$ - empirically optimal hypothesis ($\mathrm{argmin}_{h\in\mathcal{H}}L_S(h)$)
$h$ - returned hypothesis

7 Perspective: Understanding Deep Learning - Classical Machine Learning
Well developed theory
[Figure: training error (optimization), estimation error (generalization), approximation error (expressiveness)]
Optimization: Empirical loss minimization is a convex program: $h\approx h^*_S$ (training err $\to 0$)
Expressiveness & Generalization: Bias-variance trade-off: as $\mathcal{H}$ expands, approximation err shrinks but estimation err grows

8 Perspective: Understanding Deep Learning - Deep Learning
Not well understood
[Figure: training error (optimization), estimation error (generalization), approximation error (expressiveness)]
Optimization: Empirical loss minimization is a non-convex program:
$h^*_S$ is not unique - many hypotheses have low training err
Stochastic Gradient Descent somehow reaches one of these
Expressiveness & Generalization: Vast difference from classical ML:
Some low training err hypotheses generalize well, others don't
W/typical data, solution returned by SGD often generalizes well
Expanding $\mathcal{H}$ reduces approximation err, but also estimation err!

9 Outline (Convolutional Networks as Hierarchical Tensor Decompositions)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

10 Convolutional Networks as Hierarchical Tensor Decompositions - Convolutional Networks
Most successful deep learning arch to date!
Classic structure: [figure]
Modern variants: [figure]
Traditionally used for images/video, nowadays for audio and text as well

11 Convolutional Networks as Hierarchical Tensor Decompositions - Tensor Product of $L^2$ Spaces
ConvNets realize func over many local elements (e.g. pixels, audio samples)
Let $\mathbb{R}^s$ be the space of such elements (e.g. $\mathbb{R}^3$ for RGB pixels)
Consider:
$L^2(\mathbb{R}^s)$ - space of func over a single element
$L^2((\mathbb{R}^s)^N)$ - space of func over $N$ elements
Fact: $L^2((\mathbb{R}^s)^N)$ is equal to the tensor product of $L^2(\mathbb{R}^s)$ with itself $N$ times: $L^2((\mathbb{R}^s)^N)=L^2(\mathbb{R}^s)\otimes\cdots\otimes L^2(\mathbb{R}^s)$ ($N$ times)
Implication: If $\{f_d(\mathbf{x})\}_{d=1}^{\infty}$ is a basis¹ for $L^2(\mathbb{R}^s)$, the following is a basis for $L^2((\mathbb{R}^s)^N)$: $\{(\mathbf{x}_1,\ldots,\mathbf{x}_N)\mapsto\prod_{i=1}^N f_{d_i}(\mathbf{x}_i)\}_{d_1\ldots d_N=1}^{\infty}$
¹ Set of linearly independent func w/dense span

12 Convolutional Networks as Hierarchical Tensor Decompositions - Coefficient Tensor
For practical purposes, restrict the $L^2(\mathbb{R}^s)$ basis to a finite set: $f_1(\mathbf{x})\ldots f_M(\mathbf{x})$
We call $f_1(\mathbf{x})\ldots f_M(\mathbf{x})$ descriptors
General func over $N$ elements can now be written as: $h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
w/func fully determined by the coefficient tensor $\mathcal{A}\in\mathbb{R}^{M\times\cdots\times M}$ ($N$ times)
Example: 100-by-100 images ($N=10^4$), pixels represented by 256 descriptors ($M=256$). Then, func over images correspond to coeff tensors of order $10^4$ and dim $256$ in each mode
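A small numpy sketch of this representation at toy sizes (N = 3, M = 4, with monomial descriptors standing in for a generic finite basis; all choices here are illustrative, not those of the papers):

    import numpy as np

    N, M = 3, 4                                  # toy sizes (for real images: N ~ 10^4, M ~ 256)
    rng = np.random.default_rng(0)
    A = rng.normal(size=(M,) * N)                # coefficient tensor: order N, dimension M in each mode

    def descriptors(x):                          # f_1(x), ..., f_M(x); simple monomials as a stand-in
        return np.array([x ** d for d in range(M)])

    def h(x1, x2, x3):                           # h = sum_{d1,d2,d3} A[d1,d2,d3] f_{d1}(x1) f_{d2}(x2) f_{d3}(x3)
        return np.einsum('abc,a,b,c->', A, descriptors(x1), descriptors(x2), descriptors(x3))

    print(h(0.5, -1.0, 2.0))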

13 Convolutional Networks as Hierarchical Tensor Decompositions - Decomposing Coefficient Tensor $\leftrightarrow$ Convolutional Arithmetic Circuit
$h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
Coeff tensor $\mathcal{A}$ is exponential (in # of elements $N$) $\Rightarrow$ directly computing a general func is intractable
Observation: Applying a hierarchical decomposition to the coeff tensor gives a ConvNet w/linear activation and product pooling (Convolutional Arithmetic Circuit)!
decomposition type (mode tree, internal ranks etc) $\leftrightarrow$ network structure (depth, width, pooling etc)
decomposition parameters $\leftrightarrow$ network weights

14 Convolutional Networks as Hierarchical Tensor Decompositions - Example 1: CP Decomposition $\leftrightarrow$ Shallow Network
$h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
W/CP decomposition applied to the coeff tensor: $\mathcal{A}=\sum_{\gamma=1}^{r_0}a^{1,1,y}_{\gamma}\,\mathbf{a}^{0,1,\gamma}\otimes\mathbf{a}^{0,2,\gamma}\otimes\cdots\otimes\mathbf{a}^{0,N,\gamma}$
func is computed by a shallow network (single hidden layer, global pooling):
representation: $rep(i,d)=f_d(\mathbf{x}_i)$, $M$ channels
1x1 conv: $conv(j,\gamma)=\langle\mathbf{a}^{0,j,\gamma},rep(j,:)\rangle$, $r_0$ channels
global pooling: $pool(\gamma)=\prod_{j\text{ covers space}}conv(j,\gamma)$
dense (output): $out(y)=\langle\mathbf{a}^{1,1,y},pool(:)\rangle$
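A quick numpy check of this correspondence at toy sizes (all names and dimensions illustrative): constructing the CP tensor explicitly and running the shallow network (1x1 conv, global product pooling, dense output) give the same value.

    import numpy as np

    N, M, r0 = 4, 3, 5
    rng = np.random.default_rng(0)
    a_top = rng.normal(size=r0)                    # a^{1,1,y} for a single output y
    a = rng.normal(size=(N, r0, M))                # a[i, gamma, :] = a^{0,i,gamma}

    # CP tensor: A = sum_gamma a_top[gamma] * a^{0,1,gamma} x ... x a^{0,N,gamma}
    A = np.zeros((M,) * N)
    for g in range(r0):
        outer = a_top[g]
        for i in range(N):
            outer = np.multiply.outer(outer, a[i, g])
        A += outer

    F = rng.normal(size=(N, M))                    # F[i, d] = f_d(x_i): representation layer

    # Direct evaluation: h = sum_d A[d1..dN] prod_i F[i, d_i]
    h_direct = np.einsum('abcd,a,b,c,d->', A, F[0], F[1], F[2], F[3])

    # Shallow ConvAC: 1x1 conv, global product pooling, dense output
    conv = np.einsum('igd,id->ig', a, F)           # conv[i, gamma] = <a^{0,i,gamma}, F[i, :]>
    pool = conv.prod(axis=0)                       # global product pooling over locations i
    h_net = pool @ a_top                           # dense output

    print(np.allclose(h_direct, h_net))            # True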

15 Convolutional Networks as Hierarchical Tensor Decompositions - Example 2: HT Decomposition $\leftrightarrow$ Deep Network
$h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
W/Hierarchical Tucker (HT) decomposition applied to the coeff tensor:
$\phi^{1,j,\gamma}=\sum_{\alpha=1}^{r_0}a^{1,j,\gamma}_{\alpha}\,\mathbf{a}^{0,2j-1,\alpha}\otimes\mathbf{a}^{0,2j,\alpha}$
$\phi^{l,j,\gamma}=\sum_{\alpha=1}^{r_{l-1}}a^{l,j,\gamma}_{\alpha}\,\phi^{l-1,2j-1,\alpha}\otimes\phi^{l-1,2j,\alpha}$
$\mathcal{A}=\sum_{\alpha=1}^{r_{L-1}}a^{L,1,y}_{\alpha}\,\phi^{L-1,1,\alpha}\otimes\phi^{L-1,2,\alpha}$
func is computed by a deep network w/size-2 pooling windows ($L=\log_2 N$ hidden layers):
representation: $rep(i,d)=f_d(\mathbf{x}_i)$
hidden layer 0: $conv_0(j,\gamma)=\langle\mathbf{a}^{0,j,\gamma},rep(j,:)\rangle$, $pool_0(j,\gamma)=\prod_{j'\in\{2j-1,2j\}}conv_0(j',\gamma)$
hidden layer $L-1$: $pool_{L-1}(\gamma)=\prod_{j'\in\{1,2\}}conv_{L-1}(j',\gamma)$
dense (output): $out(y)=\langle\mathbf{a}^{L,1,y},pool_{L-1}(:)\rangle$
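The same kind of check for the HT case with N = 4 (two levels), again with illustrative sizes: contracting the HT-decomposed tensor directly agrees with the deep network computation (1x1 conv and size-2 product pooling, twice, then a dense output).

    import numpy as np

    M, r0, r1 = 3, 4, 5                            # N = 4 inputs, L = 2 levels
    rng = np.random.default_rng(0)
    a0 = rng.normal(size=(4, r0, M))               # a0[j, gamma, :] = a^{0,j,gamma}, j = 1..4
    a1 = rng.normal(size=(2, r1, r0))              # a1[j, gamma, :] = a^{1,j,gamma}, j = 1..2
    aL = rng.normal(size=r1)                       # a^{2,1,y} for a single output y

    # phi^{1,j,gamma} = sum_alpha a1[j,gamma,alpha] * a0[2j-1,alpha] x a0[2j,alpha]
    phi = np.einsum('ga,ad,ae->gde', a1[0], a0[0], a0[1])   # phi^{1,1,:}, shape (r1, M, M)
    psi = np.einsum('ga,ad,ae->gde', a1[1], a0[2], a0[3])   # phi^{1,2,:}, shape (r1, M, M)

    # A = sum_alpha aL[alpha] * phi^{1,1,alpha} x phi^{1,2,alpha}, shape (M, M, M, M)
    A = np.einsum('g,gab,gcd->abcd', aL, phi, psi)

    F = rng.normal(size=(4, M))                    # F[i, d] = f_d(x_i)
    h_direct = np.einsum('abcd,a,b,c,d->', A, F[0], F[1], F[2], F[3])

    # Deep ConvAC: 1x1 conv + size-2 product pooling, twice, then dense output
    c0 = np.einsum('jgd,jd->jg', a0, F)            # conv layer 0: shape (4, r0)
    p0 = c0[0::2] * c0[1::2]                       # size-2 pooling: shape (2, r0)
    c1 = np.einsum('jga,ja->jg', a1, p0)           # conv layer 1: shape (2, r1)
    p1 = c1[0] * c1[1]                             # size-2 pooling: shape (r1,)
    h_net = p1 @ aL                                # dense output

    print(np.allclose(h_direct, h_net))            # True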

16 Convolutional Networks as Hierarchical Tensor Decompositions - Generalization to Other Types of Convolutional Networks
We established the equivalence: hierarchical tensor decompositions $\leftrightarrow$ conv arith circuits (ConvACs)
ConvACs deliver promising empirical results,¹ but other types of ConvNets (e.g. w/ReLU activation and max/ave pooling) are much more common
The equivalence extends to other types of ConvNets if we generalize the notion of tensor product:²
Tensor product: $(\mathcal{A}\otimes\mathcal{B})_{d_1\ldots d_{P+Q}}=\mathcal{A}_{d_1\ldots d_P}\cdot\mathcal{B}_{d_{P+1}\ldots d_{P+Q}}$
Generalized tensor product: $(\mathcal{A}\otimes_g\mathcal{B})_{d_1\ldots d_{P+Q}}:=g(\mathcal{A}_{d_1\ldots d_P},\mathcal{B}_{d_{P+1}\ldots d_{P+Q}})$ (same as $\otimes$ but w/general $g:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ instead of mult)
¹ Deep SimNets, CVPR'16; Tensorial Mixture Models, arXiv'17
² Convolutional Rectifier Networks as Generalized Tensor Decompositions, ICML'16
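A tiny numpy illustration of the generalized tensor product (the choices of g below are purely illustrative; the ICML'16 paper specifies which g corresponds to which activation/pooling pair):

    import numpy as np

    def gen_tensor_product(A, B, g):
        """(A x_g B)[d1..dP, dP+1..dP+Q] := g(A[d1..dP], B[dP+1..dP+Q])."""
        A_exp = A.reshape(A.shape + (1,) * B.ndim)      # broadcast A over B's modes
        B_exp = B.reshape((1,) * A.ndim + B.shape)      # broadcast B over A's modes
        return g(A_exp, B_exp)

    A = np.arange(6.0).reshape(2, 3)
    B = np.arange(4.0).reshape(4)

    standard = gen_tensor_product(A, B, lambda a, b: a * b)        # g = mult: ordinary tensor product
    assert np.allclose(standard, np.multiply.outer(A, B))

    relu_like = gen_tensor_product(A, B, lambda a, b: np.maximum(a + b, 0.0))  # one possible non-multiplicative g
    print(relu_like.shape)                                          # (2, 3, 4)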

17 Outline (Expressive Efficiency)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

18 Expressive Efficiency - Expressiveness
[Figure: approximation error (expressiveness) between $h^*_{\mathcal{D}}$ and $f^*_{\mathcal{D}}$]
$f^*_{\mathcal{D}}$ - ground truth ($\mathrm{argmin}_{f\in\mathcal{Y}^{\mathcal{X}}}L_{\mathcal{D}}(f)$)
$h^*_{\mathcal{D}}$ - optimal hypothesis ($\mathrm{argmin}_{h\in\mathcal{H}}L_{\mathcal{D}}(h)$)
$h^*_S$ - empirically optimal hypothesis ($\mathrm{argmin}_{h\in\mathcal{H}}L_S(h)$)
$h$ - returned hypothesis

19 Expressive Efficiency - Expressive Efficiency (informal)
Expressive efficiency compares network arch in terms of their ability to compactly represent func
Let:
$\mathcal{H}_A$ - space of func compactly representable by network arch $A$
$\mathcal{H}_B$ - space of func compactly representable by network arch $B$
$A$ is efficient w.r.t. $B$ if $\mathcal{H}_A$ is a strict superset of $\mathcal{H}_B$
$A$ is completely efficient w.r.t. $B$ if $\mathcal{H}_B$ has zero volume inside $\mathcal{H}_A$

20 Expressive Efficiency - Expressive Efficiency: Formal Definition
Network arch $A$ is efficient w.r.t. network arch $B$ if:
(1) any func realized by $B$ w/size $r_B$ can be realized by $A$ w/size $r_A\in O(r_B)$
(2) there exist func realized by $A$ w/size $r_A$ requiring $B$ to have size $r_B\in\Omega(f(r_A))$, where $f(\cdot)$ is super-linear
$A$ is completely efficient w.r.t. $B$ if (2) holds for all its func but a set of Lebesgue measure zero (in weight space)

21 Outline (Expressive Efficiency - Efficiency of Depth)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

22 Expressive Efficiency - Efficiency of Depth
Longstanding conjecture - efficiency of depth: deep ConvNets realize func that require shallow ConvNets to have exponential size (width)

23 Expressive Efficiency - Efficiency of Depth: Tensor Decomposition Viewpoint
$h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{d_i}(\mathbf{x}_i)$
Shallow network (input, rep, conv, global pool, dense) $\leftrightarrow$ CP decomposition: $\mathcal{A}=\sum_{\gamma=1}^{r_0}a^{1,1,y}_{\gamma}\,\mathbf{a}^{0,1,\gamma}\otimes\cdots\otimes\mathbf{a}^{0,N,\gamma}$
Deep network (input, rep, conv, pool, ..., conv, pool, dense) $\leftrightarrow$ HT decomposition:
$\phi^{1,j,\gamma}=\sum_{\alpha=1}^{r_0}a^{1,j,\gamma}_{\alpha}\,\mathbf{a}^{0,2j-1,\alpha}\otimes\mathbf{a}^{0,2j,\alpha}$
$\phi^{l,j,\gamma}=\sum_{\alpha=1}^{r_{l-1}}a^{l,j,\gamma}_{\alpha}\,\phi^{l-1,2j-1,\alpha}\otimes\phi^{l-1,2j,\alpha}$
$\mathcal{A}=\sum_{\alpha=1}^{r_{L-1}}a^{L,1,y}_{\alpha}\,\phi^{L-1,1,\alpha}\otimes\phi^{L-1,2,\alpha}$
Efficiency of depth $\leftrightarrow$ HT decomposition realizes tensors that require CP decomposition to have exponential rank ($r_0$ exponential in $N$)

24 Expressive Efficiency - Efficiency of Depth: HT vs. CP Analysis
Theorem: Besides a negligible (zero measure) set, all parameter settings for the HT decomposition lead to tensors w/CP-rank exponential in $N$
HT decomposition:
$\phi^{1,j,\gamma}=\sum_{\alpha=1}^{r_0}a^{1,j,\gamma}_{\alpha}\,\mathbf{a}^{0,2j-1,\alpha}\otimes\mathbf{a}^{0,2j,\alpha}$
$\phi^{l,j,\gamma}=\sum_{\alpha=1}^{r_{l-1}}a^{l,j,\gamma}_{\alpha}\,\phi^{l-1,2j-1,\alpha}\otimes\phi^{l-1,2j,\alpha}$
$\mathcal{A}=\sum_{\alpha=1}^{r_{L-1}}a^{L,1,y}_{\alpha}\,\phi^{L-1,1,\alpha}\otimes\phi^{L-1,2,\alpha}$
CP decomposition: $\mathcal{A}=\sum_{\gamma=1}^{r_0}a^{1,1,y}_{\gamma}\,\mathbf{a}^{0,1,\gamma}\otimes\cdots\otimes\mathbf{a}^{0,N,\gamma}$

25 Expressive Efficiency - Efficiency of Depth: HT vs. CP Analysis (cont'd)
Theorem proof sketch:
$\llbracket\mathcal{A}\rrbracket$ - matricization of $\mathcal{A}$ (arrangement of tensor as matrix)
$\odot$ - Kronecker product for matrices. Holds: $\mathrm{rank}(A\odot B)=\mathrm{rank}(A)\cdot\mathrm{rank}(B)$
Relation between tensor and Kronecker products: $\llbracket\mathcal{A}\otimes\mathcal{B}\rrbracket=\llbracket\mathcal{A}\rrbracket\odot\llbracket\mathcal{B}\rrbracket$
Implies: $\mathrm{rank}\,\llbracket\mathcal{A}\rrbracket\leq\text{CP-rank}(\mathcal{A})$
By induction over the levels of HT, $\mathrm{rank}\,\llbracket\mathcal{A}\rrbracket$ is exponential almost always:
Base: SVD has maximal rank almost always
Step: $\mathrm{rank}\,\llbracket\mathcal{A}\otimes\mathcal{B}\rrbracket=\mathrm{rank}(\llbracket\mathcal{A}\rrbracket\odot\llbracket\mathcal{B}\rrbracket)=\mathrm{rank}\,\llbracket\mathcal{A}\rrbracket\cdot\mathrm{rank}\,\llbracket\mathcal{B}\rrbracket$, and linear combination preserves rank almost always
(HT decomposition as on the previous slide)
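The rank identities driving the proof sketch are easy to probe numerically; a small numpy sketch (np.kron is the Kronecker product) checking that ranks multiply for randomly drawn low-rank matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 5))   # generically rank 2
    B = rng.normal(size=(3, 2)) @ rng.normal(size=(2, 7))   # generically rank 2

    rA = np.linalg.matrix_rank(A)
    rB = np.linalg.matrix_rank(B)
    rAB = np.linalg.matrix_rank(np.kron(A, B))               # Kronecker product of A and B

    print(rA, rB, rAB, rAB == rA * rB)                       # ranks multiply: 2 2 4 True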

26 Expressive Efficiency - Efficiency of Depth: HT vs. CP Analysis (cont'd)
Corollary: Randomizing the weights of a deep ConvAC by a cont distribution leads, w.p. 1, to func that require a shallow ConvAC to have an exponential # of channels
[Diagrams: deep network (rep, 1x1 conv, size-2 pooling, ..., dense output, $L=\log_2 N$ hidden layers of widths $r_0,\ldots,r_{L-1}$) and shallow network (rep, 1x1 conv, global pooling, dense output, width $r_0$)]
W/ConvACs, efficiency of depth holds completely!

27 Expressive Efficiency - Efficiency of Depth: HT vs. CP Analysis Generalizations
HT vs. CP analysis may be generalized in various ways, e.g.:
Comparison between arbitrary depths: penalty in resources is double-exponential w.r.t. # of layers cut off
[Plot: required # of parameters vs. # of layers, growing double-exponentially toward $O(r^{N/2})$ as layers are cut off from the optimal depth $L$]
Adaptation to other types of ConvNets: w/ReLU activation and max pooling, deep nets realize func requiring shallow nets to be exponentially large, but not almost always
Efficiency of depth is incomplete w/ReLU ConvNets!

28 Outline (Expressive Efficiency - Efficiency of Interconnectivity)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

29 Expressive Efficiency - Efficiency of Interconnectivity
Classic ConvNets have a feed-forward (chain) structure
Modern ConvNets employ elaborate connectivity schemes: Inception (GoogLeNet), ResNet, DenseNet
Question: Can such connectivities lead to expressive efficiency?

30 Expressive Efficiency - Efficiency of Interconnectivity: Dilated Convolutional Networks
We focus on dilated ConvNets (D-ConvNets) for sequence data:
1D ConvNets, no pooling, dilated (gapped) conv windows
$N:=2^L$ time points, $L$ hidden layers of widths $r_0,r_1,\ldots,r_{L-1}$ (output width $r_L$)
size-2 convs w/dilations $1,2,\ldots,2^{L-1}$ (layer $l$ uses dilation $2^{l}$)
Underlie Google's WaveNet & ByteNet - state of the art for audio & text!
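A minimal numpy sketch of the dilation pattern described above (size-2 causal filters with dilation doubling per layer, no pooling; a single channel and plain linear filters are used here purely for illustration):

    import numpy as np

    def dilated_conv1d(x, w, dilation):
        """Causal size-2 convolution: y[t] = w[0]*x[t - dilation] + w[1]*x[t]."""
        x_shift = np.concatenate([np.zeros(dilation), x[:-dilation]])   # zero-pad the past
        return w[0] * x_shift + w[1] * x

    rng = np.random.default_rng(0)
    T, L = 32, 4                        # sequence length, # of layers (covers 2^L = 16 time points)
    x = rng.normal(size=T)

    y = x
    for l in range(L):
        w = rng.normal(size=2)
        y = dilated_conv1d(y, w, dilation=2 ** l)   # dilations 1, 2, 4, 8

    print(y.shape)   # (32,): y[t] depends on x[t - 15 .. t], i.e. a window of 2^L time points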

31 Expressive Efficiency - Efficiency of Interconnectivity: Dilations and Mode Trees
W/D-ConvNets, the mode tree underlying the corresponding tensor decomposition determines the dilation scheme
[Figure: two mode trees over time points $\{1,\ldots,16\}$ - the balanced contiguous tree (leaves paired as $\{1,2\},\{3,4\},\ldots$) yields dilations 1, 2, 4, 8; a different mode tree (e.g. leaves paired as $\{1,3\},\{2,4\},\ldots$) yields a permuted dilation scheme]

32 Expressive Efficiency - Efficiency of Interconnectivity: Mixed Tensor Decompositions
Let: $T$, $T'$ - mode trees; $mix(T,T')$ - set of nodes present in both trees
[Figure: mode trees $T$ and $T'$ over $\{1,\ldots,16\}$ and the nodes they share, $mix(T,T')$]
A mixed tensor decomposition blends together $T$ and $T'$ by running their decompositions in parallel, exchanging tensors in each node of $mix(T,T')$

33 Expressive Efficiency - Efficiency of Interconnectivity: Mixed Dilated Convolutional Networks
A mixed tensor decomposition corresponds to a mixed D-ConvNet, formed by interconnecting the networks of $T$ and $T'$
[Figure: network of $T$ (dilations 1, 2, 4, 8) and network of $T'$ (permuted dilations) sharing input and output, with connections between them]

34 Expressive Efficiency - Efficiency of Interconnectivity: Mixture Expressive Efficiency
Theorem: Mixed tensor decomposition of $T$ and $T'$ can generate tensors that require the individual decompositions to grow quadratically (in terms of their ranks)
Corollary: A mixed D-ConvNet can realize func that require the individual networks to grow quadratically (in terms of layer widths)
Experiment: [Plot: TIMIT individual phoneme classification accuracy (train and validation sets) vs. the layer up to which connections are added]
Interconnectivity can lead to expressive efficiency!

35 Outline (Inductive Bias via Quantum Entanglement)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

36 Inductive Bias via Quantum Entanglement - Inductive Bias
Networks of reasonable size can only realize a fraction of all possible func
Expressive efficiency does not explain why this fraction is effective - why are these functions interesting?
[Figure: $\mathcal{H}_A\supset\mathcal{H}_B$ inside the space of all func]
To explain the effectiveness, one must consider the inductive bias:
Not all func are equally useful for a given task
Network only needs to represent useful func

37 Inductive Bias via Quantum Entanglement - Inductive Bias via Physics
Unlike expressive efficiency, inductive bias can't be studied via math alone; it requires reasoning about the nature of real-world tasks
Physics bears the potential to bridge this gap!

38 Inductive Bias via Quantum Entanglement - Modeling Interactions
ConvNets realize func over many local elements (e.g. pixels, audio samples)
Key property of such func: the interactions modeled between different sets of elements
[Figure: for one image, modeling strong interaction between the yellow and blue pixels (partition A vs. partition B) is important; for another it is less important]
Questions: What kind of interactions do ConvNets model? How do these depend on network structure?

39 Inductive Bias via Quantum Entanglement - Quantum Entanglement
In quantum physics, the state of a particle is represented as a vec in a Hilbert space: $|\text{particle state}\rangle=\sum_{d=1}^{M}a_d\,|\psi_d\rangle\in\mathcal{H}$ ($a_d$ - coeff, $|\psi_d\rangle$ - basis)
A system of $N$ particles is represented as a vec in the tensor product space: $|\text{system state}\rangle=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,|\psi_{d_1}\rangle\otimes\cdots\otimes|\psi_{d_N}\rangle\in\mathcal{H}\otimes\cdots\otimes\mathcal{H}$ ($N$ times), w/$\mathcal{A}$ the coeff tensor
Quantum entanglement measures quantify the interactions that a system state models between sets of particles

40 Inductive Bias via Quantum Entanglement - Quantum Entanglement (cont'd)
$|\text{system state}\rangle=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,|\psi_{d_1}\rangle\otimes\cdots\otimes|\psi_{d_N}\rangle$
Consider a partition of the $N$ particles into sets $I$ and $I^c$
$\llbracket\mathcal{A}\rrbracket_I$ - matricization of the coeff tensor $\mathcal{A}$ w.r.t. $I$: arrangement of $\mathcal{A}$ as a matrix whose rows/cols correspond to the modes indexed by $I$/$I^c$

41 Inductive Bias via Quantum Entanglement - Quantum Entanglement (cont'd)
$|\text{system state}\rangle=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,|\psi_{d_1}\rangle\otimes\cdots\otimes|\psi_{d_N}\rangle$; $\llbracket\mathcal{A}\rrbracket_I$ - matricization of $\mathcal{A}$ w.r.t. $I$
Let $\sigma=(\sigma_1,\sigma_2,\ldots,\sigma_R)$ be the singular vals of $\llbracket\mathcal{A}\rrbracket_I$
Entanglement measures between the particles of $I$ and of $I^c$ are based on $\sigma$:
Entanglement Entropy: entropy of $(\sigma_1^2,\ldots,\sigma_R^2)/\|\sigma\|_2^2$
Geometric Measure: $1-\sigma_1^2/\|\sigma\|_2^2$
Schmidt Number: $\|\sigma\|_0=\mathrm{rank}\,\llbracket\mathcal{A}\rrbracket_I$
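These measures are straightforward to compute from a coefficient tensor; a self-contained numpy sketch (toy order-4 tensor, arbitrary partition; helper names are illustrative, not from the papers):

    import numpy as np

    def matricize(A, I):
        """[[A]]_I: rows indexed by the modes in I, columns by the complementary modes."""
        I = list(I)
        Ic = [m for m in range(A.ndim) if m not in I]
        B = np.transpose(A, I + Ic)
        rows = int(np.prod([A.shape[m] for m in I]))
        return B.reshape(rows, -1)

    def entanglement_measures(A, I, tol=1e-10):
        s = np.linalg.svd(matricize(A, I), compute_uv=False)
        p = s ** 2 / np.sum(s ** 2)                          # normalized squared singular values
        entropy = -np.sum(p[p > tol] * np.log(p[p > tol]))   # entanglement entropy
        geometric = 1.0 - p.max()                            # geometric measure
        schmidt = int(np.sum(s > tol * s.max()))             # Schmidt number = rank of [[A]]_I
        return entropy, geometric, schmidt

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 3, 3, 3))                        # order-4 coefficient tensor
    print(entanglement_measures(A, I=[0, 2]))                # partition: modes 0 and 2 vs. modes 1 and 3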

42 Inductive Bias via Quantum Entanglement - Entanglement with ConvACs
Structural equivalence:
quantum (many-body) system state: $|\text{system state}\rangle=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,|\psi_{d_1}\rangle\otimes\cdots\otimes|\psi_{d_N}\rangle$ (coeff tensor $\mathcal{A}$)
func realized by ConvAC: $h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}_{d_1\ldots d_N}\,f_{d_1}(\mathbf{x}_1)\cdots f_{d_N}(\mathbf{x}_N)$ (coeff tensor $\mathcal{A}$)
state of many particles $\leftrightarrow$ func over many pixels
We may quantify the interactions a ConvAC models between input sets by applying entanglement measures to its coeff tensor!

43 Inductive Bias via Quantum Entanglement - Entanglement with ConvACs: Interpretation
When func $h$ realized by a ConvAC is separable w.r.t. input sets $I$/$I^c$: $\exists\,g,g'$ s.t. $h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=g\big((\mathbf{x}_i)_{i\in I}\big)\cdot g'\big((\mathbf{x}_i)_{i\in I^c}\big)$, it does not model any interaction between the input sets
In a stat setting, this corresponds to independence of $(\mathbf{x}_i)_{i\in I}$ and $(\mathbf{x}_i)_{i\in I^c}$
Entanglement measures on the coeff tensor of $h$ quantify dist from separability: $\mathcal{A}$ has high (low) entanglement w.r.t. $I$/$I^c$ $\Rightarrow$ $h$ is far from (close to) separability w.r.t. $I$/$I^c$
Choice of entanglement measure determines the distance metric
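As a concrete sanity check of this interpretation (toy sizes, illustrative construction): a coefficient tensor that factors across the partition I/I^c has Schmidt number 1 and essentially zero entanglement entropy, matching separability of h.

    import numpy as np

    M = 3
    rng = np.random.default_rng(0)
    B = rng.normal(size=(M, M))            # coefficient tensor of g  (modes in I, here I = {1, 2})
    C = rng.normal(size=(M, M))            # coefficient tensor of g' (modes in I^c = {3, 4})
    A = np.multiply.outer(B, C)            # separable coefficient tensor, shape (M, M, M, M)

    mat = A.reshape(M * M, M * M)          # matricization w.r.t. I = {1, 2} (the leading modes)
    s = np.linalg.svd(mat, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)

    print(np.linalg.matrix_rank(mat))                       # 1: Schmidt number of a separable func
    print(-np.sum(p[p > 1e-12] * np.log(p[p > 1e-12])))     # entanglement entropy ~ 0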

44 Inductive Bias via Quantum Entanglement - Quantum Tensor Networks
Coeff tensors of quantum many-body states are simulated via Tensor Networks (TNs): graphs in which vertices $\leftrightarrow$ tensors and edges $\leftrightarrow$ modes
[Figure: a node w/0, 1, 2 or 3 edges stands for a scalar, vector, matrix or order-3 tensor]
An edge (mode) connecting two vertices (tensors) represents contraction: inner-product between vectors, matrix multiplication; edges are weighted by mode dimensions
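In code, contracting a tensor network is just summing over the shared (edge) indices; a minimal einsum sketch of a three-node network (vector - matrix - vector), where each edge is a contracted mode:

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.normal(size=4)          # vector node: one edge (mode) of dimension 4
    W = rng.normal(size=(4, 5))     # matrix node: two edges, dimensions 4 and 5
    v = rng.normal(size=5)          # vector node: one edge of dimension 5

    # Contracting edge 'a' (between u and W) and edge 'b' (between W and v) yields a scalar
    value = np.einsum('a,ab,b->', u, W, v)
    assert np.isclose(value, u @ W @ v)   # the same contraction written as matrix/vector products
    print(value)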

45 Inductive Bias via Quantum Entanglement - ConvACs as Tensor Networks
The coeff tensor of a ConvAC may be represented via a TN:
[Figure: tree-shaped TN - open nodes correspond to ConvAC inputs (e.g. pixels); the tree structure corresponds to ConvAC pooling geometry; edge weights $M,r_0,r_1,\ldots,r_{L-1}$ correspond to ConvAC layer widths]

46 Inductive Bias via Quantum Entanglement - Entanglement via Minimal Cuts
Theorem ("Quantum Max Flow / Min Cut"): The maximal Schmidt entanglement a ConvAC models between input sets $I$/$I^c$ is equal to the min cut in the respective TN separating the nodes of $I$/$I^c$
ConvAC entanglement between input sets = TN min cut separating the respective node sets

47 Inductive Bias via Quantum Entanglement - Controlling Entanglement (Interactions)
Corollary: Controlling the entanglement (interactions) modeled by a ConvAC is equivalent to controlling min cuts in the respective TN
Two sources of control: layer widths (edge weights) and pooling geometry (tree structure)
We may analyze the effect of ConvAC arch on the interactions (entanglement) it can model!

48 Inductive Bias via Quantum Entanglement - Controlling Interactions: Layer Widths
Claim: Deep (early) layer widths are important for long (short)-range interactions
Experiment: [figure omitted]

49 Inductive Bias via Quantum Entanglement - Controlling Interactions: Pooling Geometry
Claim: Input elements pooled together early have stronger interaction
Experiment: square pooling (local interactions) vs. mirror pooling (interactions between reflections), evaluated on a closedness task and a symmetry task [figure: example shapes w/low/high closedness and symmetry]

50 Outline (Conclusion)
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

51 Conclusion
Three pillars of statistical learning theory: Expressiveness, Generalization, Optimization
Well developed theory for classical ML; limited understanding for Deep Learning
We derive the equivalence: ConvNets $\leftrightarrow$ hierarchical tensor decompositions, and use it to analyze expressive efficiency
To understand expressiveness, efficiency is not enough; one must consider the inductive bias:
This cannot be done via math alone - Physics bears the potential to bridge this gap!
We use Quantum Entanglement and Tensor Networks to study ConvNets' ability to model interactions

52 Conclusion - Future Possibilities
Further studying the inductive bias of ConvNets via Quantum Physics:
Understanding overlapping operations via MERA disentanglers
Characterizing correlations (interactions) in natural data-sets
Transfer of computational tools between Deep Learning and Physics:
Training ConvNets w/Tensor Network algorithms (e.g. DMRG)
Quantum Computation (wave function reconstruction) w/SGD
AND MUCH MORE...

53 Conclusion - Recent Development (Levine et al): From ConvNets to Recurrent Neural Networks (RNNs)
RNNs: most successful Deep Learning arch for sequence processing
Start-End Entanglement quantifies the long-term memory of a network
Authors analyze this via TNs (MPS and generalizations), and show that Start-End Entanglement increases exponentially w/depth!

54 Outline
1 Perspective: Understanding Deep Learning
2 Convolutional Networks as Hierarchical Tensor Decompositions
3 Expressive Efficiency
Efficiency of Depth
Efficiency of Interconnectivity
4 Inductive Bias via Quantum Entanglement
5 Conclusion

55 Thank You


More information

Asaf Bar Zvi Adi Hayat. Semantic Segmentation

Asaf Bar Zvi Adi Hayat. Semantic Segmentation Asaf Bar Zvi Adi Hayat Semantic Segmentation Today s Topics Fully Convolutional Networks (FCN) (CVPR 2015) Conditional Random Fields as Recurrent Neural Networks (ICCV 2015) Gaussian Conditional random

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

Multimodal context analysis and prediction

Multimodal context analysis and prediction Multimodal context analysis and prediction Valeria Tomaselli (valeria.tomaselli@st.com) Sebastiano Battiato Giovanni Maria Farinella Tiziana Rotondo (PhD student) Outline 2 Context analysis vs prediction

More information

Loss Functions, Decision Theory, and Linear Models

Loss Functions, Decision Theory, and Linear Models Loss Functions, Decision Theory, and Linear Models CMSC 678 UMBC January 31 st, 2018 Some slides adapted from Hamed Pirsiavash Logistics Recap Piazza (ask & answer questions): https://piazza.com/umbc/spring2018/cmsc678

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Multiclass Logistic Regression. Multilayer Perceptrons (MLPs)

TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Multiclass Logistic Regression. Multilayer Perceptrons (MLPs) TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Multiclass Logistic Regression Multilayer Perceptrons (MLPs) Stochastic Gradient Descent (SGD) 1 Multiclass Classification We consider

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information